I have configured a scheduler as explained in the manual
(http://www.cs.wisc.edu/condor/manual/v7.4/2_9Parallel_Applications.html).
I can submit the following job to the remote scheduler with "condor_submit -r <mysched> test.submit", where test.submit contains:
universe = parallel
executable = /bin/env
machine_count = 4
output = output.$(Node).txt
queue
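(With machine_count = 4, $(Node) should expand to 0-3, so I expect one output file per node, i.e.:

    output.0.txt  output.1.txt  output.2.txt  output.3.txt
)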
The job appears in the remote scheduler's queue, is matched (machines show up in the RemoteHosts attribute of the job ClassAd), starts to run, and then immediately goes on hold.
According to condor_q -analyze:
> ...
> 026.000: Request is held.
>
> Hold reason: Error from slot2@xxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to open '/var/spool/condor/spool/cluster26.proc0.subproc0/output.0.txt' as standard output: No such file or directory (errno 2)
On rm1li025120, /var/spool/condor/spool/ is empty.
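(That's how I checked it, directly on that host; hostname and spool path are as in our local setup:

    $ ssh rm1li025120 'ls -A /var/spool/condor/spool/'   # prints nothing, the directory is empty
)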
Here's the relevant part of the startd log from that host:
> 02/08 13:57:07 slot2: Got activate_claim request from shadow (<192.168.100.85:34979>)
> 02/08 13:57:07 slot2: Remote job ID is 26.0
> 02/08 13:57:07 slot2: Got universe "PARALLEL" (11) from request classad
> 02/08 13:57:07 slot2: State change: claim-activation protocol successful
> 02/08 13:57:07 slot2: Changing activity: Idle -> Busy
> 02/08 13:57:07 slot2: Called deactivate_claim_forcibly()
> 02/08 13:57:07 Starter pid 10463 exited with status 0
> 02/08 13:57:07 slot2: State change: starter exited
> 02/08 13:57:07 slot2: Changing activity: Busy -> Idle
> 02/08 13:57:07 condor_write(): Socket closed when trying to write 56 bytes to <192.168.100.85:38432>, fd is 6
> 02/08 13:57:07 Buf::write(): condor_write() failed
> 02/08 13:57:07 slot2: Called deactivate_claim()
And the starter log:
> 02/08 13:57:07 ******************************************************
> 02/08 13:57:07 ** condor_starter (CONDOR_STARTER) STARTING UP
> 02/08 13:57:07 ** /dfs1/net/studio/noarch/free/condor/condor-7.4.1/sbin/condor_starter
> 02/08 13:57:07 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> 02/08 13:57:07 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> 02/08 13:57:07 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
> 02/08 13:57:07 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
> 02/08 13:57:07 ** PID = 10463
> 02/08 13:57:07 ** Log last touched 2/8 13:55:48
> 02/08 13:57:07 ******************************************************
> 02/08 13:57:07 Using config source: /home/condor/condor_config
> 02/08 13:57:07 Using local config sources:
> 02/08 13:57:07 /home/condor/config/condor_config.local.rm1li025120
> 02/08 13:57:07 DaemonCore: Command Socket at <192.168.25.120:36203>
> 02/08 13:57:07 Done setting resource limits
> 02/08 13:57:07 Communicating with shadow <192.168.100.85:56604>
> 02/08 13:57:07 Submitting machine is "netrender.lumierevfx.com"
> 02/08 13:57:07 setting the orig job name in starter
> 02/08 13:57:07 setting the orig job iwd in starter
> 02/08 13:57:07 Job has WantIOProxy=true
> 02/08 13:57:07 Initialized IO Proxy.
> 02/08 13:57:07 Job 26.0 set to execute immediately
> 02/08 13:57:07 Starting a PARALLEL universe job with ID: 26.0
> 02/08 13:57:07 IWD: /var/spool/condor/spool/cluster26.proc0.subproc0
> 02/08 13:57:07 Failed to open '/var/spool/condor/spool/cluster26.proc0.subproc0/output.0.txt' as standard output: No such file or directory (errno 2)
> 02/08 13:57:07 Failed to open some/all of the std files...
> 02/08 13:57:07 Aborting OsProc::StartJob.
> 02/08 13:57:07 Failed to start job, exiting
> 02/08 13:57:07 ShutdownFast all jobs.
> 02/08 13:57:07 **** condor_starter (condor_STARTER) pid 10463 EXITING WITH STATUS 0
If I comment out the "output = output.$(Node).txt" line, the job still ends up held, but with a different error:
> Hold reason: Error from slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to execute '/var/spool/condor/spool/cluster27.proc0.subproc0/env': No such file or directory
So it isn't just the output file: the executable itself is also missing from the job's spool directory, even though the starter sets the IWD there.
Adding "copy_to_spool = False" to the submit description makes no
difference.
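The only other thing I can think of is explicit file-transfer directives. I haven't found anything in the manual saying a remote submit needs them, so this is pure guesswork on my part, but this is the variant I'd try next (should_transfer_files and when_to_transfer_output are the standard submit commands; whether they matter for a spooled parallel job is my assumption):

    universe = parallel
    executable = /bin/env
    machine_count = 4
    output = output.$(Node).txt
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue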
As far as I can tell from the documentation, my submit description
should work... any ideas?
thanks