I have configured a scheduler as explained in the manual
(http://www.cs.wisc.edu/condor/manual/v7.4/2_9Parallel_Applications.html).
I can submit the following job to the remote scheduler with "condor_submit -r <mysched> test.submit", where test.submit contains:
universe = parallel
executable = /bin/env
machine_count = 4
output = output.$(Node).txt
queue
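(With machine_count = 4, $(Node) should expand to 0-3, so I expect one output file per node, i.e.:

    output.0.txt  output.1.txt  output.2.txt  output.3.txt
)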
The job appears in the remote scheduler's queue, is matched (machines show up in the RemoteHosts attribute of the job ClassAd), starts to run, and then immediately goes on hold.
According to condor_q -analyze:
> ...
> 026.000: Request is held.
>
> Hold reason: Error from slot2@xxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to open '/var/spool/condor/spool/cluster26.proc0.subproc0/output.0.txt' as standard output: No such file or directory (errno 2)
On rm1li025120, /var/spool/condor/spool/ is empty.
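(That's how I checked it, directly on that host; hostname and spool path are as in our local setup:

    $ ssh rm1li025120 'ls -A /var/spool/condor/spool/'   # prints nothing, the directory is empty
)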
Here's the relevant part of the startd log from that host:
> 02/08 13:57:07 slot2: Got activate_claim request from shadow (<192.168.100.85:34979>)
> 02/08 13:57:07 slot2: Remote job ID is 26.0
> 02/08 13:57:07 slot2: Got universe "PARALLEL" (11) from request classad
> 02/08 13:57:07 slot2: State change: claim-activation protocol successful
> 02/08 13:57:07 slot2: Changing activity: Idle -> Busy
> 02/08 13:57:07 slot2: Called deactivate_claim_forcibly()
> 02/08 13:57:07 Starter pid 10463 exited with status 0
> 02/08 13:57:07 slot2: State change: starter exited
> 02/08 13:57:07 slot2: Changing activity: Busy -> Idle
> 02/08 13:57:07 condor_write(): Socket closed when trying to write 56 bytes to <192.168.100.85:38432>, fd is 6
> 02/08 13:57:07 Buf::write(): condor_write() failed
> 02/08 13:57:07 slot2: Called deactivate_claim()
And the starter log:
> 02/08 13:57:07 ******************************************************
> 02/08 13:57:07 ** condor_starter (CONDOR_STARTER) STARTING UP
> 02/08 13:57:07 ** /dfs1/net/studio/noarch/free/condor/condor-7.4.1/sbin/condor_starter
> 02/08 13:57:07 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> 02/08 13:57:07 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> 02/08 13:57:07 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
> 02/08 13:57:07 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
> 02/08 13:57:07 ** PID = 10463
> 02/08 13:57:07 ** Log last touched 2/8 13:55:48
> 02/08 13:57:07 ******************************************************
> 02/08 13:57:07 Using config source: /home/condor/condor_config
> 02/08 13:57:07 Using local config sources:
> 02/08 13:57:07 /home/condor/config/condor_config.local.rm1li025120
> 02/08 13:57:07 DaemonCore: Command Socket at <192.168.25.120:36203>
> 02/08 13:57:07 Done setting resource limits
> 02/08 13:57:07 Communicating with shadow <192.168.100.85:56604>
> 02/08 13:57:07 Submitting machine is "netrender.lumierevfx.com"
> 02/08 13:57:07 setting the orig job name in starter
> 02/08 13:57:07 setting the orig job iwd in starter
> 02/08 13:57:07 Job has WantIOProxy=true
> 02/08 13:57:07 Initialized IO Proxy.
> 02/08 13:57:07 Job 26.0 set to execute immediately
> 02/08 13:57:07 Starting a PARALLEL universe job with ID: 26.0
> 02/08 13:57:07 IWD: /var/spool/condor/spool/cluster26.proc0.subproc0
> 02/08 13:57:07 Failed to open '/var/spool/condor/spool/cluster26.proc0.subproc0/output.0.txt' as standard output: No such file or directory (errno 2)
> 02/08 13:57:07 Failed to open some/all of the std files...
> 02/08 13:57:07 Aborting OsProc::StartJob.
> 02/08 13:57:07 Failed to start job, exiting
> 02/08 13:57:07 ShutdownFast all jobs.
> 02/08 13:57:07 **** condor_starter (condor_STARTER) pid 10463 EXITING WITH STATUS 0
If I comment out the "output = output.$(Node).txt" line, the job still ends up held, but with a different error:
> Hold reason: Error from slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to execute '/var/spool/condor/spool/cluster27.proc0.subproc0/env': No such file or directory
So it isn't just the output file: the executable itself is also missing from the job's spool directory, even though the starter sets the IWD there.
Adding "copy_to_spool = False" to the submit description makes no
difference.
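The only other thing I can think of is explicit file-transfer directives. I haven't found anything in the manual saying a remote submit needs them, so this is pure guesswork on my part, but this is the variant I'd try next (should_transfer_files and when_to_transfer_output are the standard submit commands; whether they matter for a spooled parallel job is my assumption):

    universe = parallel
    executable = /bin/env
    machine_count = 4
    output = output.$(Node).txt
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue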
As far as I can tell from the documentation, my submit description
should work... any ideas?
thanks