[Condor-users] condor problem : shadow unable to transmit output file
- Date: Tue, 5 Jun 2007 15:08:17 +0200
- From: "USTV_condor_Task_Force USTV_condor_Task_Force" <ustv.condor.task.force@xxxxxxxxx>
- Subject: [Condor-users] condor problem : shadow unable to transmit output file
Hello, we are building a test grid to harness the processing power of all our lab computers,
and we have run into a problem we are unable to solve.
Our pool currently consists of:
licinfo10.uni LINUX INTEL Owner Idle 0.000 502 0+00:10:02 - ubuntu edgy eft
licinfo11.uni LINUX INTEL Owner Idle 0.000 502 0+00:10:01 - ubuntu edgy eft
vm1@moua      LINUX INTEL Owner Idle 0.060 504 0+00:08:24 - RH FC 6
vm2@moua      LINUX INTEL Owner Idle 0.000 504 0+00:08:25
vm1@nocte     LINUX INTEL Owner Idle 0.270 506 0+00:10:09 - debian sid
vm2@nocte     LINUX INTEL Owner Idle 0.000 506 0+00:10:10
vm1@nous      LINUX INTEL Owner Idle 0.070 505 0+00:10:09 - ubuntu feisty fawn
vm2@nous      LINUX INTEL Owner Idle 0.000 505 0+00:10:10
I tried a test submit file I had from this mailing list:
executable = /bin/hostname
universe = vanilla
TransferExecutable = true
transfer_output_files= true
output=results.output.$(Process)
error=results.error.$(Process)
log=results.log.$(Process)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
queue 5
Our problem is that all our jobs go quickly from the idle state to the held state, and every job log says:
000 (001.003.000) 06/05 14:07:02 Job submitted from host: <10.9.185.29:38947>
...
001 (001.003.000) 06/05 14:17:11 Job executing on host: <10.9.185.211:42641>
...
007 (001.003.000) 06/05 14:17:11 Shadow exception!
Error from starter on licinfo11.xxx: STARTER at 10.9.185.211 failed to send file(s) to <10.9.185.29:60059>: error reading from /condor/licinfo11/execute/dir_9027/true: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.9.185.211:53966>
0 - Run Bytes Sent By Job
8572 - Run Bytes Received By Job
...
012 (001.003.000) 06/05 14:17:11 Job was held.
Error from starter on licinfo11.xxx: STARTER at 10.9.185.211 failed to send file(s) to <10.9.185.29:60059>: error reading from /condor/licinfo11/execute/dir_9027/true: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.9.185.211:53966>
Code 13 Subcode 2
...
In our configuration I currently have
LOCAL_DIR = /condor/$(HOSTNAME)
where I previously had
#LOCAL_DIR = $(RELEASE_DIR)/hosts/$(HOSTNAME)
I changed it so that the local dir is local to each node, because I saw on the mailing list that remote local dirs can cause problems if the machines are not correctly time-synchronised (our /home/condor is NFS-shared among all our nodes).
Additional info: all our UIDs are shared among our hosts.
Apparently Condor does not manage to create the directories in $(LOCAL_DIR)/execute (which I chmod'ed to be world-writable) in order to send the files back.
Hope somebody can help :)
The USTV Condor Task Force