Our problem is that all our jobs go quickly from the idle state to the held state, with every job log reporting:
000 (001.003.000) 06/05 14:07:02 Job submitted from host: <10.9.185.29:38947>
...
001 (001.003.000) 06/05 14:17:11 Job executing on host: <10.9.185.211:42641>
...
007 (001.003.000) 06/05 14:17:11 Shadow exception!
    Error from starter on licinfo11.xxx: STARTER at 10.9.185.211 failed to send file(s) to <10.9.185.29:60059>: error reading from /condor/licinfo11/execute/dir_9027/true: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.9.185.211:53966>
    0  -  Run Bytes Sent By Job
    8572  -  Run Bytes Received By Job
...
012 (001.003.000) 06/05 14:17:11 Job was held.
    Error from starter on licinfo11.xxx: STARTER at 10.9.185.211 failed to send file(s) to <10.9.185.29:60059>: error reading from /condor/licinfo11/execute/dir_9027/true: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.9.185.211:53966>
    Code 13 Subcode 2
...
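In case it helps anyone compare logs across many jobs, here is a minimal sketch of how the hold events could be pulled out of a job log like the one above. It assumes only the layout visible in this log: a three-digit event code, the job ID in parentheses, a timestamp, then detail lines, with each event closed by a line containing only "...". The function names parse_events and hold_reasons are hypothetical, not part of Condor.

```python
import re

# Event header as seen in the log above:
# "NNN (cluster.proc.subproc) MM/DD HH:MM:SS message"
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+ \S+) (.*)$")


def parse_events(text):
    """Split job-log text into events; a line holding only '...' ends an event."""
    events = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "...":
            if current is not None:
                events.append(current)
            current = None
            continue
        match = EVENT_HEADER.match(stripped)
        if match:
            # A new header without a preceding '...' still closes the old event.
            if current is not None:
                events.append(current)
            current = {
                "code": match.group(1),
                "cluster": int(match.group(2)),
                "proc": int(match.group(3)),
                "time": match.group(5),
                "message": match.group(6),
                "details": [],
            }
        elif current is not None and stripped:
            current["details"].append(stripped)
    if current is not None:
        events.append(current)
    return events


def hold_reasons(events):
    """Return (cluster, proc, detail lines) for every 012 'Job was held' event."""
    return [
        (e["cluster"], e["proc"], e["details"])
        for e in events
        if e["code"] == "012"
    ]
```

Reading the log file and passing its contents to parse_events, then to hold_reasons, gives one entry per held job, which makes it easy to see whether every job fails on the same path.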
I have LOCAL_DIR = /condor/$(HOSTNAME); previously I had LOCAL_DIR = $(RELEASE_DIR)/hosts/$(HOSTNAME), which is now commented out.
I changed it so that the local dir is local to each node, as I saw on the mailing list that remote local dirs can cause problems if the machines aren't correctly time-synchronised (our /home/condor is NFS-shared among all our nodes).
Additional info: all our UIDs are shared among our hosts.
Apparently Condor doesn't manage to create the dirs in $(LOCAL_DIR)/execute (which I chmod'ed to be world-writable) in order to send them back.
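To double-check the permissions on each node, something like the following sketch could be run there. It only reports what the operating system says about a directory; the helper name describe_dir is hypothetical, the example path is taken from the log above, and the result reflects the user running the script, not necessarily the account the Condor daemons run under.

```python
import os
import stat


def describe_dir(path):
    """Report whether `path` exists and how its permissions are set."""
    info = {"path": path, "exists": os.path.isdir(path)}
    if not info["exists"]:
        return info
    st = os.stat(path)
    info["owner_uid"] = st.st_uid
    info["mode"] = stat.filemode(st.st_mode)            # e.g. 'drwxrwxrwx'
    info["world_writable"] = bool(st.st_mode & stat.S_IWOTH)
    info["sticky"] = bool(st.st_mode & stat.S_ISVTX)
    # Write plus search permission is needed to create entries in a directory.
    info["writable_by_me"] = os.access(path, os.W_OK | os.X_OK)
    return info


if __name__ == "__main__":
    # Path taken from the error message in the job log; adjust per node.
    print(describe_dir("/condor/licinfo11/execute"))
```

If "exists" comes back False on an execute node, the directory is missing there rather than merely unwritable, which would be a different problem from the one I assumed.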