Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] condor problem : shadow unable to transmit output file

Date: Tue, 5 Jun 2007 15:08:17 +0200
From: "USTV_condor_Task_Force USTV_condor_Task_Force" <ustv.condor.task.force@xxxxxxxxx>
Subject: [Condor-users] condor problem : shadow unable to transmit output file

Hello, we are building a test grid to harness the processing power of all our lab computers,
and we have run into a problem we are unable to solve.

Our pool currently consists of:
licinfo10.uni LINUX       INTEL Owner      Idle       0.000   502 0+00:10:02 - ubuntu edgy eft
licinfo11.uni LINUX       INTEL Owner      Idle       0.000   502 0+00:10:01 - ubuntu edgy eft
vm1@moua      LINUX       INTEL Owner      Idle       0.060   504 0+00:08:24 - RH FC 6
vm2@moua      LINUX       INTEL Owner      Idle       0.000   504 0+00:08:25
vm1@nocte     LINUX       INTEL Owner      Idle       0.270   506 0+00:10:09 - debian sid
vm2@nocte     LINUX       INTEL Owner      Idle       0.000   506 0+00:10:10
vm1@nous      LINUX       INTEL Owner      Idle       0.070   505 0+00:10:09 - ubuntu feisty fawn
vm2@nous      LINUX       INTEL Owner      Idle       0.000   505 0+00:10:10

I tried a test submit file that I found on this mailing list:

executable = /bin/hostname
universe = vanilla
TransferExecutable = true
transfer_output_files= true
output=results.output.$(Process)
error=results.error.$(Process)
log=results.log.$(Process)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
queue 5

Our problem is that all our jobs go quickly from the idle state to the held state, and every job log says:

000 (001.003.000) 06/05 14:07:02 Job submitted from host: < 10.9.185.29:38947>
...
001 (001.003.000) 06/05 14:17:11 Job executing on host: <10.9.185.211:42641>
...
007 (001.003.000) 06/05 14:17:11 Shadow exception!
        Error from starter on licinfo11.xxx: STARTER at 10.9.185.211 failed to send file(s) to <10.9.185.29:60059>: error reading from /condor/licinfo11/execute/dir_9027/true: (errno 2) No such file or directory; SHADOW failed to receive file(s) from < 10.9.185.211:53966>
        0 - Run Bytes Sent By Job
        8572 - Run Bytes Received By Job
...
012 (001.003.000) 06/05 14:17:11 Job was held.
        Error from starter on licinfo11.xxx: STARTER at 10.9.185.211 failed to send file(s) to <10.9.185.29:60059>: error reading from /condor/licinfo11/execute/dir_9027/true: (errno 2) No such file or directory; SHADOW failed to receive file(s) from < 10.9.185.211:53966>
        Code 13 Subcode 2
...
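
One detail in the log above may be worth checking: the path the starter fails to read ends in "true", which is the value given to transfer_output_files in the submit file. That command expects a comma-separated list of file names rather than a boolean, so Condor may simply be looking for an output file literally named "true". A minimal sketch of the same test without that line, assuming the job writes nothing except its stdout and stderr (so no extra output files need to be listed):

```
# Sketch only: the same test job with the transfer_output_files line dropped.
# With should_transfer_files = YES and no explicit list, Condor transfers back
# whatever files the job created or modified in its scratch directory.
executable = /bin/hostname
universe = vanilla
transfer_executable = true
output = results.output.$(Process)
error = results.error.$(Process)
log = results.log.$(Process)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
queue 5
```

If a job did produce extra files, they would be named explicitly, for example transfer_output_files = out1.dat, out2.dat (hypothetical file names).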

I currently have
LOCAL_DIR        = /condor/$(HOSTNAME)
and previously had
#LOCAL_DIR        = $(RELEASE_DIR)/hosts/$(HOSTNAME)

I changed it so that the local directory is local to each node, because I saw on the mailing list that remote local directories can cause problems if the machines are not correctly time-synchronised (our /home/condor is NFS-shared among all our nodes).

Additional info: all our UIDs are shared among our hosts.

Apparently Condor does not manage to create the directories in $(LOCAL_DIR)/execute (which I chmod'ed to be world-writable) in order to send the files back.

Hope somebody can help :)

The USTV Condor Task Force

Follow-Ups:
- Re: [Condor-users] condor problem : shadow unable to transmit output file
  - From: o c
