
[Condor-users] Fwd: Shadow exception with LamMpi jobs



I mistakenly did not send this to the list. Here's the forward.


---------- Forwarded message ----------
From: Pasquale Tricarico <tricaric@xxxxxxx>
Date: Fri, Mar 14, 2008 at 1:03 PM
Subject: Re: [Condor-users] Shadow exception with LamMpi jobs
To: Greg Thain <gthain@xxxxxxxxxxx>


Here's the StarterLog.slot1 on one failing node:

 3/14 12:19:40 ******************************************************
 3/14 12:19:40 ** condor_starter (CONDOR_STARTER) STARTING UP
 3/14 12:19:40 ** /usr/local/condor/sbin/condor_starter
 3/14 12:19:40 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
 3/14 12:19:40 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
 3/14 12:19:40 ** PID = 1979
 3/14 12:19:40 ** Log last touched 3/14 11:13:55
 3/14 12:19:40 ******************************************************
 3/14 12:19:40 Using config source: /home/condor/condor_config
 3/14 12:19:40 Using local config sources:
 3/14 12:19:40    /home/condor/condor_config.local
 3/14 12:19:40 DaemonCore: Command Socket at <10.7.7.20:45987>
 3/14 12:19:40 Done setting resource limits
 3/14 12:19:41 Communicating with shadow <10.7.7.250:60766>
 3/14 12:19:41 Submitting machine is "head.psi.edu"
 3/14 12:19:41 setting the orig job name in starter
 3/14 12:19:41 setting the orig job iwd in starter
 3/14 12:19:41 Job has WantIOProxy=true
 3/14 12:19:41 Initialized IO Proxy.
 3/14 12:24:41 condor_read(): timeout reading 5 bytes from <10.7.7.250:60766>.
 3/14 12:24:41 IO: Failed to read packet header
 3/14 12:29:42 condor_read(): timeout reading 5 bytes from <10.7.7.250:60766>.
 3/14 12:29:42 IO: Failed to read packet header
 3/14 12:29:42 File transfer failed (status=0).
 3/14 12:29:42 ERROR "Failed to transfer files" at line 1810 in file jic_shadow.C
 3/14 12:29:42 ShutdownFast all jobs.
 3/14 12:30:44 ******************************************************
 3/14 12:30:44 ** condor_starter (CONDOR_STARTER) STARTING UP
 3/14 12:30:44 ** /usr/local/condor/sbin/condor_starter
 3/14 12:30:44 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
 3/14 12:30:44 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
 3/14 12:30:44 ** PID = 1991
 3/14 12:30:44 ** Log last touched 3/14 12:29:42
 3/14 12:30:44 ******************************************************
 3/14 12:30:44 Using config source: /home/condor/condor_config
 3/14 12:30:44 Using local config sources:
 3/14 12:30:44    /home/condor/condor_config.local
 3/14 12:30:44 DaemonCore: Command Socket at <10.7.7.20:52507>
 3/14 12:30:44 Done setting resource limits
 3/14 12:31:04 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
 3/14 12:31:04 IO: Failed to read packet header
 3/14 12:31:04 ERROR "Assertion ERROR on (result)" at line 207 in file NTsenders.C
 3/14 12:31:04 ERROR "LocalUserLog::logStarterError() called before init()" at line 223 in file local_user_log.C

 This is quite representative of what I see on many other StarterLogs.
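
 For checking the same pattern on other nodes, here is a minimal Python sketch
 that pulls the timeout and error lines out of a StarterLog and keeps their
 timestamps. It assumes the "M/D HH:MM:SS message" layout shown above and
 treats any line without a timestamp as a continuation of the previous entry.
 The names parse_log and find_problems, the keyword list, and the year
 default are illustrative assumptions, not part of Condor.

```python
import re
from datetime import datetime

# Matches the "M/D HH:MM:SS message" prefix seen in the log excerpt above.
LINE_RE = re.compile(r"^\s*(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_log(text, year=2008):
    """Return (timestamp, message) pairs.

    Lines without a timestamp are assumed to be wrapped continuations
    and are appended to the previous message. The log has no year, so
    one is supplied by the caller (hypothetical default: 2008).
    """
    entries = []
    for raw in text.splitlines():
        m = LINE_RE.match(raw)
        if m:
            ts = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")
            entries.append([ts, m.group(2)])
        elif raw.strip() and entries:
            entries[-1][1] += " " + raw.strip()
    return [(ts, msg) for ts, msg in entries]


def find_problems(entries, keywords=("ERROR", "timeout", "Failed")):
    """Keep only entries whose message contains one of the keywords."""
    return [(ts, msg) for ts, msg in entries if any(k in msg for k in keywords)]


# A few lines copied from the log above, including one wrapped entry.
SAMPLE = """\
 3/14 12:19:41 Initialized IO Proxy.
 3/14 12:24:41 condor_read(): timeout reading 5 bytes from <10.7.7.250:60766>.
 3/14 12:24:41 IO: Failed to read packet header
 3/14 12:29:42 ERROR "Failed to transfer files" at line 1810 in file
jic_shadow.C
"""

if __name__ == "__main__":
    for ts, msg in find_problems(parse_log(SAMPLE)):
        print(ts.strftime("%H:%M:%S"), msg)
```

 Running this over the first starter session above makes the spacing visible:
 the first read timeout is logged five minutes after the IO proxy is
 initialized, and the second one about five minutes after that.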

 Pasquale



 On Fri, Mar 14, 2008 at 12:45 PM, Greg Thain <gthain@xxxxxxxxxxx> wrote:
 > Pasquale Tricarico wrote:
 >  > Thanks, Greg, for the suggestion. I've set
 >  > STARTER_UPLOAD_TIMEOUT = 3600 in the config file, restarted Condor, and
 >  > submitted again, but the shadow exception is still present:
 >
 >  Hmm.  Can you send or upload the StarterLog from the relevant machines?
 >
 >  -Greg
 >
 >