[Condor-users] Fwd: Shadow exception with LamMpi jobs
- Date: Fri, 14 Mar 2008 14:51:38 -0700
- From: "Pasquale Tricarico" <tricaric@xxxxxxx>
- Subject: [Condor-users] Fwd: Shadow exception with LamMpi jobs
I mistakenly did not send this to the list; here is the forward.
---------- Forwarded message ----------
From: Pasquale Tricarico <tricaric@xxxxxxx>
Date: Fri, Mar 14, 2008 at 1:03 PM
Subject: Re: [Condor-users] Shadow exception with LamMpi jobs
To: Greg Thain <gthain@xxxxxxxxxxx>
Here's the StarterLog.slot1 on one failing node:
3/14 12:19:40 ******************************************************
3/14 12:19:40 ** condor_starter (CONDOR_STARTER) STARTING UP
3/14 12:19:40 ** /usr/local/condor/sbin/condor_starter
3/14 12:19:40 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
3/14 12:19:40 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
3/14 12:19:40 ** PID = 1979
3/14 12:19:40 ** Log last touched 3/14 11:13:55
3/14 12:19:40 ******************************************************
3/14 12:19:40 Using config source: /home/condor/condor_config
3/14 12:19:40 Using local config sources:
3/14 12:19:40 /home/condor/condor_config.local
3/14 12:19:40 DaemonCore: Command Socket at <10.7.7.20:45987>
3/14 12:19:40 Done setting resource limits
3/14 12:19:41 Communicating with shadow <10.7.7.250:60766>
3/14 12:19:41 Submitting machine is "head.psi.edu"
3/14 12:19:41 setting the orig job name in starter
3/14 12:19:41 setting the orig job iwd in starter
3/14 12:19:41 Job has WantIOProxy=true
3/14 12:19:41 Initialized IO Proxy.
3/14 12:24:41 condor_read(): timeout reading 5 bytes from <10.7.7.250:60766>.
3/14 12:24:41 IO: Failed to read packet header
3/14 12:29:42 condor_read(): timeout reading 5 bytes from <10.7.7.250:60766>.
3/14 12:29:42 IO: Failed to read packet header
3/14 12:29:42 File transfer failed (status=0).
3/14 12:29:42 ERROR "Failed to transfer files" at line 1810 in file jic_shadow.C
3/14 12:29:42 ShutdownFast all jobs.
3/14 12:30:44 ******************************************************
3/14 12:30:44 ** condor_starter (CONDOR_STARTER) STARTING UP
3/14 12:30:44 ** /usr/local/condor/sbin/condor_starter
3/14 12:30:44 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
3/14 12:30:44 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
3/14 12:30:44 ** PID = 1991
3/14 12:30:44 ** Log last touched 3/14 12:29:42
3/14 12:30:44 ******************************************************
3/14 12:30:44 Using config source: /home/condor/condor_config
3/14 12:30:44 Using local config sources:
3/14 12:30:44 /home/condor/condor_config.local
3/14 12:30:44 DaemonCore: Command Socket at <10.7.7.20:52507>
3/14 12:30:44 Done setting resource limits
3/14 12:31:04 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
3/14 12:31:04 IO: Failed to read packet header
3/14 12:31:04 ERROR "Assertion ERROR on (result)" at line 207 in file NTsenders.C
3/14 12:31:04 ERROR "LocalUserLog::logStarterError() called before init()" at line 223 in file local_user_log.C
This is quite representative of what I see on many other StarterLogs.
Pasquale
On Fri, Mar 14, 2008 at 12:45 PM, Greg Thain <gthain@xxxxxxxxxxx> wrote:
> Pasquale Tricarico wrote:
> > Thanks, Greg, for the suggestion. I've set
> > STARTER_UPLOAD_TIMEOUT = 3600 in the config file, restarted Condor, and
> > submitted again, but the shadow exception is still present:
>
> Hmm. Can you send or upload the StarterLog from the relevant machines?
>
> -Greg
>
>
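
Since the same pattern reportedly shows up in many StarterLogs, a small script may help compare nodes. The following is only a rough sketch, not part of Condor: the function names `parse_starter_log` and `find_problems` are hypothetical, the year is assumed to be 2008 because the log lines carry no year, and the line format is inferred solely from the excerpt above. It folds mail-wrapped continuation lines back into their entry and lists the read timeouts and ERROR entries with their timestamps.

```python
import re
from datetime import datetime

# Matches the "M/D HH:MM:SS message" prefix seen in the StarterLog excerpt.
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_starter_log(text, year=2008):
    """Return a list of (timestamp, message) tuples.

    Lines without a timestamp are treated as continuations of the
    previous entry, since long log lines are often wrapped in mail.
    The year is an assumption; the log itself does not record it.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group(1)}", "%Y %m/%d %H:%M:%S"
            )
            entries.append((stamp, match.group(2)))
        elif entries:
            # Continuation of a wrapped line: append to the last entry.
            stamp, message = entries[-1]
            entries[-1] = (stamp, message + " " + line)
    return entries


def find_problems(entries):
    """Keep only read timeouts and ERROR entries."""
    return [
        (stamp, message)
        for stamp, message in entries
        if "timeout reading" in message or message.startswith("ERROR")
    ]


if __name__ == "__main__":
    import sys

    # Usage: python scan_starterlog.py StarterLog.slot1
    with open(sys.argv[1]) as handle:
        for stamp, message in find_problems(parse_starter_log(handle.read())):
            print(stamp.isoformat(sep=" "), message)
```

Run against the StarterLog of each failing node, the output makes it easier to see whether the timeouts always occur at the same interval after "Initialized IO Proxy."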