Re: [Condor-users] result files upload problem
- Date: Mon, 17 Mar 2008 19:33:39 -0700
- From: "Pasquale Tricarico" <tricaric@xxxxxxx>
- Subject: Re: [Condor-users] result files upload problem
More on this:
007 (18811.000.000) 03/17 19:29:22 Shadow exception!
Assertion ERROR on (result)
56330838016 - Run Bytes Sent By Job
69064957952 - Run Bytes Received By Job
and:
/home/condor/log/ShadowLog:3/17 19:29:22 (18811.0) (23988):
condor_write(): Socket closed when trying to write 13 bytes to
<10.7.7.15:44456>, fd is 33
/home/condor/log/ShadowLog:3/17 19:29:22 (18811.0) (23988):
Buf::write(): condor_write() failed
/home/condor/log/ShadowLog:3/17 19:29:22 (18811.0) (23988): ERROR
"Assertion ERROR on (result)" at line 232 in file NTreceivers.C
Any ideas?
Thanks,
Pasquale
On Mon, Mar 17, 2008 at 1:42 PM, Pasquale Tricarico <tricaric@xxxxxxx> wrote:
> Hi,
>
> In our cluster, we're having a problem uploading the result files
> from the compute nodes to the cluster head node. The job is parallel
> and otherwise runs fine, but when it generates multi-GB files and
> copies them back at the end of the job, we get this in the job log
> file:
>
> 022 (18800.000.000) 03/17 13:18:55 007 (18800.000.000) 03/17 13:18:55
> Shadow exception!
> JobDisconnectedEvent::writeEvent() called without startd_addr
> 0 - Run Bytes Sent By Job
> 69064941568 - Run Bytes Received By Job
>
> We're also monitoring the cluster with Ganglia. The load on the head
> node is fine until the results transfer period, when it climbs above
> 10 and the head node becomes mostly unresponsive. After about 20
> minutes, all the jobs in the Condor cluster go idle (Ganglia
> estimate), with the head node load still above 10. After we receive
> the shadow exception, condor_q shows all jobs as IDLE, including
> jobs unrelated to this one that could otherwise keep running without
> problems. The shadow exception is emitted about 40 minutes after the
> job stops running on the nodes (Ganglia estimate), and
> STARTER_UPLOAD_TIMEOUT is currently set to 3600.
>
> Regards,
> Pasquale
>
> $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
>
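
For scale, the byte counters and timings quoted in this thread can be put into more readable units. The following is a minimal Python sketch, not part of the original messages: the helper names to_gib and within_timeout are hypothetical, and it assumes that STARTER_UPLOAD_TIMEOUT is expressed in seconds.

```python
# Values copied from the job log excerpts quoted in this thread.
RUN_BYTES_SENT = 56330838016          # job 18811, "Run Bytes Sent By Job"
RUN_BYTES_RECEIVED = 69064957952      # job 18811, "Run Bytes Received By Job"
EARLIER_BYTES_RECEIVED = 69064941568  # job 18800, "Run Bytes Received By Job"

# Assumption: the configured timeout value is in seconds.
STARTER_UPLOAD_TIMEOUT_S = 3600
# "About 40 minutes" between the job stopping and the shadow exception
# (a Ganglia estimate reported in the original message).
OBSERVED_DELAY_S = 40 * 60


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


def within_timeout(elapsed_s: int, timeout_s: int) -> bool:
    """Return True if the elapsed time is still below the timeout."""
    return elapsed_s < timeout_s


if __name__ == "__main__":
    print(f"sent:     {to_gib(RUN_BYTES_SENT):.2f} GiB")
    print(f"received: {to_gib(RUN_BYTES_RECEIVED):.2f} GiB")
    print(f"difference between the two runs: "
          f"{RUN_BYTES_RECEIVED - EARLIER_BYTES_RECEIVED} bytes")
    print(f"delay below timeout: "
          f"{within_timeout(OBSERVED_DELAY_S, STARTER_UPLOAD_TIMEOUT_S)}")
```

Under those assumptions, the reported delay of about 2400 seconds is shorter than the 3600 configured, which suggests the exception is not simply that timeout expiring, although the Ganglia figures are only estimates.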