Mailing List Archives
Re: [Condor-users] Shadow exception with LamMpi jobs
- Date: Fri, 14 Mar 2008 12:35:48 -0700
- From: "Pasquale Tricarico" <tricaric@xxxxxxx>
- Subject: Re: [Condor-users] Shadow exception with LamMpi jobs
Thanks, Greg, for the suggestion. I set STARTER_UPLOAD_TIMEOUT = 3600 in the
config file, restarted Condor, and resubmitted the job, but the shadow
exception still occurs:
000 (18790.000.000) 03/14 12:18:30 Job submitted from host: <10.7.7.250:38867>
...
014 (18790.000.014) 03/14 12:22:48 Node 14 executing on host: <10.7.7.14:39641>
...
014 (18790.000.021) 03/14 12:22:48 Node 21 executing on host: <10.7.7.14:39641>
...
014 (18790.000.010) 03/14 12:24:12 Node 10 executing on host: <10.7.7.13:51425>
...
014 (18790.000.019) 03/14 12:24:12 Node 19 executing on host: <10.7.7.14:39641>
...
014 (18790.000.017) 03/14 12:24:55 Node 17 executing on host: <10.7.7.14:39641>
...
014 (18790.000.013) 03/14 12:24:56 Node 13 executing on host: <10.7.7.13:51425>
...
014 (18790.000.015) 03/14 12:26:18 Node 15 executing on host: <10.7.7.13:51425>
...
014 (18790.000.016) 03/14 12:27:07 Node 16 executing on host: <10.7.7.13:51425>
...
014 (18790.000.026) 03/14 12:27:50 Node 26 executing on host: <10.7.7.17:36396>
...
014 (18790.000.029) 03/14 12:28:35 Node 29 executing on host: <10.7.7.17:36396>
...
014 (18790.000.031) 03/14 12:29:26 Node 31 executing on host: <10.7.7.17:36396>
...
014 (18790.000.001) 03/14 12:30:14 Node 1 executing on host: <10.7.7.11:58617>
...
007 (18790.000.000) 03/14 12:30:14 Shadow exception!
Error from starter on slot1@xxxxxxxxxxxxxx: Failed to transfer files
0 - Run Bytes Sent By Job
30215708672 - Run Bytes Received By Job
Pasquale
> Try setting in the config file
>
> STARTER_UPLOAD_TIMEOUT = 1200
>
> or set it to another large value, and see if the problem goes away.
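
When comparing runs with different STARTER_UPLOAD_TIMEOUT values, it can help to summarize the user log rather than read it event by event. The following is only a minimal sketch: it assumes the log layout seen in the excerpt above (a three-digit event code, a cluster.proc.node triple, a date and a time), and the names summarize, EVENT_RE, HOST_RE and BYTES_RE are hypothetical helpers, not part of Condor. It counts the nodes started per host, records when a shadow exception was logged, and collects the "Run Bytes" counters.

```python
import re
from collections import Counter

# Assumed event header layout, taken from the log excerpt in this thread:
#   014 (18790.000.014) 03/14 12:22:48 Node 14 executing on host: <10.7.7.14:39641>
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<node>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)
HOST_RE = re.compile(r"executing on host: <(?P<ip>[\d.]+):\d+>")
BYTES_RE = re.compile(
    r"^\s*(?P<n>\d+)\s+-\s+Run Bytes (?P<kind>Sent|Received) By Job"
)


def summarize(log_text):
    """Return (nodes per host, shadow exception timestamps, run byte counters)."""
    nodes_per_host = Counter()
    exceptions = []
    run_bytes = {}
    for line in log_text.splitlines():
        event = EVENT_RE.match(line)
        if event:
            host = HOST_RE.search(event["text"])
            if host:
                nodes_per_host[host["ip"]] += 1
            elif "Shadow exception" in event["text"]:
                exceptions.append((event["date"], event["time"]))
            continue
        counter = BYTES_RE.match(line)
        if counter:
            run_bytes[counter["kind"]] = int(counter["n"])
    return nodes_per_host, exceptions, run_bytes


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
014 (18790.000.014) 03/14 12:22:48 Node 14 executing on host: <10.7.7.14:39641>
...
014 (18790.000.010) 03/14 12:24:12 Node 10 executing on host: <10.7.7.13:51425>
...
014 (18790.000.019) 03/14 12:24:12 Node 19 executing on host: <10.7.7.14:39641>
...
007 (18790.000.000) 03/14 12:30:14 Shadow exception!
\tError from starter on slot1@xxxxxxxxxxxxxx: Failed to transfer files
\t0  -  Run Bytes Sent By Job
\t30215708672  -  Run Bytes Received By Job
"""

if __name__ == "__main__":
    nodes, exceptions, run_bytes = summarize(SAMPLE)
    for ip, count in sorted(nodes.items()):
        print(f"{ip}: {count} node(s) started")
    print("Shadow exceptions at:", exceptions)
    for kind, n in run_bytes.items():
        print(f"Run bytes {kind.lower()}: {n} ({n / 2**30:.2f} GiB)")
```

To run it against a real log, read the file and pass its contents to summarize, for example summarize(open(path).read()), where path points to whatever file the submit description's log setting names.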