[Condor-users] Large number of shadow exceptions due to Connection time out
- Date: Mon, 22 Nov 2010 09:43:29 +0100
- From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
- Subject: [Condor-users] Large number of shadow exceptions due to Connection time out
Hi all,
we are currently seeing a large number of shadows dying due to connection
timeouts. These are almost certainly caused by a couple of issues our network
has right now. However, is there any setting in Condor or the Linux kernel
that could mitigate this a bit as a short-term solution, before we can weed
out the networking problems at their root?
The messages look like this (a DAG with standard universe jobs):
000 (28889363.000.000) 11/15 23:19:55 Job submitted from host: <10.20.30.2:51388>
    DAG Node: A9630
001 (28889363.000.000) 11/15 23:20:27 Job executing on host: <10.10.4.85:36245>
007 (28889363.000.000) 11/15 23:24:22 Shadow exception!
    Unable to talk to job: Connection timed out
    281 - Run Bytes Sent By Job
    9824800 - Run Bytes Received By Job
001 (28889363.000.000) 11/15 23:24:33 Job executing on host: <10.10.9.8:43219>
007 (28889363.000.000) 11/15 23:33:32 Shadow exception!
    Unable to talk to job: Connection timed out
    313 - Run Bytes Sent By Job
    9824832 - Run Bytes Received By Job
001 (28889363.000.000) 11/15 23:33:44 Job executing on host: <10.10.0.90:51334>
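To get a rough idea of how many jobs are affected, an event log like the
excerpt above could be summarised with a short script. This is only a sketch:
the function name count_shadow_exceptions and the regular expression are
assumptions derived from the lines shown here, not something Condor provides,
and the pattern may need adjusting for other log variants.

```python
import re
from collections import Counter

# Event header as it appears in the excerpt:
# "<3-digit event code> (<cluster.proc.subproc>) <MM/DD> <HH:MM:SS> <text>"
# (assumed layout, taken only from the sample lines in this message)
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def count_shadow_exceptions(log_text):
    """Count event 007 (shadow exception) per job and tally the reasons.

    The reason is taken to be the first non-empty line after the 007 header.
    """
    per_job = Counter()
    reasons = Counter()
    expect_reason = False
    for raw in log_text.splitlines():
        line = raw.strip()
        match = EVENT_RE.match(line)
        if match:
            code, job_id = match.group(1), match.group(2)
            expect_reason = code == "007"
            if expect_reason:
                per_job[job_id] += 1
        elif expect_reason and line:
            reasons[line] += 1
            expect_reason = False
    return per_job, reasons


# Shortened copy of the excerpt, used as sample input.
SAMPLE = """\
001 (28889363.000.000) 11/15 23:20:27 Job executing on host: <10.10.4.85:36245>
007 (28889363.000.000) 11/15 23:24:22 Shadow exception!
    Unable to talk to job: Connection timed out
    281 - Run Bytes Sent By Job
007 (28889363.000.000) 11/15 23:33:32 Shadow exception!
    Unable to talk to job: Connection timed out
"""

if __name__ == "__main__":
    per_job, reasons = count_shadow_exceptions(SAMPLE)
    for job_id, count in per_job.items():
        print(job_id, count)
    for reason, count in reasons.items():
        print(count, reason)
```

Counting per job rather than per execute host keeps the sketch small; grouping
by the host from the preceding 001 event would be the natural next step for
narrowing down which parts of the network are affected.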
I'm currently looking into Linux's TCP settings to mitigate this, but any
advice would be helpful!
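As a starting point for that, the current values of a few Linux TCP sysctls
related to retransmission and keepalive can be dumped on each node. This is a
sketch under assumptions: the helper name read_tcp_settings and the selection
of sysctls are mine, and whether changing any of them actually helps with the
shadow exceptions is exactly the open question.

```python
from pathlib import Path

# Linux sysctls that influence how long TCP keeps trying before giving up.
# The selection is an assumption about what is worth inspecting first.
TCP_SETTINGS = (
    "tcp_retries2",
    "tcp_syn_retries",
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_tcp_settings(base="/proc/sys/net/ipv4", names=TCP_SETTINGS):
    """Return {name: int value}, or None where the file is missing or unreadable."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_tcp_settings().items():
        print(f"{name} = {value}")
```

Running this on the submit host and on a few execute nodes would at least show
whether the nodes differ from each other before anything is tuned.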
Cheers and TALIA
Carsten
--
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6
CaCert Assurer | Get free certificates from http://www.cacert.org/