[HTCondor-users] Increasing shadow->schedd timeout
- Date: Mon, 22 Feb 2016 12:19:08 -0600
- From: Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] Increasing shadow->schedd timeout
Hello.
We are having issues with our network filesystem that cause
condor_schedd and condor_shadow to sometimes hang for long periods of
time (I suspect when they try to update job logs), which I think causes
unnecessary job restarts.
Is it possible to increase the timeout for shadow->schedd connections?
We are using 8.3.8.
ShadowLog contains entries like:
attempt to connect to <172.16.223.61:49753> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (237 to go).
attempt to connect to <172.16.223.61:49753> failed: Connection timed out (connect errno = 110).
Can't connect to queue manager: CEDAR:6001:Failed to connect to <172.16.223.61:49753>
SchedLog contains entries like:
ERROR: Child pid 30961 appears hung! Killing it hard.
Shadow pid 30961 successfully killed because it was hung.
Shadow pid 30961 for job 101118639.0 exited with status 4
ERROR: Shadow exited with job exception code!
Thanks,
Vlad