Re: [HTCondor-users] Increasing shadow->schedd timeout
- Date: Mon, 22 Feb 2016 13:00:22 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Increasing shadow->schedd timeout
On 2/22/2016 12:19 PM, Vladimir Brik wrote:
> Hello.
> We are having issues with our network filesystem that cause
> condor_schedd and condor_shadow to sometimes hang for long periods of
> time (I suspect when they try to update job logs), which I think causes
> unnecessary job restarts.
> Is it possible to increase the timeout for shadow->schedd connections?
Yes. You will want to use the knob SHADOW_NOT_RESPONDING_TIMEOUT. See the
entries below, cut-n-pasted from section 3.3 of the HTCondor Manual.
best regards,
Todd
NOT_RESPONDING_TIMEOUT
When an HTCondor daemon's parent process is another HTCondor
daemon, the child daemon will periodically send a short message to its
parent stating that it is alive and well. If the parent does not hear
from the child for a while, the parent assumes that the child is hung,
kills the child, and restarts the child. This parameter controls how
long the parent waits before killing the child. It is defined in terms
of seconds and defaults to 3600 (1 hour). The child sends its alive and
well messages at an interval of one third of this value.
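As a rough illustration of the entry above, a configuration fragment along
these lines would double the timeout for all daemons. The value 7200 is an
arbitrary example, not a recommendation.

```
# Illustrative value only: a parent waits 2 hours before assuming a child is hung.
NOT_RESPONDING_TIMEOUT = 7200
# Per the description above, children would then send their alive and well
# messages every 7200 / 3 = 2400 seconds.
```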
<SUBSYS>_NOT_RESPONDING_TIMEOUT
Identical to NOT_RESPONDING_TIMEOUT, but controls the timeout for a
specific type of daemon. For example, SCHEDD_NOT_RESPONDING_TIMEOUT
controls how long the condor_schedd's parent daemon will wait without
receiving an alive and well message from the condor_schedd before
killing it.
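For the shadow case asked about in this thread, a minimal sketch of the
daemon-specific form follows. The value is an arbitrary example, and the
comment about other daemons assumes that they fall back to
NOT_RESPONDING_TIMEOUT when no daemon-specific knob is set.

```
# Illustrative value only: the condor_schedd (the shadow's parent) waits
# 4 hours without an alive and well message before killing a condor_shadow.
SHADOW_NOT_RESPONDING_TIMEOUT = 14400
# Assumption: daemons without their own <SUBSYS>_ knob keep using
# NOT_RESPONDING_TIMEOUT.
```

Running condor_config_val SHADOW_NOT_RESPONDING_TIMEOUT afterwards should
show the value found in the configuration.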