
Re: [HTCondor-users] Socket between submit and execute hosts closed unexpectedly



On 5/23/2014 11:29 AM, Szabolcs Horvátth wrote:
Hi,

When the socket between the submit and execute hosts is terminated and
they can't reconnect, which attributes specify how many times the
connection is retried and how long the delay between the attempts is?
Is it JobLeaseDuration and MAX_CLAIM_ALIVES_MISSED?

The condor_shadow process on the submit host will keep trying to 
reconnect to the condor_starter on the execute host for the number of 
seconds specified by job_lease_duration in the submit file; the default 
is 20 minutes.  See http://goo.gl/XxqlN5
As for the delay between attempts, that is controlled via
the (undocumented!) condor_config knobs RECONNECT_BACKOFF_CEILING
(default value of 300) and RECONNECT_BACKOFF_FACTOR (default value of
2.0).  The shadow will try to reconnect immediately, and then will do an
exponential backoff as specified by the backoff factor until it reaches
the ceiling value (try again after a 4 second delay, then 8, 16, 32, ...,
until it reaches 300 seconds, at which point it will try every 300
seconds until the job_lease_duration expires).
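
To make the schedule above concrete, here is a rough Python sketch that lists the delays between reconnect attempts under the default values. It is only an illustration of the arithmetic, not the shadow's actual code: the function and parameter names are made up, the 4 second first delay is taken from the example above, and the time spent on each connection attempt is ignored.

```python
def reconnect_delays(lease_seconds=1200, first_delay=4.0, factor=2.0, ceiling=300.0):
    """List the delays (in seconds) between reconnect attempts.

    Hypothetical helper, not part of HTCondor. Assumes the first attempt
    happens immediately at time 0 and that each attempt itself takes no time.
    """
    delays = []
    elapsed = 0.0
    delay = float(first_delay)
    # Keep scheduling attempts as long as the next one still falls
    # inside the lease (job_lease_duration).
    while elapsed + delay <= lease_seconds:
        delays.append(delay)
        elapsed += delay
        # Exponential backoff, capped at the ceiling value.
        delay = min(delay * factor, ceiling)
    return delays


if __name__ == "__main__":
    schedule = reconnect_delays()
    print("delays between attempts:", schedule)
    print("last attempt at t =", sum(schedule), "seconds")
```

Under these assumptions, with the default 20 minute lease, the delay grows from 4 to 256 seconds and is then capped at 300 for the remaining attempts. A longer job_lease_duration would only add more 300 second intervals at the end.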
hope the above helps,
Todd