Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

[HTCondor-users] Increasing shadow->schedd timeout

Date: Mon, 22 Feb 2016 12:19:08 -0600
From: Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx>
Subject: [HTCondor-users] Increasing shadow->schedd timeout

Hello.

We are having issues with our network filesystem that causes condor_schedd and condor_shadow to sometimes hang for long periods of time (I suspect when they try to update job logs), which I think causes unnecessary job restarts.

Is it possible to increase the timeout for shadow->schedd connections? We are using 8.3.8.


ShadowLog contains entries like:

attempt to connect to <172.16.223.61:49753> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (237 to go).
attempt to connect to <172.16.223.61:49753> failed: Connection timed out (connect errno = 110).
Can't connect to queue manager: CEDAR:6001:Failed to connect to <172.16.223.61:49753>


SchedLog contains entries like:

ERROR: Child pid 30961 appears hung! Killing it hard.
Shadow pid 30961 successfully killed because it was hung.
Shadow pid 30961 for job 101118639.0 exited with status 4
ERROR: Shadow exited with job exception code!
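
To gauge how often this happens, the two logs can be cross-referenced. The following is only a minimal sketch, assuming the log lines keep the wording shown in the excerpts above; the function name `summarize` and the regular expressions are illustrative and not part of HTCondor.

```python
import re
from collections import Counter

# Patterns copied from the wording of the log excerpts above (an assumption:
# other HTCondor versions may phrase these messages differently).
# re.search is used so that any prefix before the message is tolerated.
HUNG_RE = re.compile(r"Shadow pid (\d+) successfully killed because it was hung")
EXIT_RE = re.compile(r"Shadow pid (\d+) for job (\d+\.\d+) exited with status (\d+)")
CONNECT_RE = re.compile(
    r"attempt to connect to <([\d.]+:\d+)> failed: Connection timed out"
)


def summarize(sched_lines, shadow_lines):
    """Return (jobs whose shadow was killed as hung, timeout count per address)."""
    hung_pids = set()
    hung_jobs = []
    for line in sched_lines:
        m = HUNG_RE.search(line)
        if m:
            hung_pids.add(m.group(1))
            continue
        m = EXIT_RE.search(line)
        # Only report exits of shadows previously logged as killed for hanging.
        if m and m.group(1) in hung_pids:
            hung_jobs.append((m.group(2), int(m.group(3))))

    timeouts = Counter()
    for line in shadow_lines:
        # findall handles several messages fused onto one line.
        for addr in CONNECT_RE.findall(line):
            timeouts[addr] += 1
    return hung_jobs, timeouts
```

Called with the lines of SchedLog and ShadowLog, it returns the affected job IDs with their exit status and a per-address count of connection timeouts, which makes it easier to see whether the restarts line up with the filesystem hangs.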




Thanks,

Vlad

Follow-Ups:
- Re: [HTCondor-users] Increasing shadow->schedd timeout
  - From: Todd Tannenbaum
