Re: [HTCondor-users] some jobs die at daemon restart, some don't

Date: Sat, 06 Feb 2016 21:28:09 -0600
From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] some jobs die at daemon restart, some don't

Hi Chris,

Unfortunately, when the startd is restarted, all vanilla universe jobs are lost.

Typically disruptive restarts can be avoided by:
1) Running condor_reconfig to pick up new settings.  Almost all configuration changes can be applied without a restart.
2) Draining the node of jobs (condor_off -peaceful), then restarting.
  - This can be done centrally in order to do a rolling restart of the cluster.
3) I haven't tried it myself, but perhaps the Docker universe would reconnect?  Greg would know...
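
As a rough sketch of options 1 and 2 above, the following dry run only prints the commands one might issue per worker node; it does not execute them. The host names are hypothetical placeholders, and the use of `-daemon startd` and `-name` to target a single node's startd is an assumption about how a site would apply this. The final `condor_on` step is presumably only useful once the node has finished draining, so a real rollout would wait between the last two commands.

```shell
#!/bin/sh
# Dry-run sketch: print, but do not execute, the commands for one worker node.
# Host names passed in are hypothetical placeholders.
plan_node() {
    host="$1"
    # Option 1: re-read the configuration without restarting any daemon.
    echo "condor_reconfig -name $host"
    # Option 2: let running jobs finish (peaceful shutdown of the startd) ...
    echo "condor_off -peaceful -daemon startd -name $host"
    # ... then start the startd again once the node has drained.
    echo "condor_on -daemon startd -name $host"
}

# Rolling pass over several nodes, one node at a time.
for host in wn01.example.org wn02.example.org; do
    plan_node "$host"
done
```

Printing the plan first makes it easy to review which nodes would be touched before running anything against the pool.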

Restarts of the schedd and collector/negotiator shouldn't affect running jobs.

Brian

> On Feb 4, 2016, at 7:29 AM, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
> 
> 
> Hi,
> 
> When restarting the condor daemon on a worker node, the jobs on that node survive most of the time. Sometimes, though, a job dies; I presume that is the case when the job is actually writing to the shadow (?)
> 
> Is there a timeout or something similar that I can increase to keep all jobs happy during a daemon restart?
> 
> cheers
>        ~chris
> 
> 
> -- 
> /*   Christoph Beyer     |   Office: Building 2b / 23     *\
> *   DESY                |    Phone: 040-8998-2317        *
> *   - IT -              |      Fax: 040-8994-2317        *
> \*   22603 Hamburg       |     http://www.desy.de         */
