Re: [HTCondor-users] Jobs restarting
- Date: Thu, 19 Nov 2015 17:18:29 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Jobs restarting
On 11/18/2015 5:28 AM, Peter Ellevseth wrote:
> Hello all
>
> We have some trouble with condor restarting our jobs. This happens when
> there is some disturbance (backup job locking the disc) and the head
> loses touch with the working nodes. I have two questions
>
> 1. How can I change the time it takes before the head node orders a
> restart of a job.
If the submit machine fails to hear from the execute machine for more
than X seconds, where X is defined by JobLeaseDuration in the job's
submit file, then the job will be killed and restarted (potentially
someplace else).
By default, X is either 20 minutes or 40 minutes (depending on the
HTCondor version).
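The lease rule described above can be sketched in a few lines of Python. This is only an illustration of the timeout logic, not HTCondor's actual code: the `JobLease` class and its `renew` and `expired` methods are hypothetical names invented for this example, and only the idea of a lease length in seconds (JobLeaseDuration) comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class JobLease:
    """Hypothetical model of a job lease held by the submit machine."""
    duration: float      # lease length in seconds (cf. JobLeaseDuration)
    last_contact: float  # time the execute machine was last heard from

    def renew(self, now: float) -> None:
        # Each message from the execute machine restarts the clock.
        self.last_contact = now

    def expired(self, now: float) -> bool:
        # The job is killed and restarted only after MORE than
        # `duration` seconds of silence.
        return now - self.last_contact > self.duration


lease = JobLease(duration=3600, last_contact=0.0)
print(lease.expired(now=1800))  # 30 minutes of silence: lease still valid
print(lease.expired(now=3601))  # just over an hour: lease has run out
```

With a longer lease, a disturbance such as a backup job that blocks communication for a while no longer outlasts the lease, provided contact resumes before it runs out.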
You can set it explicitly in your job's submit file, e.g.:

    executable = foo.exe
    JobLeaseDuration = 3600
    queue
Or you can specify a default in the condor_config file that
condor_submit will pick up and use, e.g. append to your condor_config:

    JobLeaseDuration = 3600
    SUBMIT_EXPRS = $(SUBMIT_EXPRS) JobLeaseDuration
Some details in the Manual are at http://is.gd/ShifW8
> 2. Is it possible to change what is done when a restart is issued? Could
> I, instead of condor sending a SIGKILL to the job, tell it to run a
> script that shuts the job down safely?
I think Ben gave suggestions for this question in an earlier post...
> It would be preferable to have
> condor shut the job quietly down instead of restarting it.
Do you mean you don't want the job to restart? I.e. you want to run the
job once, and if there is a problem, have the job leave the queue
instead of restarting? If so, see the HOWTO at
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAvoidJobRestarts
Hope the above helps
Todd