Re: [HTCondor-users] Jobs restarting
- Date: Fri, 27 Nov 2015 13:28:43 +0000
- From: Peter Ellevseth <Peter.Ellevseth@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Jobs restarting
Great stuff, I will try this as well as Ben's suggestion.
Thank you,
Peter
-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
Sent: 20 November 2015 00:18
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Jobs restarting
On 11/18/2015 5:28 AM, Peter Ellevseth wrote:
> Hello all
>
> We have some trouble with condor restarting our jobs. This happens
> when there is some disturbance (backup job locking the disc) and the
> head loses touch with the working nodes. I have two questions
>
> 1.How can I change the time it takes before the head node orders a
> restart of a job.
If the submit machine fails to hear from the execute machine for more than X seconds, where X is defined by JobLeaseDuration in the job's submit file, then the job will be killed and restarted (potentially someplace else).
By default, X is either 20 minutes or 40 minutes (depending on the HTCondor version).
You can set it explicitly in your job's submit file, e.g.
executable = foo.exe
JobLeaseDuration = 3600
queue
Or you can specify a default in the condor_config file that condor_submit will pick up and use, e.g. append the following to your condor_config
JobLeaseDuration = 3600
SUBMIT_EXPRS = $(SUBMIT_EXPRS) JobLeaseDuration
Some details in the Manual are at http://is.gd/ShifW8
>
> 2.Is it possible to change what is done when a restart is issued.
> Could I, instead of condor sending a SIGKILL to the job, tell it to
> run a script that shuts the job down safely?
I think Ben gave suggestions for this question in an earlier post...
> It would be preferable to have
> condor shut the job quietly down instead of restarting it.
>
Do you mean you don't want the job to restart? I.e. you want to run the job once, and if there is a problem, have the job leave the queue instead of restarting? If so, see the HOWTO at https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAvoidJobRestarts
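For reference, a rough sketch of what a run-once submit file could look like. This is written from memory of that HOWTO rather than copied from it, so treat the exact expressions as assumptions and check them against the page before relying on them. It assumes the job ClassAd attributes NumJobStarts and JobStatus (where 1 means idle), and foo.exe is just the placeholder executable used earlier in this message.

```
executable = foo.exe
# Only match a machine if the job has never been started before.
requirements = NumJobStarts == 0
# If the job is idle again (JobStatus == 1) after having started at
# least once, remove it from the queue instead of running it again.
periodic_remove = JobStatus == 1 && NumJobStarts > 0
queue
```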
Hope the above helps
Todd