Re: [HTCondor-users] signals when a job is killed



> On Jan 15, 2015, at 8:45 AM, Ben Cotton <ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:
> 
> On Thu, Jan 15, 2015 at 7:10 AM, Krieger, Donald N. <kriegerd@xxxxxxxx> wrote:
> 
>> Is  there a signal sent?
>> If so, is it sent to the mother process of my job and which signal is used?
>> 
> By default, SIGTERM is used (except in the standard universe, where
> SIGTSTP is the default).
> 
>> Is there a way to control which signal is used? It would be simplest to catch SIGINT, signal 2.
>> 
> Yes, the kill_sig command in your submit file can specify the
> (integer) signal used when a job is getting the boot.
> See: http://research.cs.wisc.edu/htcondor/manual/v8.2/condor_submit.html
> 
>> And finally, is the function which returns results of the jobs disabled when a job is killed?
>> 
> The answer to this is fuzzier. By default, jobs will get some time to
> clean up after themselves (I believe this is 30 seconds), but that's
> configurable by site administrators, so it may be longer or shorter
> than the default. You'll want to specify the following in your submit
> file, though:
> when_to_transfer_output = ON_EXIT_OR_EVICT
> 
> From the manual:
> The ON_EXIT_OR_EVICT option is intended for fault tolerant jobs which
> periodically save their own state and can restart where they left off.
> In this case, files are spooled to the submit machine any time the job
> leaves a remote site, either because it exited on its own, or was
> evicted by the HTCondor system for any reason prior to job completion.
> The files spooled back are placed in a directory defined by the value
> of the SPOOL configuration variable. Any output files transferred back
> to the submit machine are automatically sent back out again as input
> files if the job restarts.
> 

Warning warning warning!

This may not do what you want.  If the job leaves the queue after an eviction (for example, if you have SYSTEM_PERIODIC_REMOVE set to remove jobs after a few attempts), your job output will be deleted from the spool.

Setting:

spool_on_evict = false

in your submit file will cause the output to be returned to your working directory on an eviction (as of 8.1.6).  Of course, if your job can restart from where it left off based on the intermediate files, this is *not* what you want.

Further, periodic_remove, periodic_hold, condor_rm, and condor_hold do *not* count as an eviction; your outputs would be lost in these cases.
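
For the restartable case, the job itself has to catch the signal and write its state before the grace period runs out. A minimal sketch in Python, assuming the job is a simple step loop; the file name checkpoint.json and the functions request_stop, save_state, load_state and run are made up for illustration and are not part of HTCondor:

```python
import json
import os
import signal
import tempfile

# Hypothetical checkpoint file name, chosen for this sketch only.
CHECKPOINT = "checkpoint.json"

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag; the main loop does the real work."""
    global stop_requested
    stop_requested = True


def save_state(state, path=CHECKPOINT):
    """Write to a temp file, then rename, so a hard kill never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_state(path=CHECKPOINT):
    """Resume from an earlier checkpoint if one is present, else start at step 0."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0}


def run(total_steps, path=CHECKPOINT):
    """Work until done or until a stop was requested, then save the state."""
    state = load_state(path)
    while state["next_step"] < total_steps and not stop_requested:
        # ... one unit of real work would go here ...
        state["next_step"] += 1
    save_state(state, path)
    return state


if __name__ == "__main__":
    # SIGTERM is the default mentioned above; SIGINT covers kill_sig = 2.
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)
    run(total_steps=1000)
```

The handler only sets a flag, so the checkpoint is always written from the main loop and each unit of work needs to be short compared with the grace period. Whether the checkpoint file ends up in the spool or in your working directory is governed by the settings discussed above.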

See the discussions here:

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4292 (original implementation of spool_on_evict)
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4679 (follow-up ticket when we realized this wasn't exactly what we wanted; implementation should be in 8.3.3).

Brian