HTCondor Project List Archives



Re: [Condor-devel] Bug in Checkpoint when WantCheckpoint is False



Peter,

> Your patch causes unintended side effects and is rejected in its
> current form.
> 
> The 'WantCheckpoint = False' attribute turns a checkpoint request
> into a noop. Your code change would turn that noop into a fast vacate,
> automatically killing the job. The side effect becomes far more
> apparent when PERIODIC_CHECKPOINT is set to true and the user job
> (which could legitimately run on a machine to completion--as
> specified by an arbitrary pool policy) receives checkpoint signals
> at regular intervals.

I don't understand why.  My patch only takes effect if SIGTSTP
(checkpoint and vacate) was received; it makes no difference when
SIGUSR2 (checkpoint only) is received.  Why would PERIODIC_CHECKPOINT
be sending SIGTSTP instead of SIGUSR2?

> Currently, we do not have a method for differentiating between a
> checkpoint signal that is periodic and one received when the job
> enters preempting/vacating.

Then I must be missing something, because I thought this was exactly
the distinction between SIGUSR2 and SIGTSTP.

> So this can be implemented in terms of pool policy. Here is what I
> *think* it appears you truly wanted: "checkpoint normally on
> periodic checkpoints, but simply die if the checkpoint is because
> I'm vacating".

No, I only want it to simply die if that particular job has
'WantCheckpoint = False'.  Your solution would prevent all checkpoints
on vacate.
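
To make the intended behavior concrete, here is a rough sketch of the
per-signal decision as I understand it.  This is illustrative Python,
not the Condor source: the names Action and action_for are made up for
this example, and the signal meanings are the ones stated above
(SIGUSR2 = checkpoint only, SIGTSTP = checkpoint and vacate).

```python
import signal
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes, named only for this sketch."""
    NOOP = "ignore the request"
    CHECKPOINT = "checkpoint and keep running"
    CHECKPOINT_AND_VACATE = "checkpoint, then vacate"
    EXIT_WITHOUT_CHECKPOINT = "exit immediately, no checkpoint"


def action_for(signum, want_checkpoint):
    """Return the action the proposed patch would take for a signal.

    Assumes SIGUSR2 means "checkpoint only" and SIGTSTP means
    "checkpoint and vacate", as described in this thread.
    """
    if signum == signal.SIGUSR2:
        # Checkpoint-only request: unchanged by the patch, so a job with
        # WantCheckpoint = False still treats it as a noop.
        return Action.CHECKPOINT if want_checkpoint else Action.NOOP
    if signum == signal.SIGTSTP:
        # Vacate request: the only case the patch changes.  A job with
        # WantCheckpoint = False exits instead of ignoring the signal.
        if want_checkpoint:
            return Action.CHECKPOINT_AND_VACATE
        return Action.EXIT_WITHOUT_CHECKPOINT
    # Any other signal is outside the scope of this sketch.
    return Action.NOOP
```

Under these assumptions, a job with WantCheckpoint = False is only
affected when SIGTSTP arrives; jobs that want checkpoints still
checkpoint on vacate.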

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison