HTCondor Project List Archives




Re: [Condor-devel] Bug in Checkpoint when WantCheckpoint is False



On Fri, Apr 01, 2005 at 01:55:16PM -0600, Daniel Forrest wrote:
> I have found a bug in Checkpoint() that occurs when a checkpoint is
> requested for a job that has 'WantCheckpoint = False'.
> 
> When such a job is to vacate, it is sent SIGTSTP (checkpoint and exit).
> The checkpoint will fail (because CKPT_MODE_ABORT is set), but the
> job does not exit.  It will then be killed 10 minutes (MaxVacateTime)
> later, but this is a waste of time.
> 
> The following patch addresses this.  Comments?

Your patch causes unintended side effects and is rejected in its
current form.

The 'WantCheckpoint = False' attribute turns a checkpoint request
into a no-op. Your code change would turn that no-op into a fast vacate,
automatically killing the job. The side effect becomes far more apparent
when PERIODIC_CHECKPOINT is set to true and the user job (which could
legitimately run on a machine to completion, as specified by an arbitrary
pool policy) receives checkpoint signals at regular intervals: the first
periodic checkpoint signal would kill it.

Currently, we do not have a method for differentiating between a
checkpoint signal that is periodic and one received when the job enters
preempting/vacating. 

So this has to be implemented in terms of pool policy instead. Here is
what I *think* you truly wanted: "checkpoint normally on periodic
checkpoints, but simply die if the checkpoint is because I'm vacating".
Here's how you'd implement that:

# Kill a job after ten minutes, or right away if it is a standard universe 
# job
KILL = ((CurrentTime - EnteredCurrentActivity) > 10 * 60) || IsStandard

The above policy will hard kill any standard universe job IMMEDIATELY
upon entering the Preempting/Vacating state, but allow normal
periodic checkpoints to happen.
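
For completeness, here is a rough sketch of how the pieces might sit
together in the startd configuration. It assumes IsStandard is not
already defined in your config files and that you define it as a macro
the way the example configuration does (standard universe is
JobUniverse 1); adjust the names to whatever your pool already uses.

# Sketch only; IsStandard here is an assumed macro, defined locally
IsStandard = (TARGET.JobUniverse == 1)

# Hard kill after ten minutes of vacating, or right away for a
# standard universe job
KILL = ((CurrentTime - EnteredCurrentActivity) > 10 * 60) || $(IsStandard)

Since KILL is only evaluated once the machine is already in the
Preempting state, this does not affect periodic checkpoints taken while
the job is running normally.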

Thank you.

-pete