Re: [Condor-devel] Bug in Checkpoint when WantCheckpoint is False
- Date: Wed, 6 Apr 2005 16:44:42 -0500
- From: Peter Keller <psilord@xxxxxxxxxxx>
- Subject: Re: [Condor-devel] Bug in Checkpoint when WantCheckpoint is False
On Fri, Apr 01, 2005 at 01:55:16PM -0600, Daniel Forrest wrote:
> I have found a bug in Checkpoint() that occurs when a checkpoint is
> requested for a job that has 'WantCheckpoint = False'.
>
> When such a job is to vacate, it is sent SIGTSTP (checkpoint and exit).
> The checkpoint will fail (because CKPT_MODE_ABORT is set), but the
> job does not exit. It will then be killed 10 minutes (MaxVacateTime)
> later, but this is a waste of time.
>
> The following patch addresses this. Comments?
Your patch causes unintended side effects and is rejected in its
current form.
The 'WantCheckpoint = False' attribute turns a checkpoint request
into a no-op. Your code change would turn that no-op into a fast vacate,
automatically killing the job. The side effect becomes far more apparent
when PERIODIC_CHECKPOINT is set to true: the user job (which could
legitimately run on a machine to completion, as specified by an arbitrary
pool policy) receives checkpoint signals at regular intervals, and the
first of them would kill it.
Currently, we do not have a method for differentiating between a
checkpoint signal that is periodic and one received when the job enters
preempting/vacating.
Instead, this can be implemented in terms of pool policy. Here is what I
*think* you actually wanted: "checkpoint normally on periodic
checkpoints, but simply die if the checkpoint is because I'm vacating".
Here's how you'd implement that:
# Kill a job after ten minutes, or right away if it is a standard
# universe job
KILL = ((CurrentTime - EnteredCurrentActivity) > 10 * 60) || IsStandard
The above policy will hard kill any standard universe job IMMEDIATELY
upon entering the Preempting/Vacating state, but allow normal
periodic checkpoints to happen.
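For anyone who wants to drop this into a config file, here is a slightly
more spelled-out sketch of the same KILL policy. The helper macros
(MINUTE, STANDARD, MaxVacateTime, ActivityTimer, IsStandard) are
assumptions on my part and are not taken from the text above; the bare
IsStandard in the one-liner is presumably shorthand for a macro along
these lines. Rename or drop them to match whatever your own condor_config
already defines.

```
# Sketch only. The macro names below are assumptions; adjust them to
# match your own condor_config.
MINUTE        = 60
STANDARD      = 1
MaxVacateTime = (10 * $(MINUTE))
ActivityTimer = (CurrentTime - EnteredCurrentActivity)
IsStandard    = (TARGET.JobUniverse == $(STANDARD))

# Hard kill standard universe jobs as soon as the machine starts
# vacating them; every other job gets up to MaxVacateTime to exit.
KILL = ($(ActivityTimer) > $(MaxVacateTime)) || $(IsStandard)
```

Since KILL only comes into play once the machine is in
Preempting/Vacating, periodic checkpoint signals sent while the job is
still running are not affected by it.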
Thank you.
-pete