On Tue, Jun 22, 2010 at 5:29 AM, Smith, Ian
<I.C.Smith@xxxxxxxxxxxxxxx> wrote:
> OK I think I see how to go about this now. How would I write the
> PREEMPT expression - presumably it would need to include
> a WANT_VACATE==TRUE term (so that only jobs that save
> their own checkpoints are vacated) and some way of determining
> if the run time was greater than a given periodic checkpoint time
> (I guess this value could be supplied via a job classad?).
Did you get this working?
You have control over how long Condor waits for a job to checkpoint itself when it wants to get the job off a machine.
If your jobs have this in their submit files:
+CheckpointJob = True
And then, in the startd's configuration:
# Some helpful macros
StateTimer = (CurrentTime - EnteredCurrentState)
ActivityTimer = (CurrentTime - EnteredCurrentActivity)
# Preempt long running jobs
PREEMPT = ($(ActivityTimer) > 3600)
# WANT_VACATE gets checked when PREEMPT=True to see if we should
# vacate the job through a checkpointing call or proceed directly to killing
# the job. So move to Preempting/Vacating if this is a checkpointable job
WANT_VACATE = CheckpointJob =?= True
# Move to the Preempting/Killing state after 30 seconds in Preempting/Vacating
KILL = $(StateTimer) > 30
# And get real tough on things after another 30 seconds in the
# Preempting/Killing state
KILLING_TIMEOUT = 30
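If it helps to see the timeline, here is a rough Python sketch of how a slot would move through the states under the settings above. It is only a toy model of the policy, not Condor code: the function and constant names are made up for illustration, it assumes the job never exits on its own, and it ignores the fact that the startd only re-evaluates these expressions on a polling interval.

```python
# Toy model of the policy above; not Condor code. All names are hypothetical.
PREEMPT_AFTER = 3600    # PREEMPT: seconds in the current activity (Busy)
KILL_AFTER = 30         # KILL: seconds spent in the Preempting state
HARD_KILL_AFTER = 30    # further seconds allowed in Preempting/Killing


def slot_phase(busy_seconds, checkpoint_job):
    """Return the state/activity after busy_seconds of uninterrupted running."""
    if busy_seconds <= PREEMPT_AFTER:
        return "Claimed/Busy"
    in_preempting = busy_seconds - PREEMPT_AFTER
    # WANT_VACATE: only jobs that set CheckpointJob get the Vacating window;
    # everything else goes straight to Preempting/Killing.
    vacate_window = KILL_AFTER if checkpoint_job is True else 0
    if in_preempting <= vacate_window:
        return "Preempting/Vacating"
    if in_preempting <= vacate_window + HARD_KILL_AFTER:
        return "Preempting/Killing"
    return "job forcibly removed"


for seconds in (100, 3610, 3640, 3700):
    print(seconds, slot_phase(seconds, True), "|", slot_phase(seconds, False))
```

The thing to notice is that a job without CheckpointJob set skips the Vacating window entirely.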
That's the approximate framework for things. Now you can tweak it to suit your needs. Perhaps your jobs take a variable, but deterministic, amount of time to vacate. In that case, they could supply their estimated checkpointing time in the job ad:
+CheckpointTime = 120
You could try to reference it (I'm not sure this is 100% correct; TARGET. is always a tricky one to use):
# If the job told us how long to wait for it to checkpoint, use that. Otherwise use
# the default of 30 seconds.
KILL = (!isUndefined(TARGET.CheckpointTime) && ($(StateTimer) > TARGET.CheckpointTime)) || (isUndefined(TARGET.CheckpointTime) && ($(StateTimer) > 30))
That needs to be verified. But it's a start.
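To spell out the behaviour I'm after with that KILL expression, here it is as a small Python sketch. This is not how Condor evaluates ClassAds; `should_kill` and `DEFAULT_GRACE` are names I made up, and `None` stands in for a job that never set CheckpointTime. The point is that a job-supplied time should replace the 30 second default, not be OR'ed with it.

```python
# Sketch of the intended KILL policy; names are hypothetical, not Condor's.
DEFAULT_GRACE = 30  # seconds, the fallback when the job gives no estimate


def should_kill(state_timer, checkpoint_time=None):
    """Return True once the job has used up its checkpointing grace period.

    state_timer     -- seconds since the slot entered the Preempting state
    checkpoint_time -- the job's CheckpointTime, or None if it is undefined
    """
    grace = DEFAULT_GRACE if checkpoint_time is None else checkpoint_time
    return state_timer > grace


print(should_kill(45))        # no estimate supplied, default applies
print(should_kill(45, 120))   # job asked for 120 seconds
print(should_kill(121, 120))  # job has overrun its own estimate
```

A job that advertises 120 seconds is left alone at the 45 second mark, while a job with no estimate is killed at that point.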
- Ian