Many thanks for this - it looks to be exactly what I was after.

regards,
-ian.

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal

On Tue, Jun 22, 2010 at 5:29 AM, Smith, Ian <I.C.Smith@xxxxxxxxxxxxxxx> wrote:
OK I think I see how to go about this now. How would I write the

Did you get this working?

You have control over how long Condor waits for a job to checkpoint
itself when it wants to get the job off a machine. If your jobs have:

    +CheckpointJob = True

And then:

    # Some helpful macros (the stock condor_config usually defines these already)
    StateTimer = (CurrentTime - EnteredCurrentState)
    ActivityTimer = (CurrentTime - EnteredCurrentActivity)

    # Preempt long running jobs
    PREEMPT = ($(ActivityTimer) > 3600)

    # WANT_VACATE gets checked when PREEMPT=True to see if we should
    # vacate the job through a checkpointing call or proceed directly
    # to killing the job. So move to Preempting/Vacating if this is
    # a check-pointable job
    WANT_VACATE = CheckpointJob =?= True

    # Move to the Preempting/Killing state after 30 seconds in
    # Preempting/Vacating
    KILL = $(StateTimer) > 30

    # And get real tough on things after another 30 seconds in the
    # Preempting/Killing state
    KILLING_TIMEOUT = 30

That's the approximate framework for things. Now you can tweak it to
suit your needs. Perhaps your jobs take a variable, but deterministic, amount
of time to vacate. In this case, if they supplied their estimated checkpointing
time in the job ad:

    +CheckpointTime = 120

You could try to reference it (I'm not sure this is 100% correct;
TARGET. is always a tricky one to use):

    # If the job told us how long to wait for it to checkpoint, use
    # that. Otherwise use the default of 30 seconds.
    KILL = (!isUndefined(TARGET.CheckpointTime) && ($(StateTimer) > TARGET.CheckpointTime)) || \
           (isUndefined(TARGET.CheckpointTime) && ($(StateTimer) > 30))

That needs to be verified. But it's a start.

- Ian
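
P.S. One way to sanity-check the intended KILL policy before touching a config file is to model it outside Condor. The sketch below is plain Python, not Condor code, and only mirrors the stated intent: wait for the job's own CheckpointTime when it supplied one, otherwise fall back to 30 seconds. The function name `should_kill` and the use of `None` to stand in for an undefined job attribute are assumptions made for illustration.

```python
from typing import Optional

# The 30-second default wait mentioned in the post.
DEFAULT_VACATE_SECONDS = 30


def should_kill(state_timer: int, checkpoint_time: Optional[int] = None) -> bool:
    """Return True once a vacating job has used up its allowance.

    state_timer     -- seconds spent so far in Preempting/Vacating
    checkpoint_time -- job-supplied estimate in seconds; None models an
                       undefined CheckpointTime attribute
    """
    if checkpoint_time is None:
        limit = DEFAULT_VACATE_SECONDS
    else:
        limit = checkpoint_time
    return state_timer > limit


if __name__ == "__main__":
    # Compare a job with no estimate against one that asked for 120 seconds.
    for timer in (10, 31, 119, 121):
        print(timer, should_kill(timer), should_kill(timer, checkpoint_time=120))
```

The thing to look for is that a job advertising 120 seconds is still left alone at 31 seconds, while a job with no estimate is killed at that point. Whatever expression ends up in the config should give the same answers.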