Re: [Condor-users] quick question: is periodic vacate possible

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Many thanks for this - it looks to be exactly what I was after.

regards,

-ian.

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: 24 June 2010 22:49
To: Condor-Users Mail List
Subject: Re: [Condor-users] quick question: is periodic vacate possible

On Tue, Jun 22, 2010 at 5:29 AM, Smith, Ian <I.C.Smith@xxxxxxxxxxxxxxx> wrote:

OK I think I see how to go about this now. How would I write the
PREEMPT expression - presumably it would need to include
a WANT_VACATE==TRUE term (so that only jobs that save
their own checkpoints are vacated) and some way of determining
if the run time was greater than a given periodic checkpoint time

(I guess this value could be supplied via a job classad?).

Did you get this working?

You have control over how long Condor waits for a job to checkpoint itself when it wants to get the job off a machine.

If your jobs have:

+CheckpointJob = True

And then:

# Some helpful macros

StateTimer = (CurrentTime - EnteredCurrentState)
ActivityTimer = (CurrentTime - EnteredCurrentActivity)

# Preempt long running jobs

PREEMPT = ($(ActivityTimer) > 3600)

# WANT_VACATE gets checked when PREEMPT=True to see if we should

# vacate the job through a checkpointing call or proceed directly to killing

# the job. So move to Preempting/Vacating if this is a check-pointable job

WANT_VACATE = CheckpointJob =?= True

# Move to the Preempting/Killing state after 30 seconds in Preempting/Vacating

KILL = $(StateTimer) > 30

# And get real tough on things after another 30 seconds in the

# Preempting/Killing state

KILLING_TIMEOUT = 30

That's the approximate framework for things. Now you can tweak it to suit your needs. Perhaps your jobs take a variable, but deterministic, amount of time to vacate. In this case, if they supplied their estimated checkpointing time with a job ad:

+CheckpointTime = 120

You could try to reference it (I'm not sure this is 100% correct; TARGET. is always a tricky one to use):

# If the job told us how long to wait for it to checkpoint use that. Otherwise use

# the default of 30 seconds.

KILL = (!isUndefined(TARGET.CheckpointTime) && ($(StateTimer) > TARGET.CheckpointTime)) || (isUndefined(TARGET.CheckpointTime) && ($(StateTimer) > 30))

That needs to be verified. But it's a start.
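
For completeness, here is roughly what the job side might look like. This is only a sketch of a submit description file: the executable name is made up, the vanilla universe is an assumption, and CheckpointJob / CheckpointTime are just the custom attribute names used above, not anything Condor defines itself. It also assumes the job catches the soft-kill signal it receives on vacate and writes its own checkpoint before exiting.

```
# Hypothetical submit file for a self-checkpointing job
universe   = vanilla
executable = my_checkpointing_job

# Custom job ClassAd attributes read by the startd policy above
+CheckpointJob  = True
# Estimated seconds the job needs to write a checkpoint
+CheckpointTime = 120

queue
```

Jobs that leave out +CheckpointJob would evaluate WANT_VACATE to False and go straight to Preempting/Killing, so it is probably worth testing with one job of each kind.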

- Ian
