
[Condor-users] MaxJobRetirementTime and Preemption problem



Hi all,

I have a cluster of dedicated Windows boxes running Condor 7.2.4. We have the boxes partitioned into 3 groups for prioritizing short, medium and long jobs (some machines prefer short jobs, others medium, etc.) using the RANK expression. These jobs are expected to finish within a certain amount of time: short jobs in 8 hours or less, medium jobs within 16 hours, long jobs within 36 hours.

Everything works as long as the users are running different types of jobs (short/medium/long). However, if user A is running a bunch of short jobs, and then user B, who hasn't been running jobs recently and thus has a better priority, comes along and launches more short jobs, we have a problem. What currently happens is that user A's jobs are immediately preempted and vacated/killed. What we want is for user A's jobs to be preempted by user B, but for the machine to give user A's currently running jobs a chance to complete (so for short jobs, up to 8 hours from the time they started running).

My understanding was that the MaxJobRetirementTime variable would allow the machine to go into retirement and give the jobs the specified amount of time to complete. We set this for each machine, copied the config over, and did a condor_reconfig and a condor_restart (we even rebooted some of the machines), but the jobs are still immediately preempted and killed.

The MaxJobRetirementTime expression also includes a second part: if the job the machine is running isn't the type of job it prefers, and the job wanting to preempt isn't the type it prefers either, then the machine should give the currently running job the amount of time it needs to complete. This doesn't appear to work either.

Any insight would be most helpful. The relevant part of the config file is below.

Thanks for any help,

Dustin

**********************************************************************************************

RANK			= (JOB_LENGTH == "Short") 





#####################################################################

##  This where you choose the configuration that you would like to

##  use.  It has no defaults so it must be defined.  We start this

##  file off with the UWCS_* policy.

######################################################################



##  Also here is what is referred to as the TESTINGMODE_*, which is

##  a quick hardwired way to test Condor with a simple no-preemption policy.

##  Replace UWCS_* with TESTINGMODE_* if you wish to do testing mode.

##  For example:

##  WANT_SUSPEND 		= $(UWCS_WANT_SUSPEND)

##  becomes

##  WANT_SUSPEND 		= $(TESTINGMODE_WANT_SUSPEND)



# When should we only consider SUSPEND instead of PREEMPT?

WANT_SUSPEND = FALSE



# When should we preempt gracefully instead of hard-killing?

WANT_VACATE = FALSE



##  When is this machine willing to start a job? 

START = TRUE



##  When should a local universe job be allowed to start?

START_LOCAL_UNIVERSE	= True

# Only start a local universe jobs if there are less

# than 100 local jobs currently running

#START_LOCAL_UNIVERSE	= TotalLocalJobsRunning < 100



##  When should a scheduler universe job be allowed to start?

START_SCHEDULER_UNIVERSE	= True

# Only start a scheduler universe jobs if there are less

# than 100 scheduler jobs currently running

#START_SCHEDULER_UNIVERSE	= TotalSchedulerJobsRunning < 100



##  When to suspend a job?

SUSPEND = FALSE



##  When to resume a suspended job?

CONTINUE		= $(UWCS_CONTINUE)



##  When to nicely stop a job?

##  (as opposed to killing it instantaneously)

PREEMPT = FALSE

PREF_JOB_LENGTH = "Short"



##  When to instantaneously kill a preempting job

##  (e.g. if a job is in the pre-empting stage for too long)

KILL			= $(UWCS_KILL)



PERIODIC_CHECKPOINT	= $(UWCS_PERIODIC_CHECKPOINT)

PREEMPTION_REQUIREMENTS	= $(UWCS_PREEMPTION_REQUIREMENTS)

PREEMPTION_RANK		= $(UWCS_PREEMPTION_RANK)

NEGOTIATOR_PRE_JOB_RANK = $(UWCS_NEGOTIATOR_PRE_JOB_RANK)

NEGOTIATOR_POST_JOB_RANK = $(UWCS_NEGOTIATOR_POST_JOB_RANK)

MaxJobRetirementTime    = $(HOUR) * 8 * (MY.JOB_LENGTH == "Short") + ( ( (MY.PreemptingRank < 1) && (PreemptingRank < 1) && (MY.PreemptingRank =!= UNDEFINED) ) * ( ( (MY.JOB_LENGTH == "Medium") * 16 * $(HOUR) ) + ( (MY.JOB_LENGTH == "Long") * 36 * $(HOUR) ) ) ) 

CLAIM_WORKLIFE          = $(UWCS_CLAIM_WORKLIFE)

PRIORITY_HALFLIFE	= 28800
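
To make the intent of the MaxJobRetirementTime line easier to check, here is the same arithmetic written out as a small Python sketch. It is only a model of what the expression is meant to compute, not of how Condor evaluates it. It assumes each comparison contributes 1 when true and 0 when false, and that JOB_LENGTH resolves to the running job's attribute. The function name and parameter names are made up for illustration, and ClassAd UNDEFINED handling is not modeled beyond the one explicit check.

```python
from typing import Optional

HOUR = 3600  # seconds, standing in for $(HOUR) in the config


def intended_retirement_seconds(
    job_length: str,
    my_preempting_rank: Optional[float],
    preempting_rank: float,
) -> int:
    """Model the intended value of the MaxJobRetirementTime expression.

    None stands in for UNDEFINED in the `=!= UNDEFINED` check only.
    """
    # First term: short jobs always get 8 hours.
    short_term = HOUR * 8 * int(job_length == "Short")

    # Gate for the second term: both rank values are below 1 and
    # MY.PreemptingRank is defined.
    gate = int(
        my_preempting_rank is not None
        and my_preempting_rank < 1
        and preempting_rank < 1
    )

    # Second term: medium jobs get 16 hours, long jobs 36 hours,
    # but only when the gate is open.
    other_term = gate * (
        int(job_length == "Medium") * 16 * HOUR
        + int(job_length == "Long") * 36 * HOUR
    )
    return short_term + other_term


if __name__ == "__main__":
    for args in [("Short", 1, 1), ("Medium", 0, 0), ("Long", 0, 0), ("Medium", 1, 0)]:
        print(args, intended_retirement_seconds(*args))
```

Read this way, a short job should always get 8 hours of retirement, while medium and long jobs get 16 or 36 hours only when both rank values are below 1; otherwise the expression comes out to 0, which means no retirement time at all.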