Re: [Condor-users] Preemption and Priority Issue
- Date: Wed, 13 Jul 2011 14:22:44 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Preemption and Priority Issue
Felix Wolfheimer wrote:
> If the preemption happens because of user priority or machine rank you
> may try to set
>
> PREEMPTION_REQUIREMENTS = False
>
> which should switch off this type of preemption.
Actually, afaik, setting PREEMPTION_REQUIREMENTS to false will disable
user priority preemption but will not disable machine rank preemption.
The Condor Manual actually has a very enlightening section specifically
on disabling preemption, including how to do so, and also talking about
the (often overlooked) consequences and other (perhaps better)
alternatives such as allowing preemption to occur only at job
boundaries. See
http://www.cs.wisc.edu/condor/manual/v7.6/3_5Policy_Configuration.html#SECTION00459500000000000000
For convenience and in case there are follow-up questions, I
cut-n-pasted that section of the manual below.
regards,
Todd
3.5.9.5 Disabling Preemption
Preemption can result in jobs being killed by Condor. When this happens,
the jobs remain in the queue and will be automatically rescheduled. We
highly recommend designing jobs that work well in this environment,
rather than simply disabling preemption.
Planning for preemption makes jobs more robust in the face of other
sources of failure. One way to live happily with preemption is to use
Condor's standard universe, which provides the ability to produce
checkpoints. If a job is incompatible with the requirements of standard
universe, the job can still gracefully shut down and restart by
intercepting the soft kill signal.
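For a vanilla universe job, one way to do this is to have the executable
itself catch the soft kill signal and save its state before exiting. The
submit description file below is only a sketch of that idea; the file
names and the wrapper my_job.sh are made-up examples, not something taken
from the manual.

# Sketch of a submit description file for a job whose executable
# (my_job.sh, a hypothetical wrapper) traps SIGTERM, saves its own
# state, and exits cleanly when Condor asks it to vacate.
universe   = vanilla
executable = my_job.sh
output     = my_job.out
error      = my_job.err
log        = my_job.log
# Deliver SIGTERM as the soft kill signal so the wrapper can intercept it.
kill_sig   = SIGTERM
queue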
All that being said, there may be cases where it is appropriate to force
Condor to never kill jobs within some upper time limit. This can be
achieved with the following policy in the configuration of the execute
nodes:
# When we want to kick a job off, let it run uninterrupted for
# up to 2 days before forcing it to vacate.
MAXJOBRETIREMENTTIME = $(HOUR) * 24 * 2
Construction of this expression may be more complicated. For example, it
could provide a different retirement time to different users or
different types of jobs. Also be aware that the job may come with its
own definition of MaxJobRetirementTime, but this may only cause less
retirement time to be used, never more than what the machine offers.
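As a sketch of such a policy (the user name "alice" and the hour values
are made up purely for illustration), the execute node configuration
could compute the retirement time from the job's attributes with a
ClassAd ifThenElse():

# Hypothetical example: jobs owned by "alice" get 48 hours of
# retirement time, all other jobs get 6 hours.
MAXJOBRETIREMENTTIME = ifThenElse(TARGET.Owner == "alice", $(HOUR) * 48, $(HOUR) * 6)

On the job side, a submit description file can advertise a smaller value;
for example, a job that keeps no state and gains nothing from retirement
time might include:

# The job asks for no retirement time; this can only reduce,
# never extend, what the machine offers.
+MaxJobRetirementTime = 0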
The longer the retirement time that is given, the slower reallocation of
resources in the pool can become if there are long-running jobs.
However, by preventing jobs from being killed, you may decrease the
number of cycles that are wasted on non-checkpointable jobs that are
killed. That is the basic trade-off.
Note that the use of MAXJOBRETIREMENTTIME limits the killing of jobs,
but it does not prevent the preemption of resource claims. Therefore, it
is technically not a way of disabling preemption, but simply a way of
forcing preempting claims to wait until an existing job finishes or runs
out of time. In other words, it limits the preemption of jobs but not
the preemption of claims.
Limiting the preemption of jobs is often more desirable than limiting
the preemption of resource claims. However, if you really do want to
limit the preemption of resource claims, the following policy may be
used. Some of these settings apply to the execute node and some apply to
the central manager, so this policy should be placed in configuration
that is read by both.
# Disable preemption by machine activity.
PREEMPT = False
# Disable preemption by user priority.
PREEMPTION_REQUIREMENTS = False
# Disable preemption by machine RANK by ranking all jobs equally.
RANK = 0
# Since we are disabling claim preemption, we
# may as well optimize negotiation for this case:
NEGOTIATOR_CONSIDER_PREEMPTION = False
Be aware of the consequences of this policy. Without any preemption of
resource claims, once the condor_negotiator gives the condor_schedd a
match to a machine, the condor_schedd may hold onto this claim
indefinitely, as long as the user keeps supplying more jobs to run. If
this is not desired, force claims to be retired after some amount of
time using CLAIM_WORKLIFE. This enforces a time limit, beyond which no
new jobs may be started on an existing claim; therefore the
condor_schedd daemon is forced to go back to the condor_negotiator to
request a new match, if there is still more work to do. Example execute
machine configuration to include in addition to the example above:
# after 20 minutes, schedd must renegotiate to run
# additional jobs on the machine
CLAIM_WORKLIFE = 1200