Re: [HTCondor-users] Understanding Condor Policies on Jobs
- Date: Fri, 22 Mar 2013 17:06:37 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Understanding Condor Policies on Jobs
On 3/22/2013 3:18 AM, Andrey Kuznetsov wrote:
I've been reading the documentation, and slowly figuring out what is
what, but some things are unclear.
If you are going to be the HTCondor admin at your site, you may be
interested in viewing one of our HTCondorWeek Administration tutorials
from a past HTCondor Week workshop (or come to Madison this May for live
tutorials!). Materials/slides from past HTCondor Weeks are online; back
in 2008 Red Hat was kind enough to record some of them. At URL
http://research.cs.wisc.edu/htcondor/tutorials/videos/cw2008/
see the "Administrating Condor" tutorial, which I think covers most of
the concepts/questions you are asking below.
I will take a quick pass at answering your questions inline below, but
of course the tutorial does a much better job than my pithy comments...
From the documentation, WANT_SUSPEND = A boolean expression that, when
True, tells Condor to evaluate the SUSPEND expression.
SUSPEND = A boolean expression that, when True, causes Condor to
suspend running a Condor job. The machine may still be claimed, but
the job makes no further progress, and Condor does not generate a load
on the machine.
From default config, UWCS_WANT_SUSPEND = ( $(SmallJob) ||
$(KeyboardNotBusy) || $(IsVanilla) ) && ( $(SUSPEND) )
So SUSPEND will be evaluated if the job is small, likely some kind of
error in the job, but I am having trouble understanding the rest.
The default UWCS policy expressions in the default config are not as
simple as they could (should?) be. For better or worse, these
expressions relate to the default policy that was in use at the
UW-Madison Computer Sciences department a while back. Something you
should know is there are a lot of standard universe (aka relinked with
condor_compile so they can checkpoint, and with 'universe=standard' in
the submit description file) jobs submitted at UW-Madison. Since
standard universe jobs can checkpoint and restart right where they left
off, the UWCS policy expressions are optimized in many places to take
advantage of that. If you are not relinking with condor_compile, you
are probably submitting vanilla universe jobs. Off the top of my head, a
simple setup for HTCondor to relinquish one processor core when someone
is typing either on the console or via ssh would be:
# Jobs can start anytime on slots > 1, and also can
# start on slot 1 if there has been no keyboard activity for 15 min
START = SlotID > 1 || KeyboardIdle > 900
# When we see keyboard activity on Slot1, send the job a SIGTERM
# and if the job is still around 10 seconds later send a SIGKILL.
WANT_SUSPEND = False
WANT_VACATE = True
PREEMPT = SlotID == 1 && KeyboardIdle < 60
MachineMaxVacateTime = 10
KILL = False
Note that all the slot (machine) ClassAds are numbered via an
attribute SlotID (SlotID=1, SlotID=2, etc.), and KeyboardIdle is the
number of seconds since the last keystroke on the keyboard (or over ssh).
Warning: I didn't test the above, I just wrote it in my email client :)
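As a rough illustration of what the policy above is meant to do, here is a small Python model of the two expressions. It is not HTCondor code: the function names are hypothetical, and preempt() follows the intent stated in the config comments (evict only the job on slot 1 when the keyboard becomes active).

```python
# Illustrative model only; HTCondor itself evaluates these as ClassAd
# expressions inside the startd. Function names are hypothetical.

def start(slot_id: int, keyboard_idle: int) -> bool:
    """Model of: START = SlotID > 1 || KeyboardIdle > 900"""
    return slot_id > 1 or keyboard_idle > 900


def preempt(slot_id: int, keyboard_idle: int) -> bool:
    """Intent from the comments: evict only the job on slot 1, and only
    when a keystroke was seen within the last 60 seconds."""
    return slot_id == 1 and keyboard_idle < 60


if __name__ == "__main__":
    # Each pair is (slot_id, seconds since the last keystroke).
    for slot_id, idle in [(1, 5), (1, 300), (1, 1200), (2, 5)]:
        print(f"slot{slot_id} idle={idle:>4}s "
              f"start={start(slot_id, idle)} preempt={preempt(slot_id, idle)}")
```

Note the gap between the two thresholds: on slot 1 with the keyboard idle for 300 seconds, neither expression is true, so a job that is already running keeps running but no new job starts there.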
More inline below...
1) Why is SUSPEND evaluated if there is no user at the keyboard
"KeyboardNotBusy", shouldn't it be the opposite? If the keyboard is
busy then I want the SUSPEND to be evaluated on the basis that someone
is using the machine, thus I want the job to be suspended to free
resources/processor for the user.
Note that UWCS_WANT_SUSPEND says "... $(KeyboardNotBusy) || $(IsVanilla)
...".
So for vanilla jobs, it indeed works the way you thought it should. It
is only if the job is not vanilla that KeyboardNotBusy comes into
play. The thinking here is if the job is standard universe, don't
bother suspending the job, just checkpoint and migrate it to a different
machine right away.
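The boolean structure of UWCS_WANT_SUSPEND can be checked with a short Python sketch. This only models the and/or structure with plain booleans (the parameter names stand in for the macros of the same name) and ignores UNDEFINED values.

```python
from itertools import product


def uwcs_want_suspend(small_job: bool, keyboard_not_busy: bool,
                      is_vanilla: bool, suspend: bool) -> bool:
    """Model of: ( $(SmallJob) || $(KeyboardNotBusy) || $(IsVanilla) ) && ( $(SUSPEND) )"""
    return (small_job or keyboard_not_busy or is_vanilla) and suspend


def vanilla_reduces_to_suspend() -> bool:
    """For vanilla jobs the first clause is always true, so the whole
    expression collapses to the value of SUSPEND."""
    return all(
        uwcs_want_suspend(small, kb_not_busy, True, susp) == susp
        for small, kb_not_busy, susp in product([False, True], repeat=3)
    )
```

For a job that is neither small nor vanilla, the result depends on KeyboardNotBusy: with a busy keyboard the expression is False even when SUSPEND is True, which is the "don't bother suspending, just checkpoint and migrate" case.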
2) Why is SUSPEND evaluated when the job is running in VANILLA
universe? We are submitting jobs under VANILLA universe and add our
own environmental variables inside the jobs. It doesn't make sense why
condor would attempt to suspend a VANILLA universe job.
The thinking is VANILLA jobs cannot necessarily be checkpointed, and
thus if they are bumped off the machine they would have to restart from
the beginning. So the idea of suspending the job for a few minutes
before killing it off is in hopes that the keyboard user will go away
soon. Kinda a bummer if you have a job that runs for 12 hours, and at
hour 11 a guy just checks his email for 3 minutes then leaves... may be
better to simply suspend the job for 3 minutes instead of forcing the
job to start over and lose 11 hours of computing. (Of course,
suspending may irritate some users... while a suspended job uses no CPU,
it will still consume RAM and/or virtual memory.)
3) Why is SUSPEND in WANT_SUSPEND since when WANT_SUSPEND=TRUE, then
SUSPEND is evaluated, seems kind of redundant?!
I guess it is not how I would have written it...
Regarding, UWCS_CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) &&
(KeyboardIdle > $(ContinueIdleTime)) )
ActivityTimer = Amount of time in seconds in the current activity.
4) What kind of activity is the timer tracking? CONTINUE is supposed
to reactivate a suspended job, that means that when the machine is
free from users and nothing is running on it, then ActivityTimer is
somehow supposed to be non-zero, and thus > 10, so what is it
tracking? Is ActivityTimer tracking the time since last user
click/interaction was made, thus if the user steps away for more than
10 seconds, condor job will continue/resume?
Slots in HTCondor are always in a specific state and activity. You see
this when you do condor_status. When HTCondor suspends a job (when
SUSPEND becomes true), that slot will change from activity "Busy" to
activity "Suspended" and then HTCondor evaluates CONTINUE. So in the
above, $(ActivityTimer) represents the number of seconds the slot
has been in the "Suspended" activity.
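A minimal Python sketch of that idea, assuming ActivityTimer is defined as time() - EnteredCurrentActivity (as in the condor_config_val output quoted further down). The Slot class and should_continue() are hypothetical names for illustration, and cpu_idle and continue_idle_time are passed in because their definitions are not shown in this thread.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    activity: str                    # e.g. "Busy" or "Suspended"
    entered_current_activity: float  # timestamp of the last activity change

    def set_activity(self, activity: str, now: float) -> None:
        # Changing activity restarts the timer.
        self.activity = activity
        self.entered_current_activity = now

    def activity_timer(self, now: float) -> float:
        # Seconds spent in the current activity.
        return now - self.entered_current_activity


def should_continue(slot: Slot, now: float, cpu_idle: bool,
                    keyboard_idle: float, continue_idle_time: float) -> bool:
    """Model of: $(CPUIdle) && ($(ActivityTimer) > 10) &&
    (KeyboardIdle > $(ContinueIdleTime)).
    The first check models that CONTINUE only matters while Suspended."""
    return bool(slot.activity == "Suspended"
                and cpu_idle
                and slot.activity_timer(now) > 10
                and keyboard_idle > continue_idle_time)
```

So the "> 10" clause does not track user clicks; it just keeps a slot from flipping back to Busy within the first 10 seconds of being suspended. The keyboard condition is the separate KeyboardIdle clause.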
5) What's the purpose of WANT_SUSPEND and SUSPEND? Seems like they
accomplish the same thing, except you run the check twice. Does
WANT_SUSPEND have some other kind of use?
While a job is running, if WANT_SUSPEND is True, the HTCondor startd will
continuously evaluate the SUSPEND expression. If WANT_SUSPEND is False,
it will not even look at the SUSPEND expression and will just
continuously evaluate the PREEMPT expression. So essentially it is
just a way to enable folks to write less complicated expressions.
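The gating described here can be sketched in Python as follows. This models only the description in this message; the real startd state machine has more transitions (CONTINUE, WANT_VACATE, KILL and so on), and startd_check() is a hypothetical name.

```python
from typing import Callable


def startd_check(want_suspend: bool,
                 suspend: Callable[[], bool],
                 preempt: Callable[[], bool]) -> str:
    """One policy check for a running job, per the description above.

    suspend and preempt are passed as callables so that the expression
    that is not consulted is never evaluated at all.
    """
    if want_suspend:
        return "suspend" if suspend() else "keep running"
    return "preempt" if preempt() else "keep running"
```

With want_suspend set to False, the suspend callable is never invoked, which mirrors "it will not even look at the SUSPEND expression".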
6) Why are some variables in the config in the bash form, and others
not, or is it a typo?
Take a look at where SUSPEND is evaluated:
UWCS_WANT_SUSPEND = ( $(SmallJob) || $(KeyboardNotBusy) ||
$(IsVanilla) ) && ( $(SUSPEND) )
UWCS_PREEMPT = ( ((Activity == "Suspended") && ($(ActivityTimer) >
$(MaxSuspendTime))) || (SUSPEND && (WANT_SUSPEND == False)) )
The ones in bash form, aka $(), are just simple macros expanded from elsewhere
in the condor_config file. The ones without $() are likely referring to
ClassAd attributes, which are either characteristics about the machine
or characteristics of the job. I think the tutorials cover this pretty
well...
7) Are variables case sensitive? In condor_config_val, they are
printed as all capitals, but in the defaults UWCS they are used often
as lower-case with first capital letters of the word:
"$(ActivityTimer)" vs "ACTIVITYTIMER = (time() -
EnteredCurrentActivity)"
Macro and attribute names are both case-insensitive. For instance,
$(Hour) and $(HOUR) are interchangeable.
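To make the macro/attribute distinction and the case-insensitivity concrete, here is a toy Python expander. It is an assumption-laden sketch, not how HTCondor parses its config: expand() is a hypothetical helper, unknown macros are simply left untouched, and self-referential macros are not handled.

```python
import re
from typing import Mapping

# Matches $(Name); bare names such as KeyboardIdle or Activity are not
# matched and stay in the expression as ClassAd attribute references.
MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand(expr: str, config: Mapping[str, str]) -> str:
    """Textually substitute $(Name) macros, ignoring case in names."""
    table = {name.lower(): value for name, value in config.items()}

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1).lower()
        if name not in table:
            return match.group(0)          # sketch choice: leave as-is
        return expand(table[name], config)  # macros may use other macros

    return MACRO.sub(substitute, expr)


if __name__ == "__main__":
    config = {
        "ACTIVITYTIMER": "(time() - EnteredCurrentActivity)",
        "MaxSuspendTime": "600",  # hypothetical value for illustration
    }
    print(expand('(Activity == "Suspended") && '
                 '($(ActivityTimer) > $(MaxSuspendTime))', config))
```

After expansion only attribute references (Activity, EnteredCurrentActivity) and literals remain, and $(ActivityTimer) resolves even though the config spells it ACTIVITYTIMER.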
8) How do you differentiate between variables set/updated by condor
and variables that you define? Like SUSPEND is defined in the config
by user, but "KeyboardIdle" is not in the config.
If it has $() it is from the config file, if it does not have $() that
means it is referring to an attribute about the machine (or job).
9) What is =?= and =!= ?
See
http://research.cs.wisc.edu/htcondor/manual/v7.9/4_1HTCondor_s_ClassAd.html#SECTION00513400000000000000
Essentially, what happens if you write foo == 5, but foo is not defined?
Should it be true? False? In HTCondor, it will not be True or False,
but will evaluate to UNDEFINED. This so-called three-value logic is
common in databases as well (think the Null value). Three-value logic
lets folks write policies that explicitly deal with cases where
information is missing (e.g. I want folks to submit jobs and tell me
their department in the submit file, and want to do something special if
someone forgot to specify their department). If you never want to deal
with UNDEFINED and just want good-ol boolean two-value logic, use =?=
instead of ==, and =!= instead of !=.
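A small Python model of the difference, using a sentinel object for UNDEFINED. This is a simplified sketch of the ClassAd semantics: it ignores the separate ERROR value, the function names are hypothetical, and the type check in is_identical() is an assumption about how strict =?= is.

```python
class _Undefined:
    def __repr__(self) -> str:
        return "UNDEFINED"


UNDEFINED = _Undefined()


def eq(a, b):
    """Model of ==: any UNDEFINED operand makes the result UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def is_identical(a, b) -> bool:
    """Model of =?=: always True or False, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def isnt_identical(a, b) -> bool:
    """Model of =!=: the negation of =?=."""
    return not is_identical(a, b)


def lookup(ad: dict, name: str):
    """Attribute reference: a missing attribute evaluates to UNDEFINED."""
    return ad.get(name, UNDEFINED)


if __name__ == "__main__":
    job = {"Owner": "andrey"}            # no Department attribute
    dept = lookup(job, "Department")
    print(eq(dept, "physics"))           # three-valued comparison
    print(is_identical(dept, "physics")) # two-valued comparison
    print(is_identical(dept, UNDEFINED)) # test for "forgot to specify"
```

The last call is the pattern for the department example: comparing against UNDEFINED with =?= is how a policy can detect that the attribute was never set.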
I am using:
SLOTS_CONNECTED_TO_CONSOLE = 1
SLOTS_CONNECTED_TO_KEYBOARD = 1
10) How does condor know which SlotID to reserve for the user when the
desktop is being used? Where is this set?
No idea off the top of my head. Note in my simple example above, I
didn't bother with SLOTS_CONNECTED_TO_KEYBOARD myself, and instead
explicitly referenced SlotID in my Start/Preempt expressions. Seems
more clear/explicit to me (but in more complex configurations it may
make more sense to use SLOTS_CONNECTED_TO_KEYBOARD...).
Here's what my SUSPEND looks like:
SUSPEND = ( ($(KeyboardBusy) || $(ConsoleBusy)) && ((SlotID <=
SLOTS_CONNECTED_TO_CONSOLE) || (SlotID <= SLOTS_CONNECTED_TO_CONSOLE))
&& $(ActivationTimer) > 90)
In other words, if console or keyboard is being used, and the SlotID
is 1, meaning processor #1 out of a total of 4 processors (cores) in
my computer, and the job is mature, has been running for some time,
then suspend the job.
PREEMPT = ( ((Activity == "Suspended") && ($(ActivityTimer) >
$(MaxSuspendTime))) || (SUSPEND) )
WANT_SUSPEND = ( $(SmallJob) || $(KeyboardBusy) || $(ConsoleBusy) )
CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) && (KeyboardIdle >
$(ContinueIdleTime)) )
I welcome any suggestions to improve my attempts at forcing condor to
relinquish 1 processor when a user is utilizing the computer.
Thank you very much for taking a look.
Hope the above helps and welcome to HTCondor,
Todd