
Re: [HTCondor-users] Understanding Condor Policies on Jobs



On 3/22/2013 3:18 AM, Andrey Kuznetsov wrote:
I've been reading the documentation, and slowly figuring out what is
what, but some things are unclear.

If you are going to be the HTCondor admin at your site, you may be 
interested in viewing one of our HTCondorWeek Administration tutorials 
from a past HTCondor Week workshop (or come to Madison this May for live 
tutorials!).  Materials/slides from past HTCondor Weeks are online; back 
in 2008 Red Hat was kind enough to record some of them. At URL
  http://research.cs.wisc.edu/htcondor/tutorials/videos/cw2008/
see the "Administrating Condor" tutorial, which I think covers most of the concepts/questions you are asking below.
I will take a quick pass at answering your questions inline below, but 
of course the tutorial does a much better job than my pithy comments...
From the documentation, WANT_SUSPEND = A boolean expression that, when
True, tells Condor to evaluate the SUSPEND expression.
SUSPEND = A boolean expression that, when True, causes Condor to
suspend running a Condor job. The machine may still be claimed, but
the job makes no further progress, and Condor does not generate a load
on the machine.
From default config, UWCS_WANT_SUSPEND = ( $(SmallJob) ||
$(KeyboardNotBusy) || $(IsVanilla) ) && ( $(SUSPEND) )
So SUSPEND will be evaluated if the job is small, likely some kind of
error in the job, but I am having trouble understanding the rest.
The default UWCS policy expressions in the default config are not as 
simple as they could (should?) be.  For better or worse, these 
expressions relate to the default policy that was in use at the 
UW-Madison Computer Sciences department a while back.  Something you 
should know is there are a lot of standard universe (aka relinked with 
condor_compile so they can checkpoint, and with 'universe=standard' in 
the submit description file) jobs submitted at UW-Madison.  Since 
standard universe jobs can checkpoint and restart right where they left 
off, the UWCS policy expressions are optimized in many places to take 
advantage of that.  If you are not relinking with condor_compile, you 
are probably submitting vanilla universe jobs.  Off the top of my head, a 
simple setup for HTCondor to relinquish one processor core when someone 
is typing either on the console or via ssh would be:
  # Jobs can start anytime on slots > 1, and also can
  # start on slot 1 if there has been no keyboard activity for 15 min
  START = SlotID > 1 || KeyboardIdle > 900
  # When we see keyboard activity on Slot1, send the job a SIGTERM
  # and if the job is still around 10 seconds later send a SIGKILL.
  WANT_SUSPEND = False
  WANT_VACATE = True
  PREEMPT = SlotID == 1 && KeyboardIdle < 60
  MachineMaxVacateTime = 10
  KILL = False

Note that all the slot (machine) classads will be numbered via an attribute SlotID (SlotID=1, SlotID=2, etc), and KeyboardIdle will be the number of seconds the keyboard (or ssh) has not had any keystrokes.
Warning: I didn't test the above, I just wrote it in my email client :)

More inline below...

1) Why is SUSPEND evaluated if there is no user at the keyboard
"KeyboardNotBusy", shouldn't it be the opposite? If the keyboard is
busy then I want the SUSPEND to be evaluated on the basis that someone
is using the machine, thus I want the job to be suspended to free
resources/processor for the user.
Note that UWCS_WANT_SUSPEND says "... $(KeyboardNotBusy) || $(IsVanilla) 
...".
So for vanilla jobs, it indeed works the way you thought it should.  It 
is only if the job is not vanilla that KeyboardNotBusy comes into 
play.  The thinking here is if the job is standard universe, don't 
bother suspending the job, just checkpoint and migrate it to a different 
machine right away.
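
To make that concrete, here is how the helper macros in that expression were defined in the stock condor_config of that era, as far as I recall. Treat these definitions as an assumption and check them against your own file with condor_config_val. If they match, note that SmallJob is about the job's memory image size (ImageSize is in KB), not about an error in the job:

```
# Assumed definitions; verify against your own condor_config.
MINUTE          = 60
KeyboardBusy    = (KeyboardIdle < $(MINUTE))
KeyboardNotBusy = ($(KeyboardBusy) == False)
# JobUniverse 5 is the vanilla universe
IsVanilla       = (TARGET.JobUniverse == 5)
SmallJob        = (TARGET.ImageSize < (15 * 1024))
```

With those in place, the first half of UWCS_WANT_SUSPEND is True for any vanilla job, for any small job, and for other jobs only while the keyboard is not busy.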
2) Why is SUSPEND evaluated when the job is running in VANILLA
universe? We are submitting jobs under VANILLA universe and add our
own environmental variables inside the jobs. It doesn't make sense why
condor would attempt to suspend a VANILLA universe job.
The thinking is VANILLA jobs cannot necessarily be checkpointed, and 
thus if they are bumped off the machine they would have to restart from 
the beginning. So the idea of suspending the job for a few minutes 
before killing it off is in hopes that the keyboard user will go away 
soon.  Kinda a bummer if you have a job that runs for 12 hours, and at 
hour 11 a guy just checks his email for 3 minutes then leaves...  may be 
better to simply suspend the job for 3 minutes instead of forcing the 
job to start over and lose 11 hours of computing.  (of course, 
suspending may irritate some users... while a suspended job uses no CPU, 
it will still consume RAM and/or virtual memory)
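
A stripped-down version of that suspend-first idea for vanilla jobs might look like the sketch below. It is untested, and the thresholds (60, 300 and 600 seconds) are illustrative assumptions rather than recommended values:

```
# Untested sketch: suspend the slot 1 job on keyboard activity,
# resume it after 5 idle minutes, and give up (preempt) once it
# has sat suspended for more than 10 minutes.
WANT_SUSPEND = True
SUSPEND      = (SlotID == 1) && (KeyboardIdle < 60)
CONTINUE     = (KeyboardIdle > 300)
PREEMPT      = (Activity == "Suspended") && ((time() - EnteredCurrentActivity) > 600)
```

The PREEMPT line follows the same pattern as the first half of UWCS_PREEMPT: it only fires for a job that is already suspended and has stayed that way too long.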

3) Why is SUSPEND in WANT_SUSPEND since when WANT_SUSPEND=TRUE, then
SUSPEND is evaluated, seems kind of redundant?!

I guess it is not how I would have written it...

Regarding, UWCS_CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) &&
(KeyboardIdle > $(ContinueIdleTime)) )
ActivityTimer = Amount of time in seconds in the current activity.
4) What kind of activity is the timer tracking? CONTINUE is supposed
to reactivate a suspended job, that means that when the machine is
free from users and nothing is running on it, then ActivityTimer is
somehow supposed to be non-zero, and thus > 10, so what is it
tracking? Is ActivityTimer tracking the time since last user
click/interaction was made, thus if the user steps away for more than
10 seconds, condor job will continue/resume?

Slots in HTCondor are always in a specific state and activity. You see 
this when you do condor_status. When HTCondor suspends a job (when 
SUSPEND becomes true), that slot will change from activity "Busy" to 
activity "Suspended" and then HTCondor evaluates CONTINUE.  So in the 
above, $(ActivityTimer) represents the number of seconds the slot 
has been in the "Suspended" activity.
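
Spelled out, using the ActivityTimer definition quoted from the config further down in this message (the CONTINUE line here is a cut-down illustration, not the full UWCS expression):

```
# ActivityTimer is a config macro built from a machine ClassAd
# attribute: seconds since the slot entered its current activity.
ActivityTimer = (time() - EnteredCurrentActivity)
# Illustration only: CONTINUE is evaluated while the slot is
# Suspended, so this reads "suspended for more than 10 seconds".
CONTINUE = ($(ActivityTimer) > 10)
```

So the timer is not tracking user clicks; that is what KeyboardIdle is for. The ActivityTimer term simply keeps a job from being resumed within seconds of being suspended.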
5) What's the purpose of WANT_SUSPEND and SUSPEND? Seems like they
accomplish the same thing, except you run the check twice. Does
WANT_SUSPEND have some other kind of use?

While a job is running, if WANT_SUSPEND is True, the HTCondor startd will 
continuously evaluate the SUSPEND expression.  If WANT_SUSPEND is False, 
it will not even look at the SUSPEND expression and will just 
continuously evaluate the PREEMPT expression.  So essentially it is 
just a way to enable folks to write less complicated expressions.
6) Why are some variables in the config in the bash form, and others
not, or is it a typo?
Take a look at where SUSPEND is evaluated:
UWCS_WANT_SUSPEND = ( $(SmallJob) || $(KeyboardNotBusy) ||
$(IsVanilla) ) && ( $(SUSPEND) )
UWCS_PREEMPT = ( ((Activity == "Suspended") && ($(ActivityTimer) >
$(MaxSuspendTime))) || (SUSPEND && (WANT_SUSPEND == False)) )

The ones in bash form aka $() are just simple macros expanded from elsewhere 
in the condor_config file.  The ones without $() are likely referring to 
ClassAd attributes, which are either characteristics about the machine 
or characteristics of the job.  I think the tutorials cover this pretty 
well...
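
A small illustration of the difference. The macro names and values below are assumptions chosen for the example; KeyboardIdle is a real machine ClassAd attribute:

```
# Config macros: plain text substitution done when the config is read.
MINUTE           = 60
ContinueIdleTime = 5 * $(MINUTE)
# KeyboardIdle has no $(), so it is left alone here and looked up
# in the slot's ClassAd each time the expression is evaluated.
CONTINUE = (KeyboardIdle > $(ContinueIdleTime))
# After macro expansion the startd effectively sees:
#   CONTINUE = (KeyboardIdle > 5 * 60)
```

Running condor_config_val CONTINUE shows the expanded form, while condor_status -long on a slot shows the attributes (such as KeyboardIdle) that the expression is evaluated against.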
7) Are variables case sensitive? In condor_config_val, they are
printed as all capitals, but in the defaults UWCS they are used often
as lower-case with first capital letters of the word:
"$(ActivityTimer)" vs "ACTIVITYTIMER = (time() -
EnteredCurrentActivity)"

Macro and attribute names are both case-insensitive.  For instance, 
$(Hour) and $(HOUR) are interchangeable.
8) How do you differentiate between variables set/updated by condor
and variables that you define? Like SUSPEND is defined in the config
by user, but "KeyboardIdle" is not in the config.

If it has $() it is from the config file, if it does not have $() that 
means it is referring to an attribute about the machine (or job).
9) What is =?= and =!= ?

See
http://research.cs.wisc.edu/htcondor/manual/v7.9/4_1HTCondor_s_ClassAd.html#SECTION00513400000000000000

Essentially, what happens if you write foo == 5, but foo is not defined? Should it be True? False? In HTCondor, it will not be True or False, but will evaluate to UNDEFINED. This so-called three-value logic is common in databases as well (think the Null value). Three-value logic lets folks write policies that explicitly deal with cases where information is missing (e.g. I want folks to submit jobs and tell me their department in the submit file, and want to do something special if someone forgot to specify their department). If you never want to deal with UNDEFINED and just want good-ol boolean two-value logic, use =?= instead of ==, and =!= instead of !=.
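
For the department example, a sketch might look like the following. Department is a hypothetical custom job attribute (a user would set it in the submit file with a line like +Department = "physics"), and the policy itself is untested:

```
# Suppose a job did NOT set Department. Then:
#   TARGET.Department ==  "physics"   evaluates to UNDEFINED
#   TARGET.Department =?= "physics"   evaluates to False
#   TARGET.Department =!= UNDEFINED   evaluates to False
# Only start jobs that told us their department:
START = (TARGET.Department =!= UNDEFINED)
```

One more difference to be aware of: == compares strings case-insensitively, while =?= requires an exact, case-sensitive match.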

I am using:
SLOTS_CONNECTED_TO_CONSOLE = 1
SLOTS_CONNECTED_TO_KEYBOARD = 1

10) How does condor know which SlotID to reserve for the user when the
desktop is being used? Where is this set?

No idea off the top of my head.  Note in my simple example above, I 
didn't bother with SLOTS_CONNECTED_TO_KEYBOARD myself, and instead 
explicitly referenced SlotID in my Start/Preempt expressions.  Seems 
more clear/explicit to me (but in more complex configurations it may 
make more sense to use SLOTS_CONNECTED_TO_KEYBOARD...).
Here's what my SUSPEND looks like:
SUSPEND = ( ($(KeyboardBusy) || $(ConsoleBusy)) && ((SlotID <=
SLOTS_CONNECTED_TO_CONSOLE) || (SlotID <= SLOTS_CONNECTED_TO_CONSOLE))
&& $(ActivationTimer) > 90)
In other words, if console or keyboard is being used, and the SlotID
is 1, meaning processor #1 out of a total of 4 processors (cores) in
my computer, and the job is mature, has been running for some time,
then suspend the job.
PREEMPT = ( ((Activity == "Suspended") && ($(ActivityTimer) >
$(MaxSuspendTime))) || (SUSPEND) )
WANT_SUSPEND = ( $(SmallJob) || $(KeyboardBusy) || $(ConsoleBusy) )
CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) && (KeyboardIdle >
$(ContinueIdleTime)) )

I welcome any suggestions to improve my attempts at forcing condor to
relinquish 1 processor when a user is utilizing the computer.

Thank you very much for taking a look.


Hope the above helps and welcome to HTCondor,
Todd