HTCondor Project List Archives




[Condor-devel] RFC: starter-enforced eviction policy expressions



hi folks.  one of our high-prio support customers wants to be able to
have machine policies like "evict the job if the imagesize grows
larger than the allocated memory on the virtual machine where it's
running".  unfortunately, it's the starter that monitors the job's
imagesize, not the startd, so to make the above possible using
existing policy expressions, we'd have to provide a mechanism for the
starter to share info with the startd.

moreover, we might want similar sorts of eviction to happen in cases
where the starter is running without a startd at all (local universe,
gridshell, etc).  so, i'm proposing we add some additional policy
expressions and logic into the starter itself.  below i'm including
the write-up i did about how i think it should all work.  if anyone
has time to read, think about, and comment on this proposal, that'd be
swell.  thanks,

-d

------- Forwarded Message

the starter currently maintains the state of the job, either "running"
or "suspended".  this would be expanded so that the state could be any
of: "running", "suspended", "vacating" (graceful/soft eviction), or
"killing" (hard eviction).

admins would be able to define a few new expressions in their config
file to control the transitions between these states.  to simplify
things, "suspended" would still only be triggered by the startd.  if
we allowed the starter to also control suspend/resume, we could get
into very complicated situations where the startd told the starter to
suspend the job, but then the starter decided to resume it, etc.
plus, i don't see the desire for the starter to suspend a job on its
own.  mostly, i see these policy expressions used to allow eviction if
the job is "misbehaving".


whenever the job is either "running" or "suspended", the starter would
evaluate the "STARTER_EVICT" expression to decide if we should evict
the job (analogous to the startd's "PREEMPT" expression, which i think
is unfortunately named, and should probably be called "EVICT").  if
undefined, "STARTER_EVICT" would default to FALSE (don't do
starter-based eviction at all, let the startd's policy settings
control things).  if "STARTER_EVICT" becomes TRUE, the starter would
evaluate "STARTER_WANT_VACATE" (analogous to the startd's
"WANT_VACATE") to decide what kind of eviction to perform.  if
"STARTER_WANT_VACATE" is FALSE, we go into the "killing" state and
immediately hard-kill the job (and all its children) with a SIGKILL.
if "STARTER_WANT_VACATE" is TRUE, we go into the "vacating" state and
begin a graceful eviction (sending the "KillSig" specified in the job
classad, or SIGTERM by default).  if "STARTER_WANT_VACATE" is
undefined, it defaults to TRUE.

while the job is in the "vacating" state, the starter would also
evaluate the "STARTER_KILL" expression, to decide if it should give up
on the graceful eviction and move immediately to the hard-kill
eviction.  this is analogous to the startd's "KILL" expression.  if
undefined, this would default to FALSE (never move to hard killing,
allow graceful eviction to run its course).
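
to make the transitions above concrete, here's a minimal sketch of the
state logic in python.  this is not starter code, just an illustration:
the names JobState and next_state are made up for this example, and
each argument stands for the already-evaluated value of the matching
expression, with None meaning "not defined in the config file".

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"
    VACATING = "vacating"   # graceful/soft eviction
    KILLING = "killing"     # hard eviction


def next_state(state, evict=None, want_vacate=None, kill=None):
    """Return the job state after one round of policy evaluation.

    evict, want_vacate and kill stand for STARTER_EVICT,
    STARTER_WANT_VACATE and STARTER_KILL (hypothetical parameter names).
    None means the expression is undefined, so the proposed default applies.
    """
    evict = False if evict is None else evict                   # default FALSE
    want_vacate = True if want_vacate is None else want_vacate  # default TRUE
    kill = False if kill is None else kill                      # default FALSE

    if state in (JobState.RUNNING, JobState.SUSPENDED):
        if evict:
            return JobState.VACATING if want_vacate else JobState.KILLING
        return state
    if state is JobState.VACATING:
        # give up on the graceful eviction only if STARTER_KILL says so
        return JobState.KILLING if kill else state
    # "killing" has no outgoing transitions in this sketch
    return state
```

note that suspend/resume never shows up as a result here, since those
transitions stay under the startd's control.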

all of these expressions would be evaluated in the context of the copy
of the job classad used to spawn the job, and a starter-constructed
classad that contained information the starter had gathered about the
job, including image size, cpu usage (user and system rusage), total
wallclock runtime, and number of child processes.  so far, that's
all the info the starter is monitoring about the job.  additionally,
we would include attributes that listed the current job state
(described above) and time we entered the current state.  finally,
when the startd spawns the starter, it would pass in some information
about the static partitioning of shared resources between virtual
machines, so the starter would know the total RAM its VM was
configured to have.  dynamic attributes like disk space, swap space,
load average, etc are harder to share between the startd and starter,
but if it turned out those were important to have, we could add
support to have those attributes available to the starter's policy
expressions, too.
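
as a rough picture of what the starter-constructed ad might hold,
here's a sketch using a plain python dict in place of a real classad.
ImageSize, Memory, CurrentTime and EnteredCurrentState are the names
used in the example policy in this write-up; every other attribute
name, both function names, and the "starter ad first, then job ad"
lookup order are assumptions made only for illustration.

```python
def build_starter_ad(image_size_kb, user_cpu_secs, sys_cpu_secs,
                     job_start_time, num_children, state,
                     entered_current_state, vm_memory_mb, now):
    """Collect what the starter knows about the job into one dict-shaped ad."""
    return {
        # names that appear in the example policy in this write-up
        "ImageSize": image_size_kb,            # kbytes
        "Memory": vm_memory_mb,                # megs, passed in by the startd
        "CurrentTime": now,
        "EnteredCurrentState": entered_current_state,
        # hypothetical names for the remaining monitored values
        "UserCpu": user_cpu_secs,
        "SysCpu": sys_cpu_secs,
        "WallClockTime": now - job_start_time,
        "NumChildren": num_children,
        "State": state,
    }


def lookup(name, starter_ad, job_ad):
    """Resolve an attribute: starter ad first, then the job ad.

    Returns None when the attribute is in neither ad (i.e. undefined).
    The lookup order is an assumption, not part of the proposal.
    """
    if name in starter_ad:
        return starter_ad[name]
    return job_ad.get(name)
```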


so, in summary, the following new config file settings would be
supported:

STARTER_EVICT        # should we evict at all?
STARTER_WANT_VACATE  # if so, should we do a soft or hard eviction?
STARTER_KILL         # if we're soft evicting, should we move to hard?


an example of how you might use these:

STARTER_EVICT = ImageSize > (Memory * 1024)
STARTER_WANT_VACATE = True
STARTER_KILL = (CurrentTime - EnteredCurrentState) > 5 * $(MINUTE)

that would mean: evict any job where the image size has grown larger
than the allocated memory for this virtual machine.  do soft evict at
first, but if that takes more than 5 minutes, hard-kill the job.
(unfortunately, the ImageSize is computed/specified in kbytes, while
Memory uses megs, which is why you need the factor of 1024... future
versions of condor will avoid this problem).
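
to spell out the arithmetic in that example, here's a small python
sketch of the two comparisons.  the function and parameter names are
made up for illustration, and MINUTE is assumed to be 60 seconds, to
mirror $(MINUTE) in the config example.

```python
MINUTE = 60  # seconds; assumed to match $(MINUTE)


def starter_evict(image_size_kb, memory_mb):
    """ImageSize > (Memory * 1024): compare kbytes against megs scaled to kbytes."""
    return image_size_kb > memory_mb * 1024


def starter_kill(current_time, entered_current_state):
    """(CurrentTime - EnteredCurrentState) > 5 * $(MINUTE)."""
    return (current_time - entered_current_state) > 5 * MINUTE
```

for a VM configured with 512 megs, the threshold is 512 * 1024 =
524288 kbytes, so a job with an image size of 600000 kbytes would be
evicted, while one at exactly 524288 kbytes would not (the comparison
is strict).  likewise, the hard-kill only triggers once the job has
been in the "vacating" state for more than 300 seconds.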


NOTE: all of this would only be added to the starter that manages
vanilla, java, MPI and local universes.  i.e. these features would NOT
be available for standard or PVM universe.


one undecided issue is if the starter should notify the startd when it
decides to change the job state, so that the startd's state/activity
could reflect what the starter is doing.  if we make no such attempt,
the startd will still report the virtual machine as "Claimed/Busy",
even though the starter might be evicting the job.  however, adding
this kind of reporting would involve additional complications in the
startd and would delay providing this functionality, so if we do it,
it should probably be a "phase 2" change, something we can do
separately, after we get the initial changes outlined above working.


------- End of Forwarded Message