HTCondor Project List Archives




Re: [Condor-devel] RFC: starter-enforced eviction policy expressions



On Apr 8, 2005 12:35 AM, Derek Wright <wright@xxxxxxxxxxx> wrote:
> 
> hi folks.  one of our high-prio support customers wants to be able to
> have machine policies like "evict the job if the imagesize grows
> larger than the allocated memory on the virtual machine where it's
> running".  unfortunately, it's the starter that monitors the job's
> imagesize, not the startd, so to make the above possible using
> existing policy expressions, we'd have to provide a mechanism for the
> starter to share info with the startd.

Makes sense

external forces ->  startd  -> starter  -> job

job specific details should go via the starter

> moreover, we might want similar sorts of eviction to happen in cases
> where the starter is running without a startd at all (local universe,
> gridshell, etc).  so, i'm proposing we add some additional policy
> expressions and logic into the starter itself.  below i'm including
> the write-up i did about how i think it should all work.  if anyone
> has time to read, think about, and comment on this proposal, that'd be
> swell.  thanks,

nice idea - look forward to more of them...

> the starter currently maintains the state of the job, either "running"
> or "suspended".  this would be expanded so that the state could be any
> of: "running", "suspended", "vacating" (graceful/soft eviction), or
> "killing" (hard eviction).

what about retirement? how will this fit in...

> admins would be able to define a few new expressions in their config
> file to control the transitions between these states.  to simplify
> things, "suspended" would still only be triggered by the startd.

indeed - suspended should be a response to an external action, a job
can take care of pausing itself if it ever wants to.

> i see these policy expressions used to allow eviction if
> the job is "misbehaving".

good to have a clear use case in advance.

> whenever the job is either "running" or "suspended", the starter would
> evaluate the "STARTER_EVICT" expression to decide if we should evict
> the job (analogous to the startd's "PREEMPT" expression, which i think
> is unfortunately named, and should probably be called "EVICT").  if
> undefined, "STARTER_EVICT" would default to FALSE (don't do
> starter-based eviction at all, let the startd's policy settings
> control things).  if "STARTER_EVICT" becomes TRUE, the starter would
> evaluate "STARTER_WANT_VACATE" (analogous to the startd's
> "WANT_VACATE") to decide what kind of eviction to perform.  if
> "STARTER_WANT_VACATE" is FALSE, we go into the "killing" state and
> immediately hard-kill the job (and all its children) with a SIGKILL.

Will the starter's vacate signals match those defined in the submit file?

> if "STARTER_WANT_VACATE" is TRUE, we go into the "vacating" state and
> begin a graceful eviction (sending the "KillSig" specified in the job
> classad, or SIGTERM by default).  if "STARTER_WANT_VACATE" is
> undefined, it defaults to TRUE.

What happens if the job specified it didn't want to vacate?

If STARTER_KILL is false (or never evaluates to true), what happens?
Do the startd's expressions kick in if they want to trigger a vacate?
How about if someone issues a vacate command directly, with or
without -fast?

I think a state transition diagram with the (potentially noop)
responses to all the various actions/commands/expression evaluations
would be useful
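
Until there is a proper diagram, here is a rough Python sketch of the
transitions as I read the proposal. JobState, next_state and the policy
dict are my own illustrative names, not anything in Condor. Suspend and
resume driven by the startd, and direct vacate commands, are left out
on purpose, since those are exactly the open questions.

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"
    VACATING = "vacating"   # graceful/soft eviction
    KILLING = "killing"     # hard eviction


def next_state(state, policy):
    """Return the state after one round of policy evaluation.

    `policy` maps expression names to True/False; a missing key stands
    for an expression the admin left undefined (hypothetical encoding).
    """
    if state in (JobState.RUNNING, JobState.SUSPENDED):
        # STARTER_EVICT defaults to FALSE when undefined.
        if not policy.get("STARTER_EVICT", False):
            return state
        # STARTER_WANT_VACATE defaults to TRUE when undefined.
        if policy.get("STARTER_WANT_VACATE", True):
            return JobState.VACATING
        return JobState.KILLING
    if state is JobState.VACATING:
        # STARTER_KILL defaults to FALSE when undefined.
        if policy.get("STARTER_KILL", False):
            return JobState.KILLING
        return state
    # Nothing in the proposal leads back out of "killing".
    return state
```

With an empty policy nothing ever leaves "running" or "suspended",
which matches the stated defaults.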

> while the job is in the "vacating" state, the starter would also
> evaluate the "STARTER_KILL" expression, to decide if it should give up
> on the graceful eviction and move immediately to the hard-kill
> eviction.  this is analogous to the startd's "KILL" expression.  if
> undefined, this would default to FALSE (never move to hard killing,
> allow graceful eviction to run its course).
> 
> all of these expressions would be evaluated in the context of the copy
> of the job classad used to spawn the job, and a starter-constructed
> classad that contained information the starter had gathered about the
> job, including image size, cpu usage (user and system rusage), total
> wallclock runtime, and number of children processes.  so far, that's
> all the info the starter is monitoring about the job.

If you are looking at misbehaving jobs you may want to track:

Network IO usage (very tricky what with multiple processes)
Number of open file handles/descriptors (since these are a limited resource)

An additional, possibly useful piece of functionality is some concept
of heartbeating.
If the job created a file such as .condor.heartbeat (making the name
configurable is probably a good idea) and touched it every so often,
the starter could publish the file's timestamp as a classad
attribute like "LastFileHeartbeat".

Then the expressions could say things like:

CurrentTime - LastFileHeartbeat > 10 * Minute

and such like.
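
To make the heartbeat idea concrete, here is a minimal Python sketch of
the check the starter might do. The function names are made up for
illustration, and treating a missing file as "stale" is my assumption;
the real thing would need to decide what an absent heartbeat means.

```python
import os
import time


def last_file_heartbeat(path):
    """Return the heartbeat file's mtime in seconds since the epoch, or None if absent."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def heartbeat_stale(path, max_age_seconds, now=None):
    """True if the heartbeat file is missing or older than max_age_seconds."""
    now = time.time() if now is None else now
    last = last_file_heartbeat(path)
    if last is None:
        return True  # assumption: no file counts as no heartbeat
    return (now - last) > max_age_seconds
```

Calling heartbeat_stale(".condor.heartbeat", 10 * 60) corresponds to
the ten minute expression above.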


> NOTE: all of this would only be added to the starter that manages
> vanilla, java, MPI and local universes.  i.e. these features would NOT
> be available for standard or PVM universe.

out of interest why not? not that it matters personally being firmly
"clipped" :)

> one undecided issue is if the starter should notify the startd when it
> decides to change the job state, so that the startd's state/activity
> could reflect what the starter is doing.  if we make no such attempt,
> the startd will still report the virtual machine as "Claimed/Busy",
> even though the starter might be evicting the job.

Not nice - this should be viewed as a temporary stopgap to enable
testing of the functionality...

Though on a related point - are these classads going to be
broadcast/stored by the collector?

> however, adding
> this kind of reporting would involve additional complications in the
> startd and would delay providing this functionality, so if we do it,
> it should probably be a "phase 2" change, something we can do
> separately, after we get the initial changes outlined above working.

Should definitely be viewed as necessary. Whether you implement it in
two phases is entirely up to you, since one can always hold off using
it until it is complete.

A good idea - though with some complexities that need some ironing out
perhaps...

Matt