HTCondor Project List Archives




Re: [Condor-devel] Job Suspension



On Tue, Apr 26, 2011 at 08:46:10AM -0500, Brian Bockelman wrote:
> Hi folks,
> 
> I have a few observations about job suspension:
> 1) Whether or not the job is currently suspended is not recorded in the ClassAd.
> 2) When a job is transitioned from running to suspended, LastSuspensionTime is updated.  However, due to (1), you don't know whether the job has since been un-suspended.
> 3) The starter updates the remote wall time, but not the suspended time.
> 
> I would like to preempt jobs based upon the non-suspended running time.  However, it doesn't appear that this is possible in the current setup.
> 
> Why is the suspension state not reflected in the job's classad?  It seems like a very important thing to note.

I had implemented something like this for the standard universe (where you
can actually know this information via condor_q). Here is what I remember:

The canonical job ad is held by the schedd and this is what you will see
when you do a condor_q. So, in order to reflect the suspension state to a
user, we need the information to go from the starter, then to the shadow,
then to the schedd.

There is a scalability issue when the shadow connects to a busy schedd to
update the suspension attributes. Since suspension state can rapidly
cycle on execution machines in an unbounded manner, generating an
arbitrary rate of suspension events, and there could easily be tens of
thousands of running jobs, the schedd can become extremely busy _simply_
due to suspension event updates from the shadows! Eventually, the shadows
start timing out, excepting, and restarting the job. Throughput goes down
terribly when many machines coincidentally synchronize their suspensions.
Since suspensions are a result of job policies often written by humans,
this synchronization happens more than you'd think.
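
To put rough numbers on the concern above, here is a back-of-the-envelope
sketch. The function names and the figures (20,000 running jobs, one
suspend/resume cycle per job per minute) are illustrative assumptions, not
measurements and not part of any HTCondor API. It assumes each
suspend/resume cycle produces two updates (one on suspend, one on
unsuspend) and that every update reaches the schedd.

```python
def schedd_update_rate(running_jobs, cycles_per_job_per_min):
    """Average suspension updates per second arriving at the schedd.

    Assumes every suspend/resume cycle yields two updates and that
    updates are spread evenly over time (hypothetical model).
    """
    events_per_cycle = 2  # one suspend event, one unsuspend event
    return running_jobs * cycles_per_job_per_min * events_per_cycle / 60.0


def synchronized_burst(running_jobs, fraction_in_sync):
    """Updates arriving at once if a fraction of jobs suspend together."""
    return int(running_jobs * fraction_in_sync)


if __name__ == "__main__":
    avg = schedd_update_rate(running_jobs=20_000, cycles_per_job_per_min=1)
    burst = synchronized_burst(running_jobs=20_000, fraction_in_sync=0.5)
    print(f"average load: {avg:.0f} updates/sec")
    print(f"synchronized burst: {burst} updates at once")
```

The average is already large under these assumptions, and the burst case is
the one that matters when policies line suspensions up across machines.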

So, if we can solve the scaling problem of the suspension updates, then
suspension can move to be a true job state, and things would be better.
But for now, it is a second class citizen.
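
For illustration only: if a full record of suspend and unsuspend times were
available for a job, the non-suspended running time asked about in the
original question is just wall time minus accumulated suspended time. The
sketch below assumes such a record exists as a list of (suspend, resume)
timestamp pairs, with resume set to None while the job is still suspended.
That record and the function name are hypothetical; as noted in the quoted
message, the job ad does not currently carry this information.

```python
def non_suspended_runtime(start, now, suspend_intervals):
    """Return seconds the job actually ran between start and now.

    suspend_intervals: list of (suspend_time, resume_time) pairs, where
    resume_time is None if the job is still suspended (hypothetical input).
    """
    suspended = 0
    for suspend_time, resume_time in suspend_intervals:
        # Treat an open interval as lasting until 'now'.
        end = now if resume_time is None else min(resume_time, now)
        begin = max(suspend_time, start)
        suspended += max(0, end - begin)
    return (now - start) - suspended


if __name__ == "__main__":
    # Job started at t=0; suspended from 10 to 20, and again from 50 onward.
    print(non_suspended_runtime(0, 100, [(10, 20), (50, None)]))
```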

-pete