HTCondor Project List Archives




Re: [Condor-devel] Job Suspension



On Apr 26, 2011, at 10:40 AM, Peter Keller wrote:

> On Tue, Apr 26, 2011 at 08:46:10AM -0500, Brian Bockelman wrote:
>> Hi folks,
>> 
>> I have a few observations about job suspension:
>> 1) Whether or not the job is currently suspended is not recorded in the ClassAd.
>> 2) When a job is transitioned from running to suspended, LastSuspensionTime is updated.  However, due to (1), you don't know whether the job has since been un-suspended.
>> 3) The starter updates the remote wall time, but not the suspended time.
>> 
>> I would like to preempt jobs based upon the non-suspended running time.  However, it doesn't appear that this is possible in the current setup.
>> 
>> Why is the suspension state not reflected in the job's classad?  It seems like a very important thing to note.
> 
> I had implemented something like this for the standard universe (where you
> can actually know this information via condor_q). Here is what I remember:
> 
> The canonical job ad is held by the schedd and this is what you will see
> when you do a condor_q. So, in order to reflect the suspension state to a
> user, we need the information to go from the starter, then to the shadow,
> then to the schedd.
> 
> There is a scalability issue when the shadow connects to a busy schedd to
> update the suspension attributes. Since suspension state can rapidly
> cycle on execution machines in an unbounded manner, generating an
> arbitrary rate of suspension events, and there could easily be tens of
> thousands of running jobs, the schedd can become extremely busy _simply_
> due to suspension event updates from the shadows! Eventually, the shadows
> start timing out, excepting, and restarting the job. Throughput goes down
> terribly when many machines coincidentally synchronize their suspensions.
> Since suspensions are a result of job policies often written by humans,
> this synchronization happens more than you'd think.
> 
> So, if we can solve the scaling problem of the suspension updates, then
> suspension can move to be a true job state, and things would be better.
> But for now, it is a second class citizen.
> 
> -pete

I'm not sure if I buy this argument.  We already inform the schedd when the job is suspended (#1 above).  I think it would make sense to also inform the schedd when the job is unsuspended.  At worst, we're doubling the rate; my gut is that "suspend" events are more bursty than unsuspend events, so we're probably not even doubling the rate.
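
To make the bookkeeping concrete, here is a minimal sketch of what having both the suspend and the unsuspend event would allow: the current suspension state and the non-suspended running time both fall out directly. This is not HTCondor code. The class, its fields, and its method names are all hypothetical, and timestamps are plain seconds supplied by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuspensionTracker:
    """Hypothetical per-job bookkeeping; not part of HTCondor."""
    start_time: float                        # when the job started running
    suspended_since: Optional[float] = None  # set only while suspended
    total_suspended: float = 0.0             # completed suspension intervals

    def suspend(self, now: float) -> None:
        # Ignore a repeated suspend so an interval is never counted twice.
        if self.suspended_since is None:
            self.suspended_since = now

    def unsuspend(self, now: float) -> None:
        # Close the open interval, if any, and add it to the running total.
        if self.suspended_since is not None:
            self.total_suspended += now - self.suspended_since
            self.suspended_since = None

    def is_suspended(self) -> bool:
        return self.suspended_since is not None

    def active_time(self, now: float) -> float:
        # Wall time minus all suspended time, including an open interval.
        suspended = self.total_suspended
        if self.suspended_since is not None:
            suspended += now - self.suspended_since
        return (now - self.start_time) - suspended


tracker = SuspensionTracker(start_time=0.0)
tracker.suspend(100.0)
tracker.unsuspend(160.0)
tracker.suspend(300.0)
print(tracker.is_suspended(), tracker.active_time(400.0))
```

With only the suspend timestamp available, neither is_suspended() nor active_time() could be computed, which is the gap described in observations 1 through 3.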

Brian
