Re: [Condor-devel] JobRunCount
- Date: Wed, 15 Aug 2007 00:33:52 -0700
 
- From: Derek Wright <wright@xxxxxxxxxxx>
 
- Subject: Re: [Condor-devel] JobRunCount
 
Here's an update on progress...
A) I noticed that we've already got a ton of "NumXXX" attributes, so  
trying to use "Number" is just inconsistent, even if it would have  
been better all along.
B) Upon further consideration, Dan's suggestion of "Starts" seems  
better than "Launches" (especially for non-native English speakers).
C) 4 new counter attributes are now committed to CVS HEAD (therefore,  
they'll be in version 6.9.5):
NumJobStarts
-- updated by the "new" shadow for most universes, the starter for
local universe, the schedd for scheduler universe, and not (yet) at
all for standard universe.
NumJobReconnects
-- updated by the "new" shadow for universes that support reconnect.
NumShadowExceptions
NumShadowStarts
-- both updated by the schedd for any universe that has a shadow
(i.e. all except local and scheduler).
D) There's a new config knob: SHADOW_LAZY_QUEUE_UPDATE.  There's a  
semantics vs. performance trade-off for the counters updated by the  
shadow, and there's no obviously correct approach -- it really  
depends on the needs of the site, so I added a config knob (named as  
closely to SHADOW_QUEUE_UPDATE_INTERVAL as possible for  
consistency).  This defaults to TRUE (lazy updates), which means that  
while the shadow immediately updates its own copy of the job classad  
in RAM (e.g. for the periodic user job policy expressions like  
periodic_hold, etc), the counter is not updated in the job queue  
until the next periodic queue update (defaults to every 15 minutes).   
If you set the "lazy" knob to FALSE, as soon as the shadow knows that  
the job started running or that it successfully reconnected, the  
shadow will connect to the schedd and do a blocking queue update to  
ensure the counter is written to disk.  So, lazy == true is much  
better for schedd scalability and performance, and is functionally  
pretty solid for the main use case for these counters (periodic user  
policy expressions).  lazy == false is better for the semantics of  
the counters (e.g. if you were monitoring these via quill and really  
needed to know immediately), but is worse for schedd performance.   
The default is lazy == true to prevent pain for busy schedds.  Since
the counters are new, no one can be *depending* on them getting
updated immediately, so any site that wants that behavior can set
the config knob accordingly.
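
To make the trade-off in (D) concrete, here is a toy Python model of the
two modes.  It is only a sketch, not Condor code: ToySchedd, ToyShadow,
and all of their methods are made-up names for illustration, and the
"queue" is just a dict.  It shows that with lazy == true the shadow's
own copy of the ad changes right away while the queue copy waits for
the periodic update, and with lazy == false every start costs one
immediate update to the schedd.

```python
class ToySchedd:
    """Hypothetical stand-in for the schedd's job queue."""

    def __init__(self):
        self.queue_ad = {"NumJobStarts": 0}  # copy of the job ad in the queue
        self.updates = 0                     # how many queue updates we served

    def update(self, attrs):
        self.queue_ad.update(attrs)
        self.updates += 1


class ToyShadow:
    """Hypothetical stand-in for a shadow with the lazy knob on or off."""

    def __init__(self, schedd, lazy=True):
        self.schedd = schedd
        self.lazy = lazy
        self.local_ad = dict(schedd.queue_ad)  # the shadow's in-RAM copy
        self.dirty = {}                        # attributes not yet in the queue

    def job_started(self):
        # The in-RAM copy is always updated immediately.
        self.local_ad["NumJobStarts"] += 1
        self.dirty["NumJobStarts"] = self.local_ad["NumJobStarts"]
        if not self.lazy:
            self.flush()  # lazy == false: push to the schedd right now

    def periodic_update(self):
        # Models the periodic queue update (every 15 minutes by default).
        self.flush()

    def flush(self):
        if self.dirty:
            self.schedd.update(self.dirty)
            self.dirty = {}


lazy_schedd = ToySchedd()
lazy_shadow = ToyShadow(lazy_schedd, lazy=True)
lazy_shadow.job_started()
lazy_before = lazy_schedd.queue_ad["NumJobStarts"]  # queue copy still stale
lazy_shadow.periodic_update()
lazy_after = lazy_schedd.queue_ad["NumJobStarts"]   # queue copy caught up

eager_schedd = ToySchedd()
eager_shadow = ToyShadow(eager_schedd, lazy=False)
eager_shadow.job_started()
eager_now = eager_schedd.queue_ad["NumJobStarts"]   # written immediately
```

In this model a policy expression evaluated against local_ad sees the
new value in either mode; only an outside observer reading queue_ad
notices the difference.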
E) NumJobStarts is initialized to 0 at submit time, both by
condor_submit and by the SOAP interface.  All of the other counters are
simply undefined until the relevant event happens (reconnect, shadow
exception, etc.).
F) Everything is documented in the 6.9.5 manual.
G) NumJobStarts isn't supported for PVM universe, and I doubt I'll  
ever care enough to do anything about that.
H) Standard universe could still use a few more counters which don't  
exist yet: NumCheckpointResumes (and maybe NumJobRestarts -- though,  
as per my previous email, you could compute this with NumJobStarts -  
NumCheckpointResumes).
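
As a sketch of the subtraction in (H): the Python function below treats
a job ad as a plain dict and derives a restart count from the two
counters.  NumCheckpointResumes does not exist yet, the function name is
made up, and treating an undefined counter as 0 is an assumption of this
example, not something Condor does for you.

```python
def num_job_restarts(job_ad):
    """Derive restarts as NumJobStarts - NumCheckpointResumes.

    job_ad is a plain dict standing in for a job classad.  A counter
    that is missing (undefined) is assumed to be 0 for this sketch.
    """
    starts = job_ad.get("NumJobStarts", 0)
    resumes = job_ad.get("NumCheckpointResumes", 0)
    return starts - resumes


# Five starts, three of which resumed from a checkpoint.
restarts = num_job_restarts({"NumJobStarts": 5, "NumCheckpointResumes": 3})
```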
That's the news for now.  The only real work left to do is for  
standard universe jobs -- getting NumJobStarts updated correctly, and  
dealing with (H).
Cheers,
-Derek