HTCondor Project List Archives




Re: [Condor-devel] JobRunCount



Here's an update on progress...

A) I noticed that we've already got a ton of "NumXXX" attributes, so trying to use "Number" is just inconsistent, even if it would have been better all along.


B) Upon further consideration, Dan's suggestion of "Starts" seems better than "Launches" (especially for non-native English speakers).


C) Four new counter attributes are now committed to CVS HEAD (so they'll be in version 6.9.5):

NumJobStarts
-- updated by the "new" shadow for most universes, by the starter for local universe, by the schedd for scheduler universe, and not (yet) at all for standard universe.

NumJobReconnects
-- updated by the "new" shadow for universes that support reconnect.

NumShadowExceptions
NumShadowStarts
-- both updated by the schedd for any universe that has a shadow (i.e. all except local and scheduler).


D) There's a new config knob: SHADOW_LAZY_QUEUE_UPDATE. There's a semantics vs. performance trade-off for the counters updated by the shadow, and there's no obviously correct approach -- it really depends on the needs of the site, so I added a config knob (named as closely to SHADOW_QUEUE_UPDATE_INTERVAL as possible for consistency).

The knob defaults to TRUE (lazy updates). In that mode the shadow immediately updates its own copy of the job classad in RAM (e.g. for the periodic user job policy expressions like periodic_hold), but the counter is not updated in the job queue until the next periodic queue update (by default, every 15 minutes).

If you set the knob to FALSE, then as soon as the shadow knows that the job started running or that it successfully reconnected, it connects to the schedd and does a blocking queue update to ensure the counter is written to disk.

So, lazy == true is much better for schedd scalability and performance, and is functionally pretty solid for the main use case for these counters (periodic user policy expressions). lazy == false is better for the semantics of the counters (e.g. if you were monitoring them via quill and really needed to know immediately), but is worse for schedd performance. The default is lazy == true to prevent pain for busy schedds. Since the counters are new, no one can be *depending* on them being updated immediately, so any site that wants that behavior can set the knob to FALSE.
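For a site that prefers the immediate semantics, the setting would look roughly like the fragment below. This is only a sketch: the knob names come from this message, the 900-second value simply restates the 15-minute default mentioned above (assuming the interval is given in seconds), and where the lines go (global vs. local config file) is up to the site.

```
# Sketch of a condor_config fragment; placement in a particular file is an assumption.

# Write NumJobStarts / NumJobReconnects to the job queue right away
# (blocking update to the schedd) instead of waiting for the periodic update.
SHADOW_LAZY_QUEUE_UPDATE = FALSE

# With the default (TRUE), the counters reach the job queue at this interval.
# 900 seconds = the 15-minute default described above.
SHADOW_QUEUE_UPDATE_INTERVAL = 900
```

Leaving SHADOW_LAZY_QUEUE_UPDATE at TRUE keeps the load on a busy schedd down, at the price of the job queue lagging behind the shadow's in-memory copy of the counters.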


E) NumJobStarts is initialized to 0 at submit time, both by condor_submit and by the SOAP interface. All of the other counters are simply undefined until the relevant event happens (reconnect, shadow exception, etc).
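The difference between "initialized to 0" and "undefined" matters when the counters are used in policy expressions. Below is a rough sketch of a submit description file that uses both kinds of counter; the executable name and the thresholds are made up for illustration, and the explicit undefined check is a defensive habit rather than something this message requires.

```
# Hypothetical submit description file; my_job.sh and the thresholds are invented.
universe   = vanilla
executable = my_job.sh

# NumJobStarts starts at 0, so it can be compared directly.
# NumShadowExceptions is undefined until the first exception, so check for
# that first; "=!=" is the ClassAd "is not identical to" operator.
periodic_hold = (NumJobStarts > 5) || (NumShadowExceptions =!= UNDEFINED && NumShadowExceptions > 3)

queue
```

Before any shadow exception has happened, the second clause evaluates to FALSE rather than UNDEFINED, so the expression as a whole stays well-defined.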


F) Everything is documented in the 6.9.5 manual.


G) NumJobStarts isn't supported for PVM universe, and I doubt I'll ever care enough to do anything about that.


H) Standard universe could still use a few more counters which don't exist yet: NumCheckpointResumes (and maybe NumJobRestarts -- though, as per my previous email, you could compute this with NumJobStarts - NumCheckpointResumes).


That's the news for now. The only real work left to do is for standard universe jobs -- getting NumJobStarts updated correctly, and dealing with (H).


Cheers,
-Derek