Re: [Condor-devel] JobRunCount
- Date: Wed, 15 Aug 2007 00:33:52 -0700
- From: Derek Wright <wright@xxxxxxxxxxx>
- Subject: Re: [Condor-devel] JobRunCount
Here's an update on progress...
A) I noticed that we've already got a ton of "NumXXX" attributes, so
trying to use "Number" is just inconsistent, even if it would have
been better all along.
B) Upon further consideration, Dan's suggestion of "Starts" seems
better than "Launches" (especially for non-native English speakers).
C) 4 new counter attributes are now committed to CVS HEAD (therefore,
they'll be in version 6.9.5):
NumJobStarts
-- updated by the "new" shadow for most universes, the starter for
local universe, the schedd for scheduler universe, and not (yet) at
all for standard universe.
NumJobReconnects
-- updated by the "new" shadow for universes that support reconnect.
NumShadowExceptions
NumShadowStarts
-- both updated by the schedd for any universe that has a shadow
(i.e. all except local and scheduler).
D) There's a new config knob: SHADOW_LAZY_QUEUE_UPDATE. There's a
semantics vs. performance trade-off for the counters updated by the
shadow, and there's no obviously correct approach. It really depends
on the needs of the site, so I added a config knob (named as close
to SHADOW_QUEUE_UPDATE_INTERVAL as possible, for consistency).

The knob defaults to TRUE (lazy updates). In that mode, the shadow
immediately updates its own copy of the job classad in RAM (e.g. for
the periodic user job policy expressions like periodic_hold), but
the counter is not updated in the job queue until the next periodic
queue update (by default, every 15 minutes). If you set the knob to
FALSE, then as soon as the shadow knows that the job started running
or that it successfully reconnected, the shadow connects to the
schedd and does a blocking queue update to ensure the counter is
written to disk.

So, lazy == true is much better for schedd scalability and
performance, and is functionally solid for the main use case for
these counters (periodic user policy expressions). lazy == false is
better for the semantics of the counters (e.g. if you were
monitoring them via quill and really needed to know immediately),
but is worse for schedd performance. The default is lazy == true to
prevent pain for busy schedds. Since the counters are new, no one
can be *depending* on them getting updated immediately, so any site
that wants that behavior can set the knob to FALSE.
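
For a site that does want the counters written to the job queue
immediately, a minimal sketch of the relevant config fragment might
look like the following. The knob names and defaults are the ones
described above; the comments are only illustrative.

```
# Write NumJobStarts / NumJobReconnects to the job queue as soon as
# the shadow knows about the event. This costs the schedd a blocking
# queue update per event. The default is TRUE (lazy).
SHADOW_LAZY_QUEUE_UPDATE = FALSE

# Interval between periodic queue updates, in seconds. 900 (15
# minutes) is the default and is shown here only for reference.
SHADOW_QUEUE_UPDATE_INTERVAL = 900
```

Leaving SHADOW_LAZY_QUEUE_UPDATE at its default and shortening the
interval instead would be a middle ground, at the price of more
frequent periodic updates for every running job.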
E) NumJobStarts is initialized to 0 by condor_submit and by the SOAP
interface when you submit jobs. All of the other counters are
simply undefined until the relevant event happens (reconnect, shadow
exception, etc.).
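
To illustrate the main use case (periodic user policy expressions),
here is a sketch of a submit description file that uses the new
counters. The executable name and the thresholds are made up for the
example, and it assumes the counters are visible wherever the
periodic expressions get evaluated. Since only NumJobStarts is
initialized to 0, the expression on NumShadowExceptions guards
against the attribute still being undefined.

```
# Hypothetical job; "my_job" is a placeholder executable.
universe   = vanilla
executable = my_job

# Put the job on hold once it has been started more than 3 times.
periodic_hold = NumJobStarts > 3

# Remove the job after more than 10 shadow exceptions. The =!= test
# keeps the expression well-defined before the first exception,
# while the attribute is still undefined.
periodic_remove = (NumShadowExceptions =!= UNDEFINED) && (NumShadowExceptions > 10)

queue
```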
F) Everything is documented in the 6.9.5 manual.
G) NumJobStarts isn't supported for PVM universe, and I doubt I'll
ever care enough to do anything about that.
H) Standard universe could still use a few more counters which don't
exist yet: NumCheckpointResumes (and maybe NumJobRestarts -- though,
as per my previous email, you could compute this with NumJobStarts -
NumCheckpointResumes).
That's the news for now. The only real work left to do is for
standard universe jobs -- getting NumJobStarts updated correctly, and
dealing with (H).
Cheers,
-Derek