HTCondor Project List Archives

Re: [Condor-devel] JobRunCount

Date: Wed, 15 Aug 2007 00:33:52 -0700
From: Derek Wright <wright@xxxxxxxxxxx>
Subject: Re: [Condor-devel] JobRunCount

Here's an update on progress...

A) I noticed that we've already got a ton of "NumXXX" attributes, so trying to use "Number" is just inconsistent, even if it would have been better all along.

B) Upon further consideration, Dan's suggestion of "Starts" seems better than "Launches" (especially for non-native English speakers).

C) 4 new counter attributes are now committed to CVS HEAD (therefore, they'll be in version 6.9.5):

NumJobStarts
-- updated by the "new" shadow for most universes, the starter for local universe, the schedd for scheduler universe, and not (yet) at all for standard universe.

NumJobReconnects
-- updated by the "new" shadow for universes that support reconnect.

NumShadowExceptions
NumShadowStarts

-- both updated by the schedd for any universes that have a shadow (e.g. all except local + scheduler).
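
To make the ownership rules in (C) easier to scan, here is a minimal Python sketch of them as lookup functions. The function names and the lowercase universe labels are made up for this illustration and are not HTCondor identifiers; the "pvm" case comes from (G), and treating every other universe as shadow-updated is an assumption based on the phrase "most universes".

```python
# Illustrative only: function names and universe labels are hypothetical.
NO_SHADOW_UNIVERSES = {"local", "scheduler"}


def num_job_starts_updater(universe):
    """Return which daemon updates NumJobStarts, or None if nothing does."""
    if universe == "local":
        return "starter"
    if universe == "scheduler":
        return "schedd"
    if universe in ("standard", "pvm"):
        # Not (yet) updated for standard universe; not supported for PVM.
        return None
    # Assumption: every remaining universe is covered by the "new" shadow.
    return "shadow"


def shadow_counter_updater(universe):
    """Return who updates NumShadowStarts and NumShadowExceptions."""
    # Only universes that have a shadow get these counters at all.
    return None if universe in NO_SHADOW_UNIVERSES else "schedd"
```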

D) There's a new config knob: SHADOW_LAZY_QUEUE_UPDATE. There's a semantics vs. performance trade-off for the counters updated by the shadow, and there's no obviously correct approach -- it really depends on the needs of the site, so I added a config knob (named as closely to SHADOW_QUEUE_UPDATE_INTERVAL as possible for consistency).

This defaults to TRUE (lazy updates), which means that while the shadow immediately updates its own copy of the job classad in RAM (e.g. for the periodic user job policy expressions like periodic_hold, etc), the counter is not updated in the job queue until the next periodic queue update (defaults to every 15 minutes). If you set the "lazy" knob to FALSE, as soon as the shadow knows that the job started running or that it successfully reconnected, the shadow will connect to the schedd and do a blocking queue update to ensure the counter is written to disk.

So, lazy == true is much better for schedd scalability and performance, and is functionally pretty solid for the main use case for these counters (periodic user policy expressions). lazy == false is better for the semantics of the counters (e.g. if you were monitoring these via quill and really needed to know immediately), but is worse for schedd performance. The default is lazy == true to prevent pain for busy schedds. Since the counters are new, no one can be *depending* on them getting updated immediately, so if they want that behavior, they can turn their config knob accordingly.
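
The difference between the two settings can be illustrated with a toy Python model. This is a sketch of the behavior described in (D), not HTCondor code: the class, its fields, and its method names are all invented for the example, and the "queue" is just a dictionary standing in for the schedd's job queue.

```python
class ToyShadow:
    """Toy model of lazy vs. immediate counter updates (hypothetical names)."""

    def __init__(self, lazy=True):
        self.lazy = lazy
        self.ram_ad = {"NumJobStarts": 0}    # shadow's in-memory copy of the job ad
        self.queue_ad = {"NumJobStarts": 0}  # stand-in for the schedd's job queue
        self.queue_writes = 0                # how often the schedd was contacted

    def _flush(self):
        # Copy the in-memory counters to the queue; this is the costly step.
        self.queue_ad.update(self.ram_ad)
        self.queue_writes += 1

    def job_started(self):
        # The in-memory copy is always updated at once, so policy
        # expressions evaluated by the shadow see the new value.
        self.ram_ad["NumJobStarts"] += 1
        if not self.lazy:
            self._flush()  # lazy == false: push to the queue right away

    def periodic_update(self):
        # Stands in for the periodic queue update (every 15 minutes by default).
        self._flush()


lazy_shadow = ToyShadow(lazy=True)
lazy_shadow.job_started()
stale_value = lazy_shadow.queue_ad["NumJobStarts"]  # queue lags behind RAM
lazy_shadow.periodic_update()

eager_shadow = ToyShadow(lazy=False)
eager_shadow.job_started()
```

In the lazy case the queue copy stays stale until `periodic_update()` runs, and no extra write is made at job start; in the non-lazy case every start costs one write to the queue.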

E) NumJobStarts is initialized to 0 by condor_submit and via the SOAP interface when you try to submit jobs. All of the other counters are simply undefined until the relevant event happens (reconnect, shadow exception, etc).
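
Anything that reads these counters outside of Condor has to cope with the undefined case from (E). A minimal Python sketch, assuming the job ad has already been loaded into a plain dictionary (the helper names and the dictionary representation are assumptions for this example, not a Condor API):

```python
def counter(job_ad, name):
    """Return the counter value, or None when the attribute is undefined."""
    return job_ad.get(name)


def too_many_starts(job_ad, limit):
    """Rough analogue of a periodic policy check on NumJobStarts."""
    starts = counter(job_ad, "NumJobStarts")
    # Treat an undefined counter as "not over the limit" rather than as 0.
    return starts is not None and starts > limit


# A freshly submitted job: only NumJobStarts is present, set to 0.
fresh_job = {"NumJobStarts": 0}
```

Returning `None` instead of silently substituting 0 keeps "never happened yet" distinguishable from "not recorded", which matters for NumJobReconnects, NumShadowStarts, and NumShadowExceptions.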



F) Everything is documented in the 6.9.5 manual.

G) NumJobStarts isn't supported for PVM universe, and I doubt I'll ever care enough to do anything about that.

H) Standard universe could still use a few more counters which don't exist yet: NumCheckpointResumes (and maybe NumJobRestarts -- though, as per my previous email, you could compute this with NumJobStarts - NumCheckpointResumes).
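
The subtraction mentioned in (H) is small enough to sketch in Python. NumCheckpointResumes is only proposed here and does not exist yet, so this is hypothetical; the function name and the dictionary form of the job ad are likewise assumptions for the example.

```python
def num_job_restarts(job_ad):
    """Compute NumJobStarts - NumCheckpointResumes, or None if either is undefined."""
    starts = job_ad.get("NumJobStarts")
    resumes = job_ad.get("NumCheckpointResumes")  # proposed attribute, not yet implemented
    if starts is None or resumes is None:
        return None
    return starts - resumes


# Example: 7 starts, of which 4 resumed from a checkpoint.
restarts = num_job_restarts({"NumJobStarts": 7, "NumCheckpointResumes": 4})
```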

That's the news for now. The only real work left to do is for standard universe jobs -- getting NumJobStarts updated correctly, and dealing with (H).



Cheers,
-Derek

Follow-Ups:
- Re: [Condor-devel] [condor-staff] JobRunCount
  - From: Jaime Frey

References:
- Re: [Condor-devel] JobRunCount
  - From: Derek Wright
- Re: [Condor-devel] JobRunCount
  - From: Erik Paulson
- Re: [Condor-devel] JobRunCount
  - From: Derek Wright
