HTCondor Project List Archives




Re: [Condor-devel] JobRunCount




On Aug 10, 2007, at 6:17 AM, Erik Paulson wrote:

begin_execution can fail, and isn't a quick operation - I can imagine
plenty of scenarios where without some sort of two-phase update to
NumberJobExecuted it either goes up without a starter actually being
spawned, or a starter is spawned but the job ad is not updated. Is that a big deal? An easier counter might be "NumberMinimumAttemptsAtStarter"

This RSC is exactly how the shadow knows to log an execute event to the userlog. If we never get this RSC, the job didn't actually start executing, and if we did, we're sure it's running. If it's good enough for the user log, it's good enough for the purposes of this counter attribute. ;)

The reason people care is if they have a job that's not idempotent and they really don't want it to run more than once. If it got far enough to generate this RSC, the job is going to exec(), and therefore, do whatever thing it can only do once. So, people want to use this in job policy expressions, so that, for example, if the counter is already 1 and the job is idle (evicted, etc), it should be held or removed, not requeued to be matched, launched, etc...
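
As a rough illustration of the kind of policy check described above, here is a minimal Python sketch. It is not actual job policy expression syntax; the function name and the dict-based job ad are hypothetical, and it assumes the counter ends up being called NumberJobLaunches and that an idle job has JobStatus == 1.

```python
# Hypothetical sketch of the "run at most once" policy described above.
# Assumptions: the job ad is modeled as a plain dict, the counter is named
# NumberJobLaunches, and JobStatus == 1 means the job is idle.
IDLE = 1


def should_hold_or_remove(job_ad: dict) -> bool:
    """Return True if an idle job has already been launched at least once."""
    launches = job_ad.get("NumberJobLaunches", 0)  # a missing counter counts as 0
    return job_ad.get("JobStatus") == IDLE and launches >= 1
```

A job that was evicted after its first launch would match this check and could be held or removed instead of being matched again; a job that has never launched, or one that is still running, would not.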


On Aug 10, 2007, at 6:53 AM, Dan Bradley wrote:
Have you considered "Times" in place of "Number"?  Just a thought.

I have now, but I don't like it. ;) I think it'd be easy to confuse with time-based counters, such as "TotalJobRunTime", etc. "Time" == seconds, "Number" == counter.

Also, without looking up any documentation, I wouldn't have been sure
whether NumberJobExecuted means the number of times the job _finished_
executing or the number of times it started executing.  That potential
confusion is not the end of the world, but you could consider naming it
NumberJobStarted instead.

Agreed. Briefly chatted with Todd just now and we think "Launch" is more clear than "Execute", and we'll use that for both NumberJobLaunches and NumberShadowLaunches.

Also, I remembered that someone (I forget who) was very interested in a job attribute for the number of shadow exceptions, so I'm adding that to the mix...


Here's the new (perhaps final) proposal:

NumberJobLaunches
-- incremented (by both shadows) every time the starter sends CONDOR_begin_execution

NumberShadowLaunches
-- incremented (by the schedd) every time it spawns a shadow

NumberShadowExceptions
-- incremented (by the schedd) every time a shadow dies with EXCEPT().

NumberJobReconnects
-- incremented (by the new shadow) every time it successfully reconnects, regardless of whether the starter was killed and restarted in the meantime. So, there'd be no counter for "number of reconnects to the current starter" in my proposal, though you could at least tell how many starters are in the picture via NumberJobLaunches.

NumberCheckpointResumes and NumberJobRestarts
-- std universe only: punt for now and leave NumRestarts in there as-is. :( Probably we don't need NumberJobRestarts, since you should always be able to compute that yourself with (NumberJobLaunches - NumberCheckpointResumes)...
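
To make the bookkeeping concrete, here is a small Python sketch of how the counters above might be tallied and how the restart count could be derived from them. The event names are hypothetical labels invented for this sketch, not real Condor identifiers, and NumberCheckpointResumes is included only to show the subtraction, since it is being punted on for now.

```python
from collections import Counter

# Hypothetical event labels mapped to the proposed attribute each one bumps.
EVENT_TO_ATTR = {
    "begin_execution": "NumberJobLaunches",          # starter sent CONDOR_begin_execution
    "shadow_spawned": "NumberShadowLaunches",        # schedd spawned a shadow
    "shadow_except": "NumberShadowExceptions",       # shadow died with EXCEPT()
    "reconnect_ok": "NumberJobReconnects",           # new shadow reconnected successfully
    "checkpoint_resume": "NumberCheckpointResumes",  # std universe only, deferred for now
}


def tally(events):
    """Count how many times each proposed attribute would be incremented."""
    counters = Counter()
    for event in events:
        counters[EVENT_TO_ATTR[event]] += 1
    return counters


def job_restarts(counters):
    """Derived value: NumberJobLaunches - NumberCheckpointResumes."""
    return counters["NumberJobLaunches"] - counters["NumberCheckpointResumes"]
```

Because the restart count falls out of two existing counters, there would be no need to store it as a separate attribute.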


Any final objections?

Thanks,
-Derek