Re: [Condor-devel] JobRunCount
- Date: Fri, 10 Aug 2007 12:43:31 -0700
- From: Derek Wright <wright@xxxxxxxxxxx>
- Subject: Re: [Condor-devel] JobRunCount
On Aug 10, 2007, at 6:17 AM, Erik Paulson wrote:
> begin_execution can fail, and isn't a quick operation - I can imagine
> plenty of scenarios where without some sort of two-phase update to
> NumberJobExecuted it either goes up without a starter actually being
> spawned, or a starter is spawned but the job ad is not updated. Is that
> a big deal? An easier counter might be "NumberMinimumAttemptsAtStarter"
This RSC is exactly how the shadow knows to log an execute event to
the userlog. If we never get this RSC, the job didn't actually start
executing, and if we did, we're sure it's running. If it's good
enough for the user log, it's good enough for the purposes of this
counter attribute. ;)
The reason people care is if they have a job that's not idempotent
and they really don't want it to run more than once. If it got far
enough to generate this RSC, the job is going to exec(), and
therefore, do whatever thing it can only do once. So, people want to
use this in job policy expressions, so that, for example, if the
counter is already 1 and the job is idle (evicted, etc), it should be
held or removed, not requeued to be matched, launched, etc...
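To make the policy idea concrete, here is a rough Python sketch of the decision such an expression would encode. It is only an illustration, not schedd code: the function name policy_action, the returned action strings, and the max_launches parameter are all made up for this example, and the counter uses the NumberJobLaunches name from the proposal further down. The one borrowed fact is that JobStatus 1 means idle.

```python
IDLE = 1  # JobStatus value for an idle job


def policy_action(job_ad, max_launches=1):
    """Decide what to do with a job, given a dict standing in for its job ad.

    Hypothetical helper: returns "hold" for an idle job that has already
    been launched max_launches times, "requeue" for an idle job that has
    not, and "no_action" for a job that is not idle.
    """
    if job_ad.get("JobStatus") != IDLE:
        return "no_action"
    # A missing counter is treated as "never launched".
    launches = job_ad.get("NumberJobLaunches", 0)
    if launches >= max_launches:
        return "hold"
    return "requeue"


# An evicted (idle) job that already reached exec() once gets held
# instead of being matched and launched again.
evicted_job = {"JobStatus": IDLE, "NumberJobLaunches": 1}
action = policy_action(evicted_job)
```

Whether the job is held or removed in that case would be up to the user's policy; the sketch only picks "hold" to keep it short.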
On Aug 10, 2007, at 6:53 AM, Dan Bradley wrote:
> Have you considered "Times" in place of "Number"? Just a thought.
I have now, but I don't like it. ;) I think it'd be easy to confuse
with time-based counters, such as "TotalJobRunTime", etc. "Time" ==
seconds, "Number" == counter.
> Also, without looking up any documentation, I wouldn't have been sure
> whether NumberJobExecuted means the number of times the job _finished_
> executing or the number of times it started executing. That potential
> confusion is not the end of the world, but you could consider naming it
> NumberJobStarted instead.
Agreed. Briefly chatted with Todd just now and we think "Launch" is
more clear than "Execute", and we'll use that for both
NumberJobLaunches and NumberShadowLaunches.
Also, I remembered that someone (I forget who) was very interested in
a job attribute for the number of shadow exceptions, so I'm adding
that to the mix...
Here's the new (perhaps final) proposal:
NumberJobLaunches
-- incremented (by both shadows) every time the starter sends
CONDOR_begin_execution
NumberShadowLaunches
-- incremented (by the schedd) every time it spawns a shadow
NumberShadowExceptions
-- incremented (by the schedd) every time a shadow dies with EXCEPT().
NumberJobReconnects
-- incremented (by the new shadow) every time it successfully
reconnects, regardless of whether the starter was killed and restarted in
the meantime. So, there'd be no counter for "number of reconnects to
the current starter" in my proposal, though you could at least tell
how many starters are in the picture via NumberJobLaunches.
NumberCheckpointResumes and NumberJobRestarts
-- std universe only: punt for now and leave NumRestarts in there as-is. :(
Probably we don't need NumberJobRestarts, since you should
always be able to compute that yourself with (NumberJobLaunches -
NumberCheckpointResumes)...
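For anyone who wants to see the bookkeeping spelled out, here is a small Python sketch that tallies the proposed counters from a stream of events and applies the subtraction above. It is a toy model under stated assumptions: the event names (shadow_spawned, begin_execution, and so on) and the functions tally and job_restarts are invented for this example, and NumberCheckpointResumes is included only to show the arithmetic even though it is being punted on for now.

```python
# Hypothetical event names mapped to the attributes proposed above.
EVENT_TO_ATTR = {
    "shadow_spawned": "NumberShadowLaunches",      # schedd spawns a shadow
    "begin_execution": "NumberJobLaunches",        # starter sends CONDOR_begin_execution
    "shadow_exception": "NumberShadowExceptions",  # shadow dies with EXCEPT()
    "reconnect": "NumberJobReconnects",            # new shadow reconnects successfully
    "checkpoint_resume": "NumberCheckpointResumes",
}


def tally(events):
    """Count events per attribute, starting every counter at zero."""
    counts = {attr: 0 for attr in EVENT_TO_ATTR.values()}
    for event in events:
        counts[EVENT_TO_ATTR[event]] += 1
    return counts


def job_restarts(counts):
    """Apply the formula NumberJobLaunches - NumberCheckpointResumes."""
    return counts["NumberJobLaunches"] - counts["NumberCheckpointResumes"]


# A made-up history: the first shadow dies, a second one reconnects,
# and a third one launches the job again from a checkpoint.
history = [
    "shadow_spawned", "begin_execution", "shadow_exception",
    "shadow_spawned", "reconnect",
    "shadow_spawned", "begin_execution", "checkpoint_resume",
]
counts = tally(history)
restarts = job_restarts(counts)
```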
Any final objections?
Thanks,
-Derek