Re: [Condor-devel] JobRunCount
- Date: Fri, 10 Aug 2007 03:08:10 -0700
- From: Derek Wright <wright@xxxxxxxxxxx>
- Subject: Re: [Condor-devel] JobRunCount
(For those new to this thread, please see the forwarded message at
the bottom for background -- basically, various counters in the job
classad regarding shadow and starter activity are either misleading
or missing, and we need to solve it).
Here are my concerns:
a) the names are all inconsistent (JobFooCount vs NumFoo vs. JobNumFoo)
b) Sanjay obviously needs a real counter for Foo = "Job started
executing"
c) we should probably keep something for Foo = "Another shadow was
spawned" (what "JobRunCount" currently gives you).
d) we should probably add something for Foo = "Successfully reconnected"
e) std universe should have Foo = "Resumed from checkpoint". (b)
should also be incremented when this happens, though perhaps we
should also have a std-universe only Foo = "Job (re)started from the
beginning."
Re (a): It seems like we're starting to move towards using "Job" at
the front of job attrs like this, but I'm not convinced that's a good
idea. If they're in a job ad, it's obvious they're about a job. And
I don't see how any of this would make sense anywhere except a job
classad.
In general for the names, I think we're better served without so many
abbreviations. I'd rather the attribute names were slightly longer
so they make more immediate sense to the humans interacting with them.
Therefore, here's my current straw-man proposal:
(b) NumberJobExecuted
-- incremented (by both shadows) every time the starter sends
CONDOR_begin_execution
(c) NumberShadowSpawned
-- incremented (by the schedd) every time it spawns a shadow
(d) NumberJobReconnected
-- incremented (by the new shadow) every time it successfully
reconnects, regardless of whether the starter was killed and restarted
in the meantime. So there'd be no counter for "number of reconnects to
the current starter" in my proposal, though you could at least tell
how many starters are in the picture via NumberJobExecuted.
(e) NumberResumedFromCheckpoint and NumberJobRestarted
-- std universe only: punt for now and leave NumRestarts in there
as-is. :(
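
To make the proposed semantics concrete, here's a rough sketch in Python. This is not Condor code: the dict standing in for a job ad and the on_* function names are made up for illustration. Only the three attribute names and who increments them come from the proposal above. It walks through the two cases from Sanjay's original question, where the shadow count alone can't tell them apart.

```python
# Hypothetical model of the straw-man counters; none of this is Condor code.

def new_job_ad():
    """A plain dict standing in for a job classad."""
    return {
        "NumberJobExecuted": 0,
        "NumberShadowSpawned": 0,
        "NumberJobReconnected": 0,
    }

def on_shadow_spawned(ad):
    """Schedd side: a shadow was spawned for this job."""
    ad["NumberShadowSpawned"] += 1

def on_begin_execution(ad):
    """Shadow side: the starter sent CONDOR_begin_execution."""
    ad["NumberJobExecuted"] += 1

def on_reconnect(ad):
    """New shadow side: it successfully reconnected to the job."""
    ad["NumberJobReconnected"] += 1

# Case A: the job starts, the shadow goes away, and a new shadow
# reconnects to the same, still-running starter.
case_a = new_job_ad()
on_shadow_spawned(case_a)
on_begin_execution(case_a)
on_shadow_spawned(case_a)
on_reconnect(case_a)

# Case B: the job starts, is lost, and is started again from the
# beginning under a second shadow.
case_b = new_job_ad()
on_shadow_spawned(case_b)
on_begin_execution(case_b)
on_shadow_spawned(case_b)
on_begin_execution(case_b)
```

In both cases NumberShadowSpawned ends up at 2, which is all JobRunCount tells you today. The difference shows up in the other two counters: case A has NumberJobExecuted = 1 and NumberJobReconnected = 1, while case B has NumberJobExecuted = 2 and NumberJobReconnected = 0.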
Short term, we leave JobRunCount a synonym for NumberShadowSpawned,
but call it deprecated and encourage people to use one of these new
Number* attrs instead.
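
One way to picture the synonym is as a lookup-time alias, sketched below in Python. This is purely an assumption for illustration; the proposal doesn't say how the synonym would be implemented (it could just as well mean writing both attributes into the ad), and DEPRECATED_ALIASES and lookup are made-up names.

```python
# Hypothetical sketch: resolve a deprecated attribute name to its
# replacement at lookup time. Not how Condor necessarily does it.

DEPRECATED_ALIASES = {"JobRunCount": "NumberShadowSpawned"}

def lookup(ad, name):
    """Return the value for name, following a deprecated alias if any."""
    canonical = DEPRECATED_ALIASES.get(name, name)
    return ad.get(canonical)

job_ad = {"NumberShadowSpawned": 3}
old_value = lookup(job_ad, "JobRunCount")          # resolved via the alias
new_value = lookup(job_ad, "NumberShadowSpawned")  # direct lookup
```

Both lookups read the same underlying counter, so existing queries on JobRunCount would keep working while people migrate.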
It's just after 3am, so I'm not the most coherent I can be, but I
don't think this is insane. I believe it'll provide all the counters
we know we need now and can foresee needing anytime soon. If there
are any last minute counter-proposals for the names and semantics,
let's hear them, but we need this done ASAP. The code is not that
hard (and either Todd or I will get to it before this weekend); all
we really need is to finalize the names.
The most obvious counter-proposal is s/Number/Num/ for all of the
above. If someone feels strongly that saving the 3 chars per
attribute is important, go ahead and try to convince me. But the
space implications are 6 bytes for your usual vanilla job, 9 for one
that had network troubles and reconnected, and at most 12 for a std
universe job. I think that's a tiny price to pay (even on a huge job
queue) for better readability. OTOH, you could argue "Num" is 100%
self-evident in this case. I don't care that much either way.
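
For anyone who wants to check the arithmetic, here it is spelled out in Python. Two assumptions are baked in: one byte per character, and a std universe job carrying the two std-only counters but not the reconnect counter (that's the reading under which "at most 12" works out).

```python
# Extra bytes per job ad from spelling "Number" instead of "Num".
SAVED_PER_ATTR = len("Number") - len("Num")  # 3 characters per attribute

vanilla = ["NumberJobExecuted", "NumberShadowSpawned"]
reconnected = vanilla + ["NumberJobReconnected"]
# Assumption: std universe jobs get the two std-only counters and
# never the reconnect counter.
std_universe = vanilla + ["NumberResumedFromCheckpoint", "NumberJobRestarted"]

def extra_bytes(attrs):
    """Bytes spent on the longer prefix across the given attributes."""
    return SAVED_PER_ATTR * len(attrs)

vanilla_cost = extra_bytes(vanilla)
reconnected_cost = extra_bytes(reconnected)
std_universe_cost = extra_bytes(std_universe)
```

That gives 6, 9, and 12 bytes respectively, matching the figures above.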
The other obvious possible amendment is a
"NumberJobReconnectedToThisStarter" (with a more catchy name). But,
I don't think we need it, and we shouldn't add it until someone
claims they need it and they convince us. ;)
Cheers,
-Derek
p.s. Original email thread that started this discussion:
On Jul 24, 2007, at 9:23 AM, Derek Wright wrote:
On Jul 24, 2007, at 8:50 AM, Sanjay Padhi wrote:
It is hard to know in Condor whether JobRunCount > 1 because:
1. the startd was dead and the negotiator matched the job to a new
startd,
or
2. it is simply the same startd, but with a different shadow (a
reconnect).
Is it possible to differentiate between the two?
Not currently, but it would be relatively easy to add another
attribute to the job classad like JobNumReconnects or something.
Honestly, the semantics of all of these counters aren't entirely
clear to my (still waking up) mind...
JobRunCount is really "how many times did a shadow get started for
this job"
NumRestarts is only in standard universe, and really means "how
many times did we resume from a checkpoint". It's not restarting
at all. :(
We don't have an accurate count for "how many times did a starter
get started for this job"
JobNumReconnects would (presumably) be increasing over the life of
the job, even if it ended up getting completely killed and
restarted at the beginning, then another reconnect happened.
Anyway, this is sort of a mess. I'd rather not just throw another
ill-thought-out counter into the mix, but I'll agree that some way
to differentiate "vanilla job restarted from the beginning" vs.
"vanilla job reconnected after schedd/shadow crash" would be
useful. I'll talk to some other developers and try to come up with
a reasonable plan. Stay tuned.
Cheers,
-Derek