HTCondor Project List Archives




Re: [Condor-devel] JobRunCount



(For those new to this thread, please see the forwarded message at the bottom for background. Basically, various counters in the job classad regarding shadow and starter activity are either misleading or missing, and we need to fix that.)

Here are my concerns:

a) the names are all inconsistent (JobFooCount vs. NumFoo vs. JobNumFoo)

b) Sanjay obviously needs a real counter for Foo = "Job started executing"

c) we should probably keep something for Foo = "Another shadow was spawned" (what "JobRunCount" currently gives you).

d) we should probably add something for Foo = "Successfully reconnected"

e) std universe should have Foo = "Resumed from checkpoint". (b) should also be incremented when this happens, though perhaps we should also have a std-universe only Foo = "Job (re)started from the beginning."


Re (a): It seems like we're starting to move towards using "Job" at the front of job attrs like this, but I'm not convinced that's a good idea. If they're in a job ad, it's obvious they're about a job. And I don't see how any of this would make sense anywhere except a job classad.

In general for the names, I think we're better served without so many abbreviations. I'd rather the attribute names were slightly longer so they make more immediate sense to the humans interacting with them.


Therefore, here's my current straw-man proposal:

(b) NumberJobExecuted
-- incremented (by both shadows) every time the starter sends CONDOR_begin_execution

(c) NumberShadowSpawned
-- incremented (by the schedd) every time it spawns a shadow

(d) NumberJobReconnected
-- incremented (by the new shadow) every time it successfully reconnects, regardless of whether the starter was killed and restarted in the meantime. So there'd be no counter for "number of reconnects to the current starter" in my proposal, though you could at least tell how many starters are in the picture via NumberJobExecuted.

(e) NumberResumedFromCheckpoint and NumberJobRestarted
-- std universe only: punt for now and leave NumRestarts in there as-is. :(


Short term, we leave JobRunCount as a synonym for NumberShadowSpawned, but call it deprecated and encourage people to use one of these new Number* attrs instead.
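To make the semantics concrete, here is a rough sketch in Python of how the counters for (b), (c) and (d) would move. This is only a toy model, not Condor code: the event names and the apply_events() helper are made up for illustration, and the real increments would happen inside the schedd and the shadows.

```python
# Toy model of the straw-man counters. Not HTCondor code: the event names
# and apply_events() are hypothetical, chosen only to illustrate semantics.

# Which counter each (made-up) event increments.
EVENT_TO_COUNTER = {
    "shadow_spawned": "NumberShadowSpawned",    # (c) schedd spawns a shadow
    "begin_execution": "NumberJobExecuted",     # (b) starter sends CONDOR_begin_execution
    "reconnected": "NumberJobReconnected",      # (d) new shadow successfully reconnects
}


def apply_events(events):
    """Return a dict standing in for the job ad after the given events."""
    ad = {name: 0 for name in EVENT_TO_COUNTER.values()}
    for event in events:
        ad[EVENT_TO_COUNTER[event]] += 1
    # Short term, JobRunCount stays around as a deprecated synonym.
    ad["JobRunCount"] = ad["NumberShadowSpawned"]
    return ad


# Case 1: the job was matched again and started over on a (possibly new) startd.
restarted = apply_events(
    ["shadow_spawned", "begin_execution", "shadow_spawned", "begin_execution"]
)

# Case 2: same starter, but a second shadow reconnected to it.
reconnected = apply_events(
    ["shadow_spawned", "begin_execution", "shadow_spawned", "reconnected"]
)

print(restarted)
print(reconnected)
```

Both cases end up with JobRunCount = 2, which is exactly the ambiguity in the original question. With the new attrs, the first case shows NumberJobExecuted = 2 and NumberJobReconnected = 0, while the second shows NumberJobExecuted = 1 and NumberJobReconnected = 1.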

It's just after 3am, so I'm not the most coherent I can be, but I don't think this is insane. I believe it'll provide all the counters we know we need now and can foresee needing anytime soon. If there are any last-minute counter-proposals for the names and semantics, let's hear them, but we need this done ASAP. The code is not that hard (and either Todd or I will get to it before this weekend); all we really need is to finalize the names.

The most obvious counter-proposal is s/Number/Num/ for all of the above. If someone feels strongly that saving the 3 chars per attribute is important, go ahead and try to convince me. But the space implications are 6 bytes for your usual vanilla job, 9 for one that had network troubles and reconnected, and at most 12 for a std universe job. I think that's a tiny price to pay (even on a huge job queue) for better readability. OTOH, you could argue "Num" is 100% self-evident in this case. I don't care that much either way.
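For anyone who wants to check the arithmetic, here is a small Python sketch of it. The assumptions are mine: one byte per character, a vanilla job carrying the (b) and (c) attrs, a reconnected job adding (d), and a std universe job carrying (b), (c) and both (e) attrs. The 100,000-job queue is a made-up figure, only there to give a sense of scale.

```python
# Back-of-the-envelope cost of "Number" vs. "Num" in attribute names.
# Assumes one byte per character; the attribute sets per job type are
# an assumption made for this sketch.

vanilla = ["NumberJobExecuted", "NumberShadowSpawned"]
reconnected = vanilla + ["NumberJobReconnected"]
std_universe = vanilla + ["NumberResumedFromCheckpoint", "NumberJobRestarted"]


def overhead(attrs):
    """Extra bytes spent by spelling out 'Number' instead of 'Num'."""
    return sum(len(a) - len(a.replace("Number", "Num", 1)) for a in attrs)


print(overhead(vanilla))        # 6
print(overhead(reconnected))    # 9
print(overhead(std_universe))   # 12

# Hypothetical queue of 100,000 vanilla jobs: 600,000 extra bytes in total.
print(overhead(vanilla) * 100_000)
```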

The other obvious possible amendment is a "NumberJobReconnectedToThisStarter" (with a catchier name). But I don't think we need it, and we shouldn't add it until someone claims they need it and convinces us. ;)


Cheers,
-Derek


p.s. Original email thread that started this discussion:


On Jul 24, 2007, at 9:23 AM, Derek Wright wrote:
On Jul 24, 2007, at 8:50 AM, Sanjay Padhi wrote:

It is hard to know in Condor whether JobRunCount > 1 because:

1. the startd was dead and the negotiator matched the job to a new startd,

or

2. it is simply the same startd, but with a different shadow (a reconnect).

Is it possible to differentiate between the two?

Not currently, but it would be relatively easy to add another attribute to the job classad like JobNumReconnects or something.

Honestly, the semantics of all of these counters aren't entirely clear to my (still waking up) mind...

JobRunCount is really "how many times did a shadow get started for this job"

NumRestarts is only in standard universe, and really means "how many times did we resume from a checkpoint". It's not restarting at all. :(

We don't have an accurate count for "how many times did a starter get started for this job"

JobNumReconnects would (presumably) be increasing over the life of the job, even if it ended up getting completely killed and restarted at the beginning, then another reconnect happened.


Anyway, this is sort of a mess. I'd rather not just throw another ill-thought-out counter into the mix, but I'll agree that some way to differentiate "vanilla job restarted from the beginning" vs. "vanilla job reconnected after schedd/shadow crash" would be useful. I'll talk to some other developers and try to come up with a reasonable plan. Stay tuned.

Cheers,
-Derek