Re: [Condor-devel] JobRunCount
- Date: Fri, 10 Aug 2007 03:08:10 -0700
- From: Derek Wright <wright@xxxxxxxxxxx>
- Subject: Re: [Condor-devel] JobRunCount
(For those new to this thread, please see the forwarded message at
the bottom for background -- basically, various counters in the job
classad regarding shadow and starter activity are either misleading
or missing, and we need to sort that out).
Here are my concerns:
a) the names are all inconsistent (JobFooCount vs NumFoo vs. JobNumFoo)
b) Sanjay obviously needs a real counter for Foo = "Job started  
executing"
c) we should probably keep something for Foo = "Another shadow was  
spawned" (what "JobRunCount" currently gives you).
d) we should probably add something for Foo = "Successfully reconnected"
e) std universe should have Foo = "Resumed from checkpoint".  (b)  
should also be incremented when this happens, though perhaps we  
should also have a std-universe only Foo = "Job (re)started from the  
beginning."
Re (a): It seems like we're starting to move towards using "Job" at  
the front of job attrs like this, but I'm not convinced that's a good  
idea.  If they're in a job ad, it's obvious they're about a job.  And  
I don't see how any of this would make sense anywhere except a job  
classad.
In general for the names, I think we're better served without so many  
abbreviations.  I'd rather the attribute names were slightly longer  
so they make more immediate sense to the humans interacting with them.
Therefore, here's my current straw-man proposal:
(b) NumberJobExecuted
-- incremented (by both shadows) every time the starter sends  
CONDOR_begin_execution
(c) NumberShadowSpawned
-- incremented (by the schedd) every time it spawns a shadow
(d) NumberJobReconnected
-- incremented (by the new shadow) every time it successfully
reconnects, regardless of whether the starter was killed and restarted
in the meantime.  So, there'd be no counter for "number of reconnects
to the current starter" in my proposal, though you could at least tell
how many starters are in the picture via NumberJobExecuted.
(e) NumberResumedFromCheckpoint and NumberJobRestarted
-- std universe only: punt for now and leave NumRestarts in there
as-is. :(
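To make the semantics of (b), (c) and (d) concrete, here is a rough Python sketch of how the three counters would move in two cases that JobRunCount alone can't tell apart.  The job ad is modeled as a plain dict, and the handler names (on_shadow_spawned and friends) are made up for illustration; none of this is actual Condor code.

```python
# Illustrative model only: the job ad is a plain dict, and the handler
# names below are hypothetical, not real Condor functions.

def new_job_ad():
    """Return a job ad with the proposed counters at zero."""
    return {
        "NumberShadowSpawned": 0,
        "NumberJobExecuted": 0,
        "NumberJobReconnected": 0,
    }

def on_shadow_spawned(ad):
    # (c) schedd side: bump every time a shadow is spawned
    ad["NumberShadowSpawned"] += 1

def on_begin_execution(ad):
    # (b) shadow side: the starter sent CONDOR_begin_execution
    ad["NumberJobExecuted"] += 1

def on_reconnect_succeeded(ad):
    # (d) new shadow side: it successfully reconnected
    ad["NumberJobReconnected"] += 1

# Scenario A: the shadow goes away and a new shadow reconnects
# to the job that is still running.
a = new_job_ad()
on_shadow_spawned(a)
on_begin_execution(a)
on_shadow_spawned(a)
on_reconnect_succeeded(a)

# Scenario B: the job is lost and starts executing again from scratch.
b = new_job_ad()
on_shadow_spawned(b)
on_begin_execution(b)
on_shadow_spawned(b)
on_begin_execution(b)

print(a)
print(b)
```

In both scenarios NumberShadowSpawned ends up at 2, which is all JobRunCount tells you today.  The other two counters differ: A has one execution and one reconnect, B has two executions and no reconnects.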
Short term, we leave JobRunCount as a synonym for NumberShadowSpawned,
but call it deprecated and encourage people to use one of these new
Number* attrs instead.
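One way to picture the short-term synonym is to write both names whenever the schedd spawns a shadow.  This is only a sketch under that assumption: record_shadow_spawn is a hypothetical helper operating on a dict, not a description of how the schedd would actually do it.

```python
# Hypothetical helper, not Condor code: keep the deprecated name in step
# with the new one every time a shadow is spawned.

def record_shadow_spawn(ad):
    count = ad.get("NumberShadowSpawned", 0) + 1
    ad["NumberShadowSpawned"] = count
    ad["JobRunCount"] = count  # deprecated synonym, same value
    return ad

ad = record_shadow_spawn({})
ad = record_shadow_spawn(ad)
print(ad)
```

Anything that still reads JobRunCount keeps working, and the two values can never drift apart because they are only ever written together.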
It's just after 3am, so I'm not the most coherent I can be, but I  
don't think this is insane.  I believe it'll provide all the counters  
we know we need now and can foresee needing anytime soon.  If there  
are any last minute counter-proposals for the names and semantics,  
let's hear them, but we need this done ASAP.  The code is not that
hard (and either Todd or I will get to it before this weekend); all
we really need is to finalize the names.
The most obvious counter-proposal is s/Number/Num/ for all of the
above.  If someone feels strongly that saving the 3 chars per
attribute is important, go ahead and try to convince me.  But the
space implications are 6 bytes for your usual vanilla job, 9 for one  
that had network troubles and reconnected, and at most 12 for a std  
universe job.  I think that's a tiny price to pay (even on a huge job  
queue) for better readability.  OTOH, you could argue "Num" is 100%  
self-evident in this case.  I don't care that much either way.
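For anyone who wants to check the byte counts, here's the arithmetic as a small Python sketch.  Which attributes show up in which kind of job ad is my reading of the totals above, not something that's nailed down anywhere, and it assumes one byte per character.

```python
# Extra characters per attribute from spelling out "Number" instead of "Num".
SAVED_PER_ATTR = len("Number") - len("Num")

# Assumed attribute sets per kind of job, inferred from the 6/9/12 totals.
attrs_by_case = {
    "vanilla": [
        "NumberJobExecuted",
        "NumberShadowSpawned",
    ],
    "vanilla with reconnect": [
        "NumberJobExecuted",
        "NumberShadowSpawned",
        "NumberJobReconnected",
    ],
    "std universe": [
        "NumberJobExecuted",
        "NumberShadowSpawned",
        "NumberResumedFromCheckpoint",
        "NumberJobRestarted",
    ],
}

# Extra bytes per job ad for each case.
extra_bytes = {
    case: SAVED_PER_ATTR * len(attrs)
    for case, attrs in attrs_by_case.items()
}

for case, cost in extra_bytes.items():
    print(f"{case}: {cost} extra bytes per job ad")
```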
The other obvious possible amendment is a
"NumberJobReconnectedToThisStarter" (with a catchier name).  But I
don't think we need it, and we shouldn't add it until someone
claims they need it and convinces us. ;)
Cheers,
-Derek
p.s. Original email thread that started this discussion:
On Jul 24, 2007, at 9:23 AM, Derek Wright wrote:
On Jul 24, 2007, at 8:50 AM, Sanjay Padhi wrote:
It is hard to know in Condor whether JobRunCount > 1 because:
1. The startd was dead and the negotiator matched the job to a new
startd,
or
2. It is simply the same startd, but with a different shadow
(a reconnect).
Is it possible to differentiate between the two?
Not currently, but it would be relatively easy to add another  
attribute to the job classad like JobNumReconnects or something.
Honestly, the semantics of all of these counters aren't entirely  
clear to my (still waking up) mind...
JobRunCount is really "how many times did a shadow get started for
this job".
NumRestarts is only in standard universe, and really means "how
many times did we resume from a checkpoint".  It's not restarting
at all. :(
We don't have an accurate count for "how many times did a starter
get started for this job".
JobNumReconnects would (presumably) keep increasing over the life of
the job, even if the job got completely killed and restarted from
the beginning and then another reconnect happened.
Anyway, this is sort of a mess.  I'd rather not just throw another  
ill-thought-out counter into the mix, but I'll agree that some way  
to differentiate "vanilla job restarted from the beginning" vs.  
"vanilla job reconnected after schedd/shadow crash" would be  
useful.  I'll talk to some other developers and try to come up with  
a reasonable plan.  Stay tuned.
Cheers,
-Derek