HTCondor Project List Archives




Re: [Condor-devel] JobRunCount



On Fri, Aug 10, 2007 at 03:08:10AM -0700, Derek Wright wrote:
> 
> Therefore, here's my current straw-man proposal:
> 
> (b) NumberJobExecuted
> -- incremented (by both shadows) every time the starter sends  
> CONDOR_begin_execution
> 

begin_execution can fail, and it isn't a quick operation. I can imagine
plenty of scenarios where, without some sort of two-phase update to
NumberJobExecuted, either the counter goes up without a starter actually
being spawned, or a starter is spawned but the job ad is not updated. Is
that a big deal?  An easier counter might be "NumberMinimumAttemptsAtStarter".
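
To make the two failure modes concrete, here is a minimal sketch. It is a
toy model, not HTCondor code: the function names and the begin_ok/update_ok
flags are hypothetical stand-ins for "begin_execution succeeded" and "the
job ad update succeeded".

```python
# Toy model only; none of these names exist in HTCondor.

def increment_then_begin(count, begin_ok):
    """Bump the counter first, then attempt begin_execution.

    Returns (new_count, starter_ran). If begin_execution fails, the
    counter has already gone up without a starter running.
    """
    count += 1
    return count, begin_ok


def begin_then_increment(count, begin_ok, update_ok):
    """Attempt begin_execution first, then update the job ad.

    Returns (new_count, starter_ran). If the ad update fails, a starter
    is running but the counter never moved.
    """
    if not begin_ok:
        return count, False
    if update_ok:
        count += 1
    return count, True
```

With a single-step update, the first ordering can overcount and the second
can undercount. The first ordering effectively counts attempts rather than
confirmed starts, which may be closer to what a name like
"NumberMinimumAttemptsAtStarter" would promise.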

> (c) NumberShadowSpawned
> -- incremented (by the schedd) every time it spawns a shadow
> 
> (d) NumberJobReconnected
> -- incremented (by the new shadow) every time it successfully  
> reconnects, regardless of if the starter was killed and restarted in  
> the meantime.  So, there'd be no counter for "number of reconnects to  
> the current starter" in my proposal, though, you could at least tell  
> how many starters are in the picture via NumberJobExecuted.
> 
> (e) NumberResumedFromCheckpoint and NumberJobRestarted
> -- std universe only: punt for now and leave NumRestarts in there as- 
> is. :(
> 

I'm fine with using Number instead of Num, and I don't feel there's a
pressing need for a count of reconnects to the current starter (though
I think that's a more useful counter from a job policy standpoint than
the number of reconnects over the life of the job).
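
As a rough illustration of the difference between the two reconnect
counters, the sketch below keeps both. It is a simplified model, not
HTCondor code: the class, its method names, and the
reconnects_this_starter field are hypothetical, and it assumes the
per-starter count resets whenever a new starter begins execution.

```python
# Simplified model; the class and the per-starter field are hypothetical.

class ReconnectCounters:
    def __init__(self):
        self.number_job_executed = 0      # proposal (b)
        self.number_job_reconnected = 0   # proposal (d): life of the job
        self.reconnects_this_starter = 0  # not in the proposal

    def on_begin_execution(self):
        # A new starter is in the picture, so the per-starter count resets.
        self.number_job_executed += 1
        self.reconnects_this_starter = 0

    def on_reconnect(self):
        # A successful reconnect bumps both counters.
        self.number_job_reconnected += 1
        self.reconnects_this_starter += 1
```

A policy expression could then distinguish a job that has reconnected many
times over several starters from one that keeps losing contact with its
current starter.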

-Erik