HTCondor Project List Archives




Re: [Condor-devel] questions about local universe job exit semantics



On Tue, Dec 26, 2006 at 05:15:07PM -0800, Derek Wright wrote:
> 3) change the order in which we're doing things, so we always try (c)  
> first (schedd update).  if it fails, and we need to re-try, we don't  
> hit any duplicate code.

This one is similar to the "pending termination" state the new shadow
uses when writing the termination event. Basically:

0) Upon shadow startup, if job ad is in termination pending state, then 
	goto 3.

1) job starts, executes, then terminates

2) shadow puts TERMINATE_PENDING = TRUE into the classad and sends the update
	to the schedd, excepting if the update fails (timeout, etc.).

3) shadow writes termination event to log

4) shadow sends email

If anything goes wrong, start over from the beginning. This means that if 3 or
4 fail, then at step 0 you just start up long enough to write the terminate
event. You may get more than one termination event and/or email. But at least
for the shadow case, it was determined that this is far more desirable than
simply wrong behavior.
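
To make steps 0 through 4 concrete, here is a rough Python sketch of the
retry flow. It is not the actual shadow code: FakeSchedd, shadow_attempt,
run_until_done and the callback parameters are all hypothetical names made
up for illustration, and the schedd is modeled as a plain in-memory dict.
Only the TERMINATE_PENDING attribute comes from the description above.

```python
class FakeSchedd:
    """Hypothetical in-memory stand-in for the schedd's copy of the job ad."""

    def __init__(self):
        self.ad = {}
        self.fail_next_update = False

    def update(self, attrs):
        # Simulate an update failure (timeout, etc.) when asked to.
        if self.fail_next_update:
            self.fail_next_update = False
            raise RuntimeError("schedd update failed")
        self.ad.update(attrs)


def shadow_attempt(schedd, run_job, write_event, send_email):
    """One shadow lifetime; any exception means 'start over from step 0'."""
    ad = dict(schedd.ad)  # job ad as seen at shadow startup

    # Step 0: if termination is already pending, skip straight to step 3.
    if not ad.get("TERMINATE_PENDING", False):
        # Step 1: job starts, executes, then terminates.
        run_job()
        # Step 2: record the pending state in the schedd. A failure here
        # propagates, so the flag is never recorded for this attempt.
        schedd.update({"TERMINATE_PENDING": True})

    # Step 3: write the termination event to the log.
    write_event()
    # Step 4: send the notification email.
    send_email()


def run_until_done(schedd, run_job, write_event, send_email, max_attempts=5):
    """Restart the shadow until one attempt completes; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            shadow_attempt(schedd, run_job, write_event, send_email)
            return attempt
        except Exception:
            continue  # assumed restart policy: just try again from step 0
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

In this sketch, a failure in step 4 on the first attempt leads to a second
attempt that skips the job itself but writes the termination event again,
which is the duplicate-event behavior described above.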

You should probably do something very similar, since it would be identical
behavior to the shadow semantics. Also the starter is acting like the
shadow in this sense, so it should do the same thing.

Thank you.

-pete