HTCondor Project List Archives




Re: [Condor-devel] questions about local universe job exit semantics




On Dec 26, 2006, at 11:31 PM, Peter Keller wrote:

> You should probably do something very similar, since it would be identical
> behavior to the shadow semantics.

perhaps. however, in a way, this just "squeezes the balloon", since the problem comes when there's a timeout talking to the schedd to update the job queue. now, instead of this causing problems when updating the final job info into the job queue (exit status, imagesize, etc), we hit the timeout writing the terminate_pending state change. so, what do we do then? just give up and re-run the job? that seems wasteful and lame. so, we have the exact same problems i mentioned in my original email (blocking vs. non-blocking retries, how to avoid duplication of side-effects, etc), just for a different thing we're updating the job queue about. :(
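To make the "blocking vs. non-blocking retries" and "duplication of side-effects" concerns concrete, here is a minimal Python sketch. Everything in it is hypothetical: `FakeJobQueue`, `PendingUpdate`, `QueueTimeout`, and the update id are illustrative names, not Condor code or schedd interfaces. It assumes each update carries an id that the queue remembers, so a retry after a lost reply does not apply the change twice, and it assumes the caller drives retries from a timer loop rather than sleeping.

```python
class QueueTimeout(Exception):
    """The (hypothetical) job queue did not reply within the timeout."""


class FakeJobQueue:
    """Stand-in for the schedd's job queue, not a real Condor interface.

    Every update is committed, but the first `lost_acks` replies are dropped,
    so the caller sees a timeout even though the write happened.
    """

    def __init__(self, lost_acks=0):
        self.lost_acks = lost_acks
        self.attrs = {}
        self.applied = set()  # update ids that were already committed
        self.commits = 0

    def update(self, update_id, attrs):
        if update_id not in self.applied:  # a replayed id is a no-op
            self.applied.add(update_id)
            self.attrs.update(attrs)
            self.commits += 1
        if self.lost_acks > 0:
            self.lost_acks -= 1
            raise QueueTimeout("no reply from job queue")


class PendingUpdate:
    """One queue update, retried from an event loop instead of blocking."""

    def __init__(self, queue, update_id, attrs, max_attempts=5, base_delay=1.0):
        self.queue = queue
        self.update_id = update_id
        self.attrs = attrs
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.attempts = 0
        self.next_try = 0.0
        self.done = False

    @property
    def gave_up(self):
        return not self.done and self.attempts >= self.max_attempts

    def poll(self, now):
        """Make at most one attempt if the backoff has elapsed; never sleeps."""
        if self.done or self.gave_up or now < self.next_try:
            return self.done
        self.attempts += 1
        try:
            self.queue.update(self.update_id, self.attrs)
            self.done = True
        except QueueTimeout:
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            self.next_try = now + self.base_delay * 2 ** (self.attempts - 1)
        return self.done
```

A caller would invoke `poll(time.monotonic())` from its existing timer handling. Note that the sketch only bounds the retries; it does not answer what to do once `gave_up` is true, and that question is the same whether the update being written is the final job info or a terminate_pending state change.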

also, keep in mind this is *not* the job queue update to change the state. the schedd is still responsible for that, pending the exit of the starter (or shadow).

so, while it might be a good idea to introduce a terminate_pending state into local universe, i'm not convinced it a) actually solves any problems we currently have or b) is worth doing in the middle of a stable series.

thanks, though, that's a good thing to consider...

any other thoughts?

-derek