[Condor-devel] questions about local universe job exit semantics




i've been looking into a problem where, in the local universe, if the starter fails to update the job info in the schedd, the starter just hangs and never exits. this was reported by LIGO.

unfortunately, the fix is a little more complicated than i hoped (which is why it's taking so long), but now i'm a little bit torn on the semantics we should attempt to guarantee in this case.

here are the sorts of things that have to happen on job exit (there's a rough code sketch of this sequence right after the list):

a) write the job exited event to the userlog

b) write the "output classad". this is sort of gridshell-specific, but basically you can have the starter write the final update classad to a file of your choice (including STDOUT) when it sees the job is done. it's essentially the same info the starter normally sends to the shadow, which then gets stuffed into the userlog and job queue on the submit host. we don't make much use of this feature now, but we could potentially use it to avoid having the starter stick around for shadow reconnects entirely: just save the job's output classad along with its real output sandbox, and whenever the shadow finally reconnects, the startd would spawn a starter (or something) to send all that data back, instead of having the starter sit there blocking the resource until the lease expired... but that's another story.

c) update the canonical info in the job queue

d) send the email notification
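
as a rough illustration, here's the shape of that sequence -- the function names below are made up for this sketch, not the actual starter code:

    // hypothetical sketch of the exit sequence above -- the names are made
    // up for illustration, not the real starter code.
    #include <cstdio>

    // stubs standing in for the real work; each returns true on success.
    static bool writeUserLogExitEvent() { std::puts("(a) userlog exited event"); return true; }
    static bool writeOutputClassAd()    { std::puts("(b) output classad");       return true; }
    static bool updateJobQueue()        { std::puts("(c) schedd job queue");     return true; }
    static bool sendEmailNotification() { std::puts("(d) email notification");   return true; }

    static void onJobExit()
    {
        // (a) a crash after this point but before (c) leaves the job marked
        //     running, so it may run again and write a second exited event.
        writeUserLogExitEvent();

        // (b) on a re-run this gets written twice, which consumers of the
        //     output classad may not handle.
        writeOutputClassAd();

        // (c) the step that can fail against the schedd -- how to retry it
        //     is the question in this mail.
        if (!updateJobQueue()) {
            /* retry strategy goes here -- see options #1-#3 below */
        }

        // (d) a crash between (c) and (d) means the job leaves the queue
        //     with no email notification at all.
        sendEmailNotification();
    }

    int main() { onJobExit(); }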


currently, they happen in the above order. so, if you were unlucky and things crashed after (a) but before (c), you could see the job exited event in the userlog while the job was still marked running, so the job could run again. that's the desired behavior, since we say two exited events are better than none. you'd also see the job's output classad written twice (which would probably break something -- i don't know if anything/anyone can handle that case). however, if you were unlucky and crashed between (c) and (d), the job could exit without any email notification at all.

so, aside from those problems, my new problem is that the way the code is organized makes it exceedingly difficult to *just* keep retrying (c) in a non-blocking way. :(

therefore, my choices are:

1) just keep retrying (c) in a *blocking* fashion. i.e. don't return to daemoncore but sleep() between retries.

2) potentially repeat any/all of the above as many times as it takes for (c) to work, but return to daemoncore each time.

3) change the order in which we're doing things, so we always try (c) first (the schedd update). if it fails and we need to retry, we don't hit any duplicate code.


because of daemoncore semantics, if we do the blocking sleep() (option #1), we'd never notice signals from the startd/schedd telling us to shut down fast -- we'd only notice once the other side gave up, decided we were hung, and resorted to SIGKILL. :( however, other than fast shutdown, i'm not sure what else we'd need to return to DaemonCore to listen for. ;)

#3 raises nasty concerns about the semantics we guarantee: if we write to the job queue first and then get killed, a job can leave the queue without a corresponding event in the userlog, which is bad. :(

#2 just makes the code more ugly, but is probably the best of a handful of rather crappy options.
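
to make #2 a bit more concrete, here's a very rough sketch of the control flow i mean -- the "event loop" below is just a stand-in for daemoncore, and all the names are made up, so don't read it as the actual patch:

    // very rough sketch of the #2 approach: each attempt may redo any/all
    // of (a)-(d), but between attempts we return to the event loop so that
    // signals (e.g. fast shutdown) can still be delivered.  in the real
    // starter this would be a daemoncore timer; the event loop and names
    // here are hypothetical, just to show the control flow.
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>

    // stand-in for daemoncore: holds pending timer callbacks and runs them.
    struct FakeEventLoop {
        std::queue<std::function<void()>> timers;
        void registerTimer(std::function<void()> cb) { timers.push(std::move(cb)); }
        void run() {
            while (!timers.empty()) {  // the real loop would also poll sockets/signals here
                auto cb = std::move(timers.front());
                timers.pop();
                cb();
            }
        }
    };

    // (c) stub: fail the first two attempts to simulate a flaky schedd update.
    static bool updateJobQueue() {
        static int attempts = 0;
        return ++attempts > 2;
    }

    static void attemptJobCleanup(FakeEventLoop &loop) {
        // (a) and (b) happen here, and may be repeated on a later attempt.
        std::puts("writing userlog event + output classad");

        if (!updateJobQueue()) {
            // (c) failed: instead of sleep()ing (option #1), schedule another
            // attempt and return to the event loop, so a fast-shutdown signal
            // would still be handled in between.
            std::puts("schedd update failed, will retry");
            loop.registerTimer([&loop] { attemptJobCleanup(loop); });
            return;
        }

        // (d) email notification, then the starter can finally exit.
        std::puts("schedd update succeeded, sending notification and exiting");
    }

    int main() {
        FakeEventLoop loop;
        loop.registerTimer([&loop] { attemptJobCleanup(loop); });
        loop.run();
    }

(in the real thing we'd presumably want a sane retry interval / backoff instead of immediately re-queuing the attempt, but that's a detail.)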

unless i hear any objections, i'll probably plow ahead working on a patch for #2, see how big/complicated the diff gets, and then decide if it should go into 6.8 or 6.9...

RFC...

thanks,
-derek