[Condor-devel] questions about local universe job exit semantics




i've been looking into a problem where, in the local universe, if the starter fails to update the job info in the schedd, the starter just hangs and never exits. this was reported by LIGO.

unfortunately, the fix is a little more complicated than i hoped (which is why it's taking so long), but now i'm a little bit torn on the semantics we should attempt to guarantee in this case.

here are the sorts of things that have to happen on job exit (there's a rough code sketch of this sequence right after the list):

a) write the job exited event to the userlog

b) write the "output classad". this is sort of gridshell-specific, but basically you can have the starter write the final update classad to a file of your choice (including STDOUT) when it sees the job is done. it's essentially the same info the starter normally sends to the shadow, which then gets stuffed into the userlog and job queue on the submit host. we don't make much use of this feature now, but we could potentially use it to avoid having the starter stick around for shadow reconnects entirely: just save the job's output classad along with its real output sandbox, and whenever the shadow finally reconnects, the startd would spawn a starter (or something) to send all that data back, instead of having the starter sit there blocking the resource until the lease expired... but that's another story.

c) update the canonical info in the job queue

d) send the email notification
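
as a rough illustration, here's the shape of that sequence -- the function names below are made up for this sketch, not the actual starter code:

    // hypothetical sketch of the exit sequence above -- the names are made
    // up for illustration, not the real starter code.
    #include <cstdio>

    // stubs standing in for the real work; each returns true on success.
    static bool writeUserLogExitEvent() { std::puts("(a) userlog exited event"); return true; }
    static bool writeOutputClassAd()    { std::puts("(b) output classad");       return true; }
    static bool updateJobQueue()        { std::puts("(c) schedd job queue");     return true; }
    static bool sendEmailNotification() { std::puts("(d) email notification");   return true; }

    static void onJobExit()
    {
        // (a) a crash after this point but before (c) leaves the job marked
        //     running, so it may run again and write a second exited event.
        writeUserLogExitEvent();

        // (b) on a re-run this gets written twice, which consumers of the
        //     output classad may not handle.
        writeOutputClassAd();

        // (c) the step that can fail against the schedd -- how to retry it
        //     is the question in this mail.
        if (!updateJobQueue()) {
            /* retry strategy goes here -- see options #1-#3 below */
        }

        // (d) a crash between (c) and (d) means the job leaves the queue
        //     with no email notification at all.
        sendEmailNotification();
    }

    int main() { onJobExit(); }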


currently, they happen in the above order. so, if you were unlucky and things crashed after (a) but before (c), you could see the job exited event in the userlog while the job was still marked running, so the job could run again. that's the desired behavior, since we say two exited events are better than none. you'd also see the job's output classad written twice (which would probably break something -- i don't know if anything/anyone can handle that case). however, if you were unlucky and crashed between (c) and (d), the job could exit without any email notification at all.

so, aside from those problems, my new problem is that the way the code is organized makes it exceedingly difficult to *just* keep retrying (c) in a non-blocking way. :(

therefore, my choices are:

1) just keep retrying (c) in a *blocking* fashion. i.e. don't return to daemoncore but sleep() between retries.

2) potentially repeat any/all of the above as many times as it takes for (c) to work, but return to daemoncore each time.

3) change the order in which we're doing things, so we always try (c) first (the schedd update). if it fails and we need to retry, we don't hit any duplicate code.


because of daemoncore semantics, if we do the blocking sleep() (option #1), we'd never notice signals from the startd/schedd telling us to shut down fast -- we'd only notice once the other side gave up, decided we were hung, and resorted to SIGKILL. :( however, other than fast shutdown, i'm not sure what else we'd need to return to DaemonCore to listen for. ;)

#3 raises nasty concerns about the semantics we guarantee: if we write to the job queue first and then get killed, a job can leave the queue without a corresponding event in the userlog, which is bad. :(

#2 just makes the code more ugly, but is probably the best of a handful of rather crappy options.
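
to make #2 a bit more concrete, here's a very rough sketch of the control flow i mean -- the "event loop" below is just a stand-in for daemoncore, and all the names are made up, so don't read it as the actual patch:

    // very rough sketch of the #2 approach: each attempt may redo any/all
    // of (a)-(d), but between attempts we return to the event loop so that
    // signals (e.g. fast shutdown) can still be delivered.  in the real
    // starter this would be a daemoncore timer; the event loop and names
    // here are hypothetical, just to show the control flow.
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>

    // stand-in for daemoncore: holds pending timer callbacks and runs them.
    struct FakeEventLoop {
        std::queue<std::function<void()>> timers;
        void registerTimer(std::function<void()> cb) { timers.push(std::move(cb)); }
        void run() {
            while (!timers.empty()) {  // the real loop would also poll sockets/signals here
                auto cb = std::move(timers.front());
                timers.pop();
                cb();
            }
        }
    };

    // (c) stub: fail the first two attempts to simulate a flaky schedd update.
    static bool updateJobQueue() {
        static int attempts = 0;
        return ++attempts > 2;
    }

    static void attemptJobCleanup(FakeEventLoop &loop) {
        // (a) and (b) happen here, and may be repeated on a later attempt.
        std::puts("writing userlog event + output classad");

        if (!updateJobQueue()) {
            // (c) failed: instead of sleep()ing (option #1), schedule another
            // attempt and return to the event loop, so a fast-shutdown signal
            // would still be handled in between.
            std::puts("schedd update failed, will retry");
            loop.registerTimer([&loop] { attemptJobCleanup(loop); });
            return;
        }

        // (d) email notification, then the starter can finally exit.
        std::puts("schedd update succeeded, sending notification and exiting");
    }

    int main() {
        FakeEventLoop loop;
        loop.registerTimer([&loop] { attemptJobCleanup(loop); });
        loop.run();
    }

(in the real thing we'd presumably want a sane retry interval / backoff instead of immediately re-queuing the attempt, but that's a detail.)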

unless i hear any objections, i'll probably plow ahead working on a patch for #2, see how big/complicated the diff gets, and then decide if it should go into 6.8 or 6.9...

RFC...

thanks,
-derek