[Condor-devel] questions about local universe job exit semantics
- Date: Tue, 26 Dec 2006 17:15:07 -0800
- From: Derek Wright <wright@xxxxxxxxxxx>
- Subject: [Condor-devel] questions about local universe job exit semantics
i've been looking into a problem, reported by LIGO, where in the
local universe, if the starter fails to update the job info in the
schedd, the starter ends up just hanging and never exits.
unfortunately, the fix is a little more complicated than i'd hoped
(which is why it's taking so long), and now i'm a bit torn on the
semantics we should attempt to guarantee in this case.
here are the sorts of things that have to happen on job exit
(roughly sketched in code after this list):
a) write the exit event to the userlog
b) write the "output classad". this is sort of gridshell-specific,
but basically, you can have the starter write the final update
classad to a file of your choice (including STDOUT) when it sees the
job is done. this is basically the same info the starter normally
sends to the shadow, which is then stuffed into the userlog and job
queue on the submit host. we don't make much use of this feature
now, but we could potentially use it to avoid having the starter
stick around for shadow reconnects entirely: just save the job's
output classad along with the job's real output sandbox, and
whenever the shadow finally reconnects, the startd would spawn a
starter (or something) to send all this data back, instead of having
the starter sit there, blocking the resource from being used until
the lease expired... but that's another story.
c) update the canonical info in the job queue
d) email notification
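to make the ordering concrete, here's a rough stand-alone sketch of
the current exit path. every name in it is made up for illustration;
it's not the real starter code.

  // rough sketch of the current exit ordering (a)-(d); all names are
  // hypothetical stand-ins, not the real starter functions.
  #include <cstdio>
  #include <cstdlib>

  static void writeUserLogExitEvent()  { std::puts("(a) exit event -> userlog"); }
  static void writeOutputClassAd()     { std::puts("(b) output classad -> file"); }
  static bool updateJobQueueInSchedd() { return false; /* pretend the update fails */ }
  static void sendEmailNotification()  { std::puts("(d) email notification"); }

  int main()
  {
      writeUserLogExitEvent();          // (a)
      writeOutputClassAd();             // (b)
      if (!updateJobQueueInSchedd()) {  // (c) canonical job queue update
          // there's no clean retry path here today: (a) and (b) have
          // already happened, (d) hasn't, and the starter ends up hanging.
          return EXIT_FAILURE;
      }
      sendEmailNotification();          // (d) only happens if (c) worked
      return EXIT_SUCCESS;
  }

the point is just that by the time (c) fails, (a) and (b) are already
done and (d) hasn't happened yet.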
currently, they happen in the above order. so, if you were unlucky
and things crashed after (a) but before (c), you could see the job
exited event in the userlog while the job was still marked running,
so the job could run again. this is the desired behavior, since we
say 2 exited events is better than none. you'd also see the job's
output classad written twice (which would probably break something;
i don't know if anything/anyone can actually handle this case).
however, if you were unlucky and crashed between (c) and (d), you
could have the job exit without any email notification at all.
so, aside from those problems, my new problem is that the way the
code is organized makes it exceedingly difficult to *just* keep
retrying (c) in a non-blocking way. :(
therefore, my choices are:
1) just keep retrying (c) in a *blocking* fashion, i.e. don't return
to daemoncore, just sleep() between retries (roughly sketched after
this list).
2) potentially repeat any/all of the above as many times as it takes
for (c) to work, but return to daemoncore each time.
3) change the order in which we're doing things, so we always try (c)
first (schedd update). if it fails, and we need to re-try, we don't
hit any duplicate code.
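for concreteness, option #1 amounts to something like the following
toy sketch (made-up names, and a pretend schedd that recovers on the
3rd try; not the real code):

  // toy sketch of option #1: retry (c) in a blocking loop, sleeping
  // between attempts.
  #include <unistd.h>
  #include <cstdio>

  // stand-in for (c); pretend the schedd comes back on the 3rd attempt
  static bool updateJobQueueInSchedd()
  {
      static int tries = 0;
      return ++tries >= 3;
  }

  int main()
  {
      // while we're stuck in this loop we never return to DaemonCore, so a
      // shutdown-fast signal from the startd/schedd isn't serviced until
      // the parent gives up, decides we're hung, and sends SIGKILL.
      while (!updateJobQueueInSchedd()) {
          std::puts("(c) schedd update failed, sleeping before retry");
          sleep(30);   // hypothetical retry interval
      }
      std::puts("(c) schedd update finally worked");
      return 0;
  }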
b/c of daemoncore semantics, if we're doing the blocking sleep()
(option #1), we'd never notice signals from the startd/schedd to
shut down fast until it gave up, thought we were hung, and tried
SIGKILL. :( however, other than the fast shutdown, i'm not sure what
else we'd need to return to DaemonCore to listen for. ;) #3 raises
nasty concerns with the semantics we guarantee: if we write to the
job queue first and then get killed, we can have a job leave the
queue without a corresponding event in the userlog, which is bad. :(
#2 just makes the code more ugly, but is probably the best of a
handful of rather crappy options (rough sketch below).
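here's a very rough sketch of what option #2 could look like. the
registerTimer() below is a generic stand-in for the real DaemonCore
timer mechanism (in this toy it just fires the handler immediately),
and every other name is hypothetical too.

  // rough sketch of option #2: re-run the whole exit sequence from a
  // timer handler until the schedd update (c) finally sticks.
  #include <cstdio>
  #include <functional>

  // stand-in for a DaemonCore-style timer registration; here it just
  // invokes the handler right away instead of 'seconds' seconds later.
  static void registerTimer(int seconds, const std::function<void()>& handler)
  {
      std::printf("(back to event loop; retry in ~%d seconds)\n", seconds);
      handler();
  }

  static void writeUserLogExitEvent()  { std::puts("(a) exit event -> userlog (may repeat)"); }
  static void writeOutputClassAd()     { std::puts("(b) output classad -> file (may repeat)"); }
  static bool updateJobQueueInSchedd() { static int tries = 0; return ++tries >= 3; }
  static void sendEmailNotification()  { std::puts("(d) email notification"); }

  static void jobExitAttempt()
  {
      // (a) and (b) can happen again on every attempt, which is the
      // "2 exited events is better than none" trade-off from above.
      writeUserLogExitEvent();
      writeOutputClassAd();
      if (!updateJobQueueInSchedd()) {
          // couldn't do (c); schedule another attempt and return to the
          // event loop so fast-shutdown signals still get serviced.
          registerTimer(30, jobExitAttempt);
          return;
      }
      sendEmailNotification();   // (d) only once (c) has succeeded
  }

  int main()
  {
      jobExitAttempt();
      return 0;
  }

the key property is that each failed attempt returns control to the
event loop, at the cost of possibly repeating (a) and (b).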
unless i hear any objections, i'll probably plow ahead working on a
patch for #2, see how big/complicated the diff gets, and then decide
if it should go into 6.8 or 6.9...
RFC...
thanks,
-derek