HTCondor Project List Archives




Re: [Condor-devel] proposal: PREEMPT_AND_HOLD




On Mar 3, 2008, at 3:15 PM, Dan Bradley wrote:

> I agree with this in principle, but in practice, if the sane defaults
> are in fact to put the job on hold and then use the hold policy to
> decide what to do, then it comes down to an implementation detail.

Since I was *just* looking at all this code for the starter hook stuff, I should chime in here.

In reality, all the starter does in these error cases is send a CONDOR_ulog RSC to the shadow. This contains information about an error to log. True, if the starter puts an ATTR_HOLD_REASON_CODE in there, the shadow currently interprets that to mean "I'd better put this job on hold", but ultimately, the shadow doesn't have to do so. It could interpret the hold reason code (and subcode, if appropriate) and decide "nah, I'm not going to call that fatal, I'll just requeue it instead".
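
To make that concrete, here's a rough sketch (in Python, purely for illustration; it is not the actual shadow code) of the kind of decision the shadow could make when the CONDOR_ulog RSC arrives. The function name, the Action enum, and the numeric code 1001 are all made up for the example; the only thing taken from the discussion above is that the ad may or may not carry a hold reason code, which I'm assuming shows up under the attribute name "HoldReasonCode".

```python
from enum import Enum


class Action(Enum):
    HOLD = "hold"          # treat the error as fatal and put the job on hold
    REQUEUE = "requeue"    # treat the error as transient and requeue the job
    LOG_ONLY = "log_only"  # no hold code present: just log the error


# Hypothetical set of hold reason codes the shadow chooses not to treat as fatal.
# 1001 is an invented value, not a real hold reason code.
REQUEUE_CODES = frozenset({1001})


def decide_action(ulog_ad, requeue_codes=REQUEUE_CODES):
    """Decide what to do with the error ad carried by a ulog-style RSC.

    ulog_ad is a plain dict standing in for the ClassAd sent by the starter.
    """
    code = ulog_ad.get("HoldReasonCode")
    if code is None:
        # The starter gave no hint, so this is only an error to log.
        return Action.LOG_ONLY
    if code in requeue_codes:
        # The shadow overrides the hint: "not fatal, requeue it instead".
        return Action.REQUEUE
    # Default behavior: a hold code means the job goes on hold.
    return Action.HOLD
```

The point is just that the hold code is a hint: the policy lives on the shadow side, so changing which codes count as fatal doesn't require touching the starter.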

So, it already works exactly like Erik is saying it should. All Dan's saying is that the startd should also be able to send a similar hint to the shadow somehow. I don't see anything wrong with that on religious grounds.

In terms of how to implement this from the startd's perspective, I'm somewhat leery of going through the starter, but it could work. The deal is that currently, the startd *could* send a DC signal to the starter when the job is being put on hold. However, the existing shadow -> execute protocol is such that the startd doesn't have that information, so the job ends up being evicted with no special knowledge of whether it's for a hold or not. Even if we knew it was for a hold, we'd want a hold code and reason string, which we can't send via a DC signal. So, we'd have to register a new DC command that the startd can send the starter to trigger a hold/preempt, and include the hold code and reason string as the payload of that command. Then, the starter just does what it always does when it gets the hold "signal", but it'd also send the RSC to the shadow with the hold code and reason string.
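
Roughly, the flow described above could look like the following toy model (Python, for illustration only; this is not DaemonCore and none of these names exist in the code). The command number 9999, the HoldRequest type, the Starter class, and the send_rsc callback are all invented for the sketch, and the attribute names in the ad are assumptions.

```python
from dataclasses import dataclass

# Hypothetical command id for the new "hold this job" command (invented value).
HOLD_JOB_COMMAND = 9999


@dataclass
class HoldRequest:
    """Payload the startd would send along with the command."""
    code: int     # hold reason code
    reason: str   # human-readable hold reason string


class Starter:
    """Toy stand-in for the starter's command dispatch, not real DaemonCore."""

    def __init__(self, send_rsc):
        self._send_rsc = send_rsc   # callable that delivers an ad to the shadow
        self._handlers = {}
        self.shutting_down = False
        # Unlike a bare signal, a registered command can carry a payload.
        self.register_command(HOLD_JOB_COMMAND, self._handle_hold)

    def register_command(self, command, handler):
        self._handlers[command] = handler

    def dispatch(self, command, payload):
        handler = self._handlers.get(command)
        if handler is None:
            raise ValueError(f"unregistered command: {command}")
        handler(payload)

    def _handle_hold(self, request):
        # Do whatever the starter always does on the hold "signal"...
        self.shutting_down = True
        # ...and additionally tell the shadow why, via the RSC.
        self._send_rsc({
            "HoldReasonCode": request.code,
            "HoldReason": request.reason,
        })
```

From the startd's side that would amount to something like `starter.dispatch(HOLD_JOB_COMMAND, HoldRequest(1001, "preempted by startd policy"))`, with the shadow then free to interpret the code however its policy says.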

It really shouldn't be that bad, but *please* don't work on this at all until the hook branch is merged, since it's going to be conflict city. ;)

Cheers,
-Derek