HTCondor Project List Archives




Re: [Condor-devel] somewhat evil problem with work fetch and schedd claims. :(




On Feb 6, 2008, at 7:55 AM, Dan Bradley wrote:

> This is the part I don't understand. Why does it transition to owner if there is a preempting claim?

Sorry, I guess I should have been more specific with my terminology...

There's only an _advertised_ preempting claim at this point. If a schedd had been lucky enough to win the race and present itself to request the claim, there'd be a _pending_ preempting claim, and then, yes, once the fetched job exited or was kicked off, we'd do the usual preemption stuff and end up back in Claimed without ever going to Owner.

Here's the sequence of events I'm really talking about:

A) Startd is in Owner -- blank slate

B) Startd goes to Unclaimed, generates claimid #1, and advertises that in its ClassAd.

C) Startd fetches work, gets a job back, decides to run it, destroys claim #1 (r_cur), and creates claim #2 for the fetched job.

D) Startd transitions to Claimed/Busy, starts running fetched claim #2, and generates claim #3 as the advertised preempting claim (r_pre).

E) Meanwhile, the negotiator matches some schedd with the advertised claim #1.

F) Schedd comes to claim the startd, presents claim #1.

G) Startd says "error, can't find that claim id, sorry", and refuses the claim.

H) Fetched job exits, startd returns to Owner, destroys both claim #2 and #3, and generates a new claim #4 as the advertised "opportunistic" claim and starts over again at step (B).
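
For anyone who wants to poke at the sequence, here is a toy model of steps (A) through (H). It is only a sketch of the bookkeeping as described above, not startd code: the class and method names are made up for illustration, and only r_cur and r_pre correspond to anything real.

```python
import itertools


class ToyStartd:
    """Hypothetical stand-in for the startd's claim bookkeeping."""

    def __init__(self):
        self._ids = itertools.count(1)   # claim ids #1, #2, #3, ...
        self.state = "Owner"             # (A) blank slate
        self.r_cur = None
        self.r_pre = None

    def go_unclaimed(self):
        # (B) generate a fresh claim id and advertise it
        self.state = "Unclaimed"
        self.r_cur = next(self._ids)
        self.r_pre = None
        return self.r_cur

    def fetch_work(self):
        # (C) the old r_cur is destroyed; a new claim holds the fetched job
        self.r_cur = next(self._ids)
        # (D) run the job and create the advertised preempting claim
        self.state = "Claimed/Busy"
        self.r_pre = next(self._ids)
        return self.r_pre

    def request_claim(self, claim_id):
        # (F)/(G) a schedd presents a claim id; refuse it if unknown
        return claim_id in (self.r_cur, self.r_pre)

    def fetched_job_exits(self):
        # (H) back to Owner, drop both claims, start over at (B)
        self.state = "Owner"
        self.r_cur = None
        self.r_pre = None
        return self.go_unclaimed()


startd = ToyStartd()
advertised = startd.go_unclaimed()           # claim #1 goes out in the ClassAd
startd.fetch_work()                          # claims #2 and #3 replace it
accepted = startd.request_claim(advertised)  # (E)/(F): schedd shows up with #1
next_advertised = startd.fetched_job_exits() # claim #4
print(advertised, accepted, next_advertised)
```

The schedd's request is refused because claim #1 no longer exists anywhere by the time it arrives.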


See what I mean?

So, from my original message, we could potentially solve (G) by saving claim #1 into r_pre when we generate claim #2 for the fetched job as r_cur. However, if (H) happens quickly, even that won't really help, since chances are low that (E) and (F) will happen fast enough to prevent everything from getting cleared out again at step (H). That's why we'd *also* need to move r_pre back to r_cur when we go back to the Owner state, so that the claimid we'd advertised to the wild is still valid...
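
A rough sketch of that two-part idea, again as a toy model rather than real startd code (the class and method names are hypothetical, and it assumes the advertised claim id is simply reused after the fetched job exits):

```python
import itertools


class PatchedToyStartd:
    """Hypothetical model: the advertised claim survives fetched work."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.state = "Owner"
        self.r_cur = None
        self.r_pre = None

    def go_unclaimed(self):
        # Only generate a new claim id if we don't already have a valid one.
        self.state = "Unclaimed"
        if self.r_cur is None:
            self.r_cur = next(self._ids)
        return self.r_cur

    def fetch_work(self):
        # Part 1: save the advertised claim into r_pre instead of destroying it.
        self.r_pre = self.r_cur
        self.r_cur = next(self._ids)
        self.state = "Claimed/Busy"

    def request_claim(self, claim_id):
        return claim_id is not None and claim_id in (self.r_cur, self.r_pre)

    def fetched_job_exits(self):
        # Part 2: move r_pre back to r_cur so the advertised id stays valid.
        self.state = "Owner"
        self.r_cur = self.r_pre
        self.r_pre = None
        return self.go_unclaimed()


# Case 1: the schedd arrives while the fetched job is still running.
fast = PatchedToyStartd()
fast_id = fast.go_unclaimed()
fast.fetch_work()
accepted_during_job = fast.request_claim(fast_id)

# Case 2: the fetched job exits before the schedd arrives.
slow = PatchedToyStartd()
slow_id = slow.go_unclaimed()
slow.fetch_work()
readvertised = slow.fetched_job_exits()
accepted_after_exit = slow.request_claim(slow_id)
print(accepted_during_job, readvertised == slow_id, accepted_after_exit)
```

In both orderings the schedd's claim #1 is still recognized, which is the whole point of doing both halves of the change.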

Make sense?

-Derek