HTCondor Project List Archives




[Condor-devel] somewhat evil problem with work fetch and schedd claims. :(



dang... i guess this is exactly the sort of thing you don't think about until you're debugging and see it happening live. :(

there's an annoying bug in the interaction of fetched claims and schedd-pushed ones. not sure what to do about this -- todd: i'm guessing you're going to say "screw it, not important", which is why i'm writing this up before i spend any more time on it. ;)

the startd has a Claim object that represents each claim or preempting claim. so, each slot has a few Claim* members to keep track of the current claim, and the current preempting claim (if any). so, if you're in owner, unclaimed, etc, there's just the current claim (the one you're advertising to the collector) and no preempting claim. once you're claimed, that current claim is no longer being advertised, since it's in use. however, there's now a preempting claim generated, which is advertised to the collector so the slot can still be matched and preempted because of user priority or machine rank.
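
to make the bookkeeping above concrete, here's a rough python sketch of a slot and its two claim pointers. this is a toy model under stated assumptions, not the startd's actual code: the Claim and Slot classes, the state strings, and the advertised_claim_id() and enter_claimed() helpers are all made up for illustration. only the r_cur and r_pre names come from the slot structure discussed in this mail.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

# hypothetical id generator; real claim ids are not small integers
_ids = itertools.count(1)


@dataclass
class Claim:
    # toy stand-in for the startd's Claim object
    claim_id: int = field(default_factory=lambda: next(_ids))


@dataclass
class Slot:
    state: str = "owner"
    r_cur: Claim = field(default_factory=Claim)  # current claim
    r_pre: Optional[Claim] = None                # preempting claim, if any

    def advertised_claim_id(self) -> int:
        # outside of claimed, the current claim is what the collector sees;
        # once claimed, the preempting claim is advertised instead
        claim = self.r_pre if self.state == "claimed" else self.r_cur
        assert claim is not None
        return claim.claim_id

    def enter_claimed(self) -> None:
        # r_cur is now in use, so a fresh preempting claim is generated
        self.state = "claimed"
        self.r_pre = Claim()
```

the point of the sketch is just that the claim id the collector holds changes the moment the slot becomes claimed.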

enter work fetching...

in the code as currently checked in to V7_1-fetchwork-branch, whenever the startd successfully invokes the fetching hook, parses a valid classad, and evaluates its policy to decide if it's willing to run it, the startd generates a claim object for that fetched claim (which stashes the job classad, etc), and stuffs that into the slot as the current claim. it then moves to the claimed state, invokes the starter, etc. as soon as the startd enters the claimed state, it automatically generates a new preempting claim and starts advertising that. as soon as that starter exits, the startd leaves claimed and goes back to the owner state, to start the whole process again.

so, here's the bug:

- the startd always clears out all the claim objects and starts fresh whenever it returns to the owner state.

- if you've got short running fetch jobs, the startd is cycling between owner and claimed frequently.

- the claims advertised to the collector are therefore always stale. :( so, once a slot starts into fetch mode, it's basically impossible for an opportunistic job to claim that slot, even if it's ranked higher, since it never gets matched in time to claim the slot before the startd throws out that claim id and generates a new one. :(
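
the three points above can be sketched as a tiny tick-based simulation. everything here is an assumption made for illustration: time is counted in abstract ticks, the owner state is treated as instantaneous, a schedd is assumed to try a claim on every tick using the id that was advertised match_delay ticks earlier, and the function and parameter names are hypothetical. it is not a model of real negotiation timing.

```python
import itertools


def claim_attempts(job_len, match_delay, total_ticks):
    """Return (attempts, successes) for schedd claim attempts.

    job_len     -- ticks each fetched job runs before the slot cycles
    match_delay -- ticks between an id being advertised and a schedd using it
    """
    ids = itertools.count(1)
    advertised_at = {}  # tick -> claim id advertised at that tick
    valid = None        # the only claim id the startd currently accepts
    attempts = successes = 0
    for tick in range(total_ticks):
        if tick % job_len == 0:
            # fetched job exited: back to owner, every claim is thrown out,
            # the next fetched job starts and a fresh id gets advertised
            valid = next(ids)
        advertised_at[tick] = valid
        seen_tick = tick - match_delay
        if seen_tick >= 0:
            attempts += 1
            # the schedd shows up with the id it was matched against earlier
            if advertised_at[seen_tick] == valid:
                successes += 1
    return attempts, successes


if __name__ == "__main__":
    print("short jobs:", claim_attempts(job_len=2, match_delay=5, total_ticks=100))
    print("long jobs: ", claim_attempts(job_len=60, match_delay=5, total_ticks=120))
```

in this toy model, whenever the fetched jobs are no longer than the matching delay, every claim attempt arrives with a stale id. longer jobs let some attempts through, which is the same tradeoff option (A) below leans on.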


here are the possible solutions:

A) won't fix. either have longer running jobs from your fetched work, or just be happy that your startd is so busy with your fetched work, and forget about schedd-based claims. ;)

B) have the startd special case the "fetch claim", stash it in a separate member in the slot data structure, and pepper the startd with code to see if we've got a fetch claim or not and act accordingly. ugh.

C) when the startd fetches a classad and generates a claim, save the current claim (r_cur) into the preempting claim pointer (r_pre) and put the fetch claim in r_cur. so, while the startd is running that particular fetched job, if the startd is matched with a higher ranked opportunistic job, the schedd should have a chance to attempt to claim the startd and preempt the fetched claim, since the claim id from the collector will match r_pre. however, you're still screwed once the fetched job exits, since even r_pre is destroyed at that point, too. :( so, it widens the window for the race, but doesn't eliminate it. in fact, with short running fetch jobs, this is almost irrelevant and you'll still basically never get claimed by a schedd.
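
a minimal python sketch of what (C) would do to the pointers, again as a toy model rather than startd code. the Claim and Slot classes and the method names (reset_to_owner, start_fetched_job, schedd_can_claim) are hypothetical; only r_cur and r_pre are names from the proposal.

```python
import itertools
from dataclasses import dataclass, field

_ids = itertools.count(1)  # hypothetical id source


@dataclass
class Claim:
    kind: str
    claim_id: int = field(default_factory=lambda: next(_ids))


class Slot:
    def __init__(self):
        self.reset_to_owner()

    def reset_to_owner(self):
        # current behavior: throw everything out and start fresh
        self.state = "owner"
        self.r_cur = Claim("advertised")
        self.r_pre = None

    def start_fetched_job(self):
        # option C: keep the already-advertised claim around as r_pre
        # instead of generating a brand new preempting claim
        self.r_pre = self.r_cur
        self.r_cur = Claim("fetch")
        self.state = "claimed"

    def schedd_can_claim(self, claim_id):
        # a schedd can only claim with the id the startd currently honors
        candidate = self.r_pre if self.state == "claimed" else self.r_cur
        return candidate is not None and candidate.claim_id == claim_id
```

with this, an id handed out while the slot was in owner still works while the fetched job runs, but stops working as soon as reset_to_owner() is called when the job exits.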

D) in addition to (C), change the startd so that when it leaves the claimed state to return to owner, move the r_pre claim back to r_cur, instead of destroying everything and starting fresh. i'd have to ponder the implications of this, but it might work.
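
here's the same kind of toy python sketch for (C) plus (D), to show why the advertised id would survive across fetch cycles. as before, the classes and method names are made up for illustration, and it ignores whatever implications the real state machine has for keeping a claim alive this way.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)  # hypothetical id source


@dataclass
class Claim:
    kind: str
    claim_id: int = field(default_factory=lambda: next(_ids))


class Slot:
    def __init__(self):
        self.state = "owner"
        self.r_cur: Claim = Claim("advertised")
        self.r_pre: Optional[Claim] = None

    def start_fetched_job(self):
        # option C: park the advertised claim in r_pre
        self.r_pre = self.r_cur
        self.r_cur = Claim("fetch")
        self.state = "claimed"

    def finish_fetched_job(self):
        # option D: move r_pre back to r_cur instead of starting fresh
        self.r_cur = self.r_pre
        self.r_pre = None
        self.state = "owner"

    def advertised_claim_id(self):
        claim = self.r_pre if self.state == "claimed" else self.r_cur
        return claim.claim_id
```

in this model the id the collector holds never changes no matter how many short fetched jobs run, so a schedd that got matched a few cycles ago would still show up with a valid claim id.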


rough time estimates:
A) 0 ;)
B) ~1-2 days. :(
C) ~1 hour.
D) ~2-8 hours, depending on what's uncovered during the pondering.


Thoughts?
-Derek