[Condor-devel] somewhat evil problem with work fetch and schedd claims. :(
- Date: Tue, 5 Feb 2008 22:03:56 -0800
- From: Derek Wright <wright@xxxxxxxxxxx>
- Subject: [Condor-devel] somewhat evil problem with work fetch and schedd claims. :(
dang... i guess this is exactly the sort of thing you don't think
about until you're debugging and see it happening live. :(
there's an annoying bug in the interaction of fetched claims and
schedd-pushed ones. not sure what to do about this -- todd: i'm
guessing you're going to say "screw it, not important", which is why
i'm writing this up before i spend any more time on it. ;)
the startd has a Claim object that represents each claim or
preempting claim. so, each slot has a few Claim* members to keep
track of the current claim, and the current preempting claim (if
any). so, if you're in owner, unclaimed, etc, there's just the
current claim (the one you're advertising to the collector) and no
preempting claim. once you're claimed, that current claim is no
longer being advertised, since it's in use. however, there's now a
preempting claim generated, which is advertised to the collector in
case of user prio inversion or machine rank.
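as a rough sketch of that bookkeeping (python for brevity, not the startd's actual code; the Slot and Claim classes, the id format, and the method names are all made up for illustration, and r_cur / r_pre are the pointer names used later in this message):

```python
import itertools

_ids = itertools.count(1)


class Claim:
    """Hypothetical stand-in for the startd's Claim object."""

    def __init__(self):
        self.claim_id = f"claim-{next(_ids)}"


class Slot:
    """Toy model of one slot: a current claim plus an optional preempting claim."""

    def __init__(self):
        self.state = "owner"
        self.r_cur = Claim()   # current claim, advertised while not in use
        self.r_pre = None      # preempting claim, only exists once claimed

    def advertised_claim_id(self):
        # once claimed, the current claim is in use, so the preempting
        # claim is the one the collector hears about
        if self.state == "claimed":
            return self.r_pre.claim_id
        return self.r_cur.claim_id

    def enter_claimed(self):
        # entering claimed generates a fresh preempting claim
        self.state = "claimed"
        self.r_pre = Claim()


slot = Slot()
before = slot.advertised_claim_id()   # id of r_cur
slot.enter_claimed()
after = slot.advertised_claim_id()    # id of r_pre
```

the only point here is that the id the collector sees changes from r_cur's to r_pre's the moment the slot becomes claimed.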
enter work fetching...
in the code as currently checked in to V7_1-fetchwork-branch,
whenever the startd successfully invokes the fetching hook, parses a
valid classad, and evaluates its policy to decide if it's willing to
run it, the startd generates a claim object for that fetched claim
(which stashes the job classad, etc), and stuffs that into the slot
as the current claim. it then moves to the claimed state, invokes
the starter, etc. as soon as the startd enters the claimed state, it
automatically generates a new preemption claim and starts advertising
that. as soon as that starter exits, the startd leaves claimed and
goes back to the owner state, to start the whole process again.
so, here's the bug:
- the startd always clears out all the claim objects and starts fresh
whenever it returns to the owner state.
- if you've got short running fetch jobs, the startd is cycling
between owner and claimed frequently.
- the claims advertised to the collector are therefore always stale. :(
so, once a slot starts into fetch mode, it's basically impossible
for an opportunistic job to claim that slot, even if it's ranked
higher, since it never gets matched in time to claim the slot before
the startd throws out that claimid and generates a new one. :(
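to make the race concrete, here's a toy simulation (python, not startd code; the Slot class, its method names, and the rule that a claim request is accepted only if it carries the currently advertised id are simplifying assumptions):

```python
import itertools

_ids = itertools.count(1)


def new_claim_id():
    return f"claim-{next(_ids)}"


class Slot:
    """Toy model of the behavior described above."""

    def __init__(self):
        self.state = "owner"
        self.r_cur = new_claim_id()
        self.r_pre = None

    def advertised(self):
        # claimed slots advertise the preempting claim, others the current one
        return self.r_pre if self.state == "claimed" else self.r_cur

    def start_fetched_job(self):
        # the fetch claim replaces r_cur; a new preempting claim is generated
        self.r_cur = new_claim_id()
        self.r_pre = new_claim_id()
        self.state = "claimed"

    def fetched_job_exits(self):
        # back to owner: every claim is thrown away and regenerated
        self.state = "owner"
        self.r_cur = new_claim_id()
        self.r_pre = None

    def request_claim(self, claim_id):
        # assumption: a schedd succeeds only with the currently advertised id
        return claim_id == self.advertised()


slot = Slot()
slot.start_fetched_job()
matched_id = slot.advertised()   # the id a matched schedd would be handed
slot.fetched_job_exits()         # the short fetched job finishes first...
slot.start_fetched_job()         # ...and the next one starts
accepted = slot.request_claim(matched_id)   # stale id, so this is False
```

as long as a fetch cycle is shorter than the time it takes a match to turn into a claim request, the schedd always shows up holding an id the slot has already discarded.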
here are the possible solutions:
A) won't fix. either have longer running jobs from your fetched
work, or just be happy that your startd is so busy with your fetched
work, and forget about schedd-based claims. ;)
B) have the startd special case the "fetch claim", stash it in a
separate member in the slot data structure, and pepper the startd
with code to see if we've got a fetch claim or not and act
accordingly. ugh.
C) when the startd fetches a classad and generates a claim, save the
current claim (r_cur) into the preempting claim pointer (r_pre) and
put the fetch claim in r_cur. so, while the startd is running that
particular fetched job, if the startd is matched with a higher ranked
opportunistic job, the schedd should have a chance to attempt to
claim the startd and preempt the fetched claim, since the claim id
from the collector will match r_pre. however, you're still screwed
once the fetched job exits, since even r_pre is destroyed at that
point, too. :( so, it widens the window for the race, but doesn't
eliminate it. in fact, with short running fetch jobs, this is almost
irrelevant and you'll still basically never get claimed by a schedd.
D) in addition to (C), change the startd so that when it leaves the
claimed state to return to owner, move the r_pre claim back to r_cur,
instead of destroying everything and starting fresh. i'd have to
ponder the implications of this, but it might work.
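a small sketch of how (C) alone and (C) plus (D) would differ (python toy model, not startd code; the Slot class, the keep_on_exit flag, and the ids_seen helper are invented for illustration, and it assumes that under (C) no separate preempting claim is generated on entering claimed):

```python
import itertools

_ids = itertools.count(1)


def new_claim_id():
    return f"claim-{next(_ids)}"


class Slot:
    def __init__(self, keep_on_exit):
        self.keep_on_exit = keep_on_exit   # False: (C) only, True: (C) + (D)
        self.state = "owner"
        self.r_cur = new_claim_id()
        self.r_pre = None

    def advertised(self):
        return self.r_pre if self.state == "claimed" else self.r_cur

    def start_fetched_job(self):
        # (C): the already-advertised claim moves to r_pre,
        # and the fetch claim goes in r_cur
        self.r_pre = self.r_cur
        self.r_cur = new_claim_id()
        self.state = "claimed"

    def fetched_job_exits(self):
        self.state = "owner"
        if self.keep_on_exit:
            # (D): move r_pre back to r_cur instead of starting fresh
            self.r_cur = self.r_pre
        else:
            # (C) only: everything is destroyed and regenerated
            self.r_cur = new_claim_id()
        self.r_pre = None


def ids_seen(slot, cycles=3):
    """Record the advertised id at every state change over a few fetch cycles."""
    seen = [slot.advertised()]
    for _ in range(cycles):
        slot.start_fetched_job()
        seen.append(slot.advertised())
        slot.fetched_job_exits()
        seen.append(slot.advertised())
    return seen


c_only = ids_seen(Slot(keep_on_exit=False))
c_and_d = ids_seen(Slot(keep_on_exit=True))
```

in this model, c_only keeps the same id from owner into claimed but picks up a new one after every job exit, while c_and_d shows a single id for the whole run, which is what would give a schedd a real chance to claim the slot.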
rough time estimates:
A) 0 ;)
B) ~1-2 days. :(
C) ~1 hour.
D) ~2-8 hours, depending on what's uncovered during the pondering.
Thoughts?
-Derek