HTCondor Project List Archives


[Condor-devel] somewhat evil problem with work fetch and schedd claims. :(

Date: Tue, 5 Feb 2008 22:03:56 -0800
From: Derek Wright <wright@xxxxxxxxxxx>
Subject: [Condor-devel] somewhat evil problem with work fetch and schedd claims. :(

dang... i guess this is exactly the sort of thing you don't think about until you're debugging and see it happening live. :(

there's an annoying bug in the interaction of fetched claims and schedd-pushed ones. not sure what to do about this -- todd: i'm guessing you're going to say "screw it, not important", which is why i'm writing this up before i spend any more time on it. ;)

the startd has a Claim object that represents each claim or preempting claim. so, each slot has a few Claim* members to keep track of the current claim, and the current preempting claim (if any). so, if you're in owner, unclaimed, etc, there's just the current claim (the one you're advertising to the collector) and no preempting claim. once you're claimed, that current claim is no longer being advertised, since it's in use. however, there's now a preempting claim generated, which is advertised to the collector in case of user prio inversion or machine rank.
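
to make the bookkeeping above concrete, here's a minimal toy model in python. it is not the startd code: only r_cur and r_pre are names from this mail (they show up in the options further down), while Slot, activate(), advertised_id() and the claim id format are hypothetical, made up for illustration.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)  # hypothetical id source, not the real claim id format


@dataclass
class Claim:
    claim_id: str = field(default_factory=lambda: f"claim-{next(_ids)}")


class Slot:
    """Toy slot: one current claim plus an optional preempting claim."""

    def __init__(self) -> None:
        self.state = "unclaimed"
        self.r_cur: Claim = Claim()          # advertised while owner/unclaimed
        self.r_pre: Optional[Claim] = None   # only exists once claimed

    def advertised_id(self) -> str:
        # Once claimed, the current claim is in use, so the preempting
        # claim is the one offered to the collector.
        claim = self.r_pre if self.state == "claimed" else self.r_cur
        return claim.claim_id

    def activate(self) -> None:
        # Entering claimed generates a fresh preempting claim.
        self.state = "claimed"
        self.r_pre = Claim()


slot = Slot()
before = slot.advertised_id()   # id of r_cur
slot.activate()
after = slot.advertised_id()    # id of the new r_pre
print(before, after)
```

the only point of the sketch: the id on offer changes the moment the slot becomes claimed, because a different Claim object is now the one being advertised.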


enter work fetching...

in the code as currently checked in to V7_1-fetchwork-branch, whenever the startd successfully invokes the fetching hook, parses a valid classad, and evaluates its policy to decide if it's willing to run it, the startd generates a claim object for that fetched claim (which stashes the job classad, etc), and stuffs that into the slot as the current claim. it then moves to the claimed state, invokes the starter, etc. as soon as the startd enters the claimed state, it automatically generates a new preemption claim and starts advertising that. as soon as that starter exits, the startd leaves claimed and goes back to the owner state, to start the whole process again.


so, here's the bug:

- the startd always clears out all the claim objects and starts fresh whenever it returns to the owner state.

- if you've got short running fetch jobs, the startd is cycling between owner and claimed frequently.

- the claims advertised to the collector are therefore always stale. :( so, once a slot starts into fetch mode, it's basically impossible for an opportunistic job to claim that slot, even if it's ranked higher, since it never gets matched in time to claim the slot before the startd throws out that claimid and generates a new one. :(
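
the race in the three points above can be sketched with a tiny tick-based simulation. this is a toy under loud assumptions: time is in abstract ticks, the slot spends zero ticks in owner between fetched jobs, every fetched job runs exactly job_len ticks, and match_delay stands in for the time between a claim id being advertised and a schedd showing up with it. the function and parameter names are hypothetical.

```python
def schedd_can_claim(job_len: int, match_delay: int, horizon: int = 100) -> bool:
    """True if some claim id advertised at tick t is still valid at t + match_delay."""
    advertised = []
    current_id = 0
    for t in range(horizon):
        if t % job_len == 0:
            # A fetched job starts: the slot went through owner (all claims
            # thrown away) and re-entered claimed with a fresh claim id.
            current_id += 1
        advertised.append(current_id)
    # The schedd arrives match_delay ticks after seeing the ad; the claim
    # attempt only works if the id it carries is still the live one.
    return any(
        advertised[t] == advertised[t + match_delay]
        for t in range(horizon - match_delay)
    )


# short fetch jobs: every advertised id is gone before the schedd arrives
print(schedd_can_claim(job_len=2, match_delay=5))
# long fetch jobs: the id survives long enough to be claimed
print(schedd_can_claim(job_len=30, match_delay=5))
```

in this model the schedd only ever wins when match_delay is smaller than job_len, which is the "always stale" situation described above for short running fetch jobs.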



here are the possible solutions:

A) won't fix. either have longer running jobs from your fetched work, or just be happy that your startd is so busy with your fetched work, and forget about schedd-based claims. ;)

B) have the startd special case the "fetch claim", stash it in a separate member in the slot data structure, and pepper the startd with code to see if we've got a fetch claim or not and act accordingly. ugh.

C) when the startd fetches a classad and generates a claim, save the current claim (r_cur) into the preempting claim pointer (r_pre) and put the fetch claim in r_cur. so, while the startd is running that particular fetched job, if the startd is matched with a higher ranked opportunistic job, the schedd should have a chance to attempt to claim the startd and preempt the fetched claim, since the claim id from the collector will match r_pre. however, you're still screwed once the fetched job exits, since even r_pre is destroyed at that point, too. :( so, it widens the window for the race, but doesn't eliminate it. in fact, with short running fetch jobs, this is almost irrelevant and you'll still basically never get claimed by a schedd.

D) in addition to (C), change the startd so that when it leaves the claimed state to return to owner, move the r_pre claim back to r_cur, instead of destroying everything and starting fresh. i'd have to ponder the implications of this, but it might work.
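
a rough sketch of how (C) alone and (C)+(D) differ, as a toy python model rather than startd code. r_cur and r_pre are the names used above; the Slot class, its methods, the keep_pre_on_exit switch and the claim id format are hypothetical. it also ignores everything (D) would still need pondering about, e.g. what happens if a schedd actually uses r_pre while the fetched job is running.

```python
import itertools

_ids = itertools.count(1)  # hypothetical id source


def new_claim() -> str:
    return f"claim-{next(_ids)}"


class Slot:
    def __init__(self, keep_pre_on_exit: bool) -> None:
        # False models (C) alone, True models (C) plus (D).
        self.keep_pre_on_exit = keep_pre_on_exit
        self.state = "owner"
        self.r_cur = new_claim()
        self.r_pre = None

    def start_fetched_job(self) -> None:
        # (C): the claim already advertised moves to r_pre,
        # and the fetch claim takes over r_cur.
        self.r_pre = self.r_cur
        self.r_cur = new_claim()
        self.state = "claimed"

    def fetched_job_exits(self) -> None:
        if self.keep_pre_on_exit:
            self.r_cur = self.r_pre    # (D): move r_pre back to r_cur
        else:
            self.r_cur = new_claim()   # throw everything out, start fresh
        self.r_pre = None
        self.state = "owner"

    def advertised(self) -> str:
        return self.r_pre if self.state == "claimed" else self.r_cur


def ids_seen(slot: Slot, jobs: int = 3) -> list:
    """Record the advertised claim id before and after each state change."""
    seen = [slot.advertised()]
    for _ in range(jobs):
        slot.start_fetched_job()
        seen.append(slot.advertised())
        slot.fetched_job_exits()
        seen.append(slot.advertised())
    return seen


c_only = ids_seen(Slot(keep_pre_on_exit=False))
c_and_d = ids_seen(Slot(keep_pre_on_exit=True))
print(len(set(c_only)), len(set(c_and_d)))
```

with (C) alone the advertised id survives each fetched job but is replaced every time the job exits, so the id still churns once per job. with (D) added, the same id stays on offer across the whole run in this model.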



rough time estimates:
A) 0 ;)
B) ~1-2 days. :(
C) ~1 hour.
D) ~2-8 hours, depending on what's uncovered during the pondering.


Thoughts?
-Derek

Follow-Ups:
- Re: [Condor-devel] somewhat evil problem with work fetch and schedd claims. :(
  - From: Dan Bradley
- Re: [Condor-devel] somewhat evil problem with work fetch and schedd claims. :(
  - From: Daniel Forrest
