HTCondor Project List Archives




[Condor-devel] schedd not always reusing claims when it should




Just found a bug in the schedd that causes it to release claims when it still has jobs it could run on those claims. It should be easy to fix; I will fix it this morning after consulting w/ the wrangler about whether it should go into the v7.7.5 branch and/or into stable.

I was looking into the above because I was getting some unexpected results while testing my patch to create dynamic slots w/o a negotiation cycle. (The problem is unrelated to my patch, btw.)

The story is:

1. A job with id x completes.

2. The schedd goes through the PrioRec array to try to find another job that matches the claimed resource. Unfortunately, it may happen that (a) the PrioRec array has not yet been rebuilt (i.e. we are using a cached PrioRec array), and (b) the job classad for completed job id x has not yet been destroyed, because it is waiting in the enqueueFinishedJob queue.

3. As a result of (a) and (b), FindRunnableJob will try to match *the very job that just completed* with the now-idle claim.

4. The job that just completed may no longer match the claimed startd ad, especially when jobs are submitted with the defaults and dynamic slots are in use. For instance, machine.Memory < jobad.ImageSize, because ImageSize in the completed job is now bigger than when the resource was initially claimed (now that the job has completed, ImageSize reflects the real size seen by the starter, not just the size of the executable on disk).

5. FindRunnableJob now marks the entire autocluster as not matching this claim.

6. The claim is relinquished, when Todd's test really expected it to be reused :(





--
Todd Tannenbaum                       University of Wisconsin-Madison
Center for High Throughput Computing  Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                 Madison, WI 53706-1685