HTCondor Project List Archives




Re: [Condor-devel] schedd not always reusing claims when it should



On 01/30/2012 10:52 AM, Todd Tannenbaum wrote:

Just found a bug in the schedd that causes it to release claims when it
still has jobs it could run on those claims. Should be easy to fix, I
will fix it this morning after consulting w/ the wrangler about whether
it should go into the v7.7.5 branch and/or into stable.

I was looking into the above because I was getting some unexpected
results testing my patch to create dynamic slots w/o a negotiation
cycle. (The problem is unrelated to my patch, btw.)

The story is:

1. a job id x completes.
2. schedd goes through priorec array to try and find another job that
matches the claimed resource. UNFORTUNATELY, it may happen that
a) the priorec has not yet been rebuilt (i.e. the schedd is using a
cached priorec array), and
b) the job classad for completed job id x has not yet been destroyed, as
it is waiting to be destroyed in the enqueueFinishedJob queue
3. as a result of a and b, findrunnable job will try to match *the very
job that just completed* with the now idle claim
4. the job that just completed may no longer match with the claimed
startd ad, especially in the case of submitting jobs w/ the defaults and
using dynamic slots. For instance, machine.Memory < jobad.ImageSize,
because ImageSize in the completed job is now bigger than when the
resource was initially claimed (because now that the job is completed,
ImageSize now reflects the real size seen by the starter and not just
the size of the executable on disk).
5. findrunnable job now marks the entire autocluster as not matching
this claim
6. the claim is relinquished when Todd's test really expected it should
be reused :(
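
The sequence above can be sketched as a small, self-contained Python
simulation. This is only an illustration of the failure mode, not schedd
code: the names Job, Claim, find_runnable_job, and the skip_completed
switch are all hypothetical, Memory and ImageSize are treated as being in
the same unit for simplicity, and skipping non-idle jobs is just one
assumed way to avoid the problem, not necessarily the fix that was
applied.

```python
from dataclasses import dataclass, field

IDLE, COMPLETED = "idle", "completed"


@dataclass
class Job:
    job_id: str
    autocluster: int
    image_size: int          # same unit as Claim.memory (a simplification)
    status: str = IDLE


@dataclass
class Claim:
    memory: int
    # Autoclusters already judged not to match this claim.
    rejected_autoclusters: set = field(default_factory=set)


def matches(claim, job):
    # Stand-in for the real job/startd match: fails when Memory < ImageSize.
    return claim.memory >= job.image_size


def find_runnable_job(prio_rec, claim, skip_completed):
    """Walk the (possibly stale) priority record array for a job to reuse the claim."""
    for job in prio_rec:
        if skip_completed and job.status != IDLE:
            continue  # assumed guard: never consider a job that already finished
        if job.autocluster in claim.rejected_autoclusters:
            continue  # whole autocluster was already marked as not matching
        if matches(claim, job):
            return job
        # One failed match rejects every job in the same autocluster (step 5).
        claim.rejected_autoclusters.add(job.autocluster)
    return None


def simulate(skip_completed):
    finished = Job("x", autocluster=1, image_size=10)
    waiting = Job("y", autocluster=1, image_size=10)
    prio_rec = [finished, waiting]   # cached array, built before x completed
    claim = Claim(memory=128)

    # Step 1: job x completes; its ad lingers and ImageSize now reflects
    # the real size seen by the starter.
    finished.status = COMPLETED
    finished.image_size = 512

    return find_runnable_job(prio_rec, claim, skip_completed)


if __name__ == "__main__":
    print("without guard:", simulate(skip_completed=False))
    print("with guard:   ", simulate(skip_completed=True))
```

Without the guard, the completed job x is tried first, fails to match,
and takes idle job y down with it because they share an autocluster, so
nothing is found and the claim would be released. With the guard, job y
is found and the claim can be reused.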

Does this impact 7.6?

Best,


matt