
Re: [HTCondor-users] Consistency problems between schedd(s) view and CM?



Hi,

You're right that there is a fudge factor.  It was taken over from the Torque version, where there was a known situation in which all the worker nodes "disappeared" from the server, but the jobs did not, and we could trust the job counts.  So whenever the total capacity computed from the nodes was less than the running jobs, we set the capacity to the total running job count (or more correctly, the summed core count of all running jobs).

This doesn't work on HTCondor (and I will remove it as soon as I get time) since sometimes the node capacity is fine and it's the job count that is wrong.  The "other" is calculated by subtracting the displayed running counts (the non-brown colours) from the total capacity, which is again computed in some cases from the running job counts, so the "other" is not very trustworthy in this plot.  You can see what should be in the "other" by looking at the second set of plots - they don't look much like each other, which is another argument for something being wrong.
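
Roughly, the logic described above amounts to something like the following.  This is only a sketch: the function names, variable names and numbers are made up for illustration and are not taken from the actual plotting script.

```python
def effective_capacity(node_cores, running_job_cores):
    """Torque-era fudge factor: never report less capacity than cores in use.

    node_cores        -- total cores summed over the worker nodes (hypothetical input)
    running_job_cores -- list of core counts, one per running job (hypothetical input)
    """
    cores_in_use = sum(running_job_cores)
    # If the nodes report less capacity than the jobs occupy, trust the jobs.
    return max(node_cores, cores_in_use)


def other_cores(capacity, displayed_running):
    """'Other' = total capacity minus the running counts shown in the plot.

    displayed_running -- mapping of plotted category -> cores (hypothetical input)
    """
    return capacity - sum(displayed_running.values())


# Example: the nodes have "disappeared" (0 cores), but jobs are still counted.
jobs = [8, 8, 16]
capacity = effective_capacity(node_cores=0, running_job_cores=jobs)
displayed = {"group_a": 16, "group_b": 16}
other = other_cores(capacity, displayed)
```

In this made-up case the capacity is derived entirely from the running jobs, so the "other" comes out as zero regardless of what the nodes are really doing.  If the job count itself is wrong, both numbers inherit the error.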

You can see the full plots here:

https://www.nikhef.nl/pdp/doc/stats/ndpf-prd-grisview-short

Thanks for all the attention,

JT


> On 3 Dec 2025, at 17:02, Steffen Grunewald <steffen.grunewald@xxxxxxxxxx> wrote:
> 
> On Wed, 2025-12-03 at 13:32:06 +0000, Bockelman, Brian wrote:
>> Under typical conditions, the time between claimed and activated is less than a second - can be hard to catch in a busy pool.
>> Under busy conditions - or if there are persistent failures in activation, the numbers diverge.  That causes a slot to be claimed - but no jobs running.  In the past, I've found a large discrepancy a fruitful place to dig for bugs or misconfiguration.
>> Is this a possible explanation for what you're seeing?
> 
> Hi Brian,
> 
> I'd be surprised if a busy cluster would result in those periods of time spanning
> multiples of 10 minutes AFAICT - each of the squares in the graph grid is half an
> hour wide! (But I'm willing to learn, and I'm curious about the real cause, thus
> I'll be watching this space.)
> What really puzzles me is that only the "other" part seems to be affected (set to
> 0, instead the white stuff bubbles up). There must be something to that in
> particular...
> 
> Best,
> Steffen
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> 
> The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/