
Re: [HTCondor-users] Consistency problems between schedd(s) view and CM?



Coming back to this,

I've yet to see a case where a job listed by the collector (in a dynamic slot) is NOT listed by one of the access points. On the other hand, I have seen many cases where jobs that an AP claims are in the running state (tested via the JobStatus value) are NOT found in the collector listing.
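
One way to make that comparison concrete - a sketch, assuming claimed dynamic slots advertise GlobalJobId (they normally do) and that all APs answer a condor_q -global:

# Jobs each side reports as running, compared by GlobalJobId.
condor_status -constraint 'SlotType == "Dynamic"' -af GlobalJobId | grep -v undefined | sort -u > collector.txt
condor_q -global -allusers -constraint 'JobStatus == 2' -af GlobalJobId | sort -u > aps.txt
comm -13 collector.txt aps.txt   # running according to an AP, absent from the collector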

I've tested the collector-listed jobs to find out how old the data could possibly be, by taking the difference between each job's start time and the current time:

# The smallest ages (sort -nr | tail) bound how stale the collector's view can be.
ep=$(date +"%s") ; condor_status -af:h $ep-JobCurrentStartDate | grep -v undef | sort -nr | tail
18
18
18
18
18
18
18
18
18
18

I never find a value above about 45 seconds. Similar experiments on the AP view show a similar, though slightly smaller, bound (about 30 seconds).
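
For reference, the AP-side equivalent of the test above would look something like this - a sketch, assuming condor_q is run against each AP and the same JobCurrentStartDate attribute:

# Same age test, but against one AP's queue instead of the collector.
ep=$(date +%s) ; condor_q -allusers -constraint 'JobStatus == 2' -af "$ep - JobCurrentStartDate" | grep -v undef | sort -nr | tail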

My best guess at the moment is that when an AP gets busy, handling of completed jobs has the lowest priority. This leads to an excess number of running jobs, according to the AP.
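
If that guess is right, a busy schedd should show up in its own statistics. A sketch - the duty-cycle and job-count attributes are standard daemon statistics, but worth verifying on your version:

# A duty cycle near 1.0 means the schedd has little time left over for
# housekeeping such as reaping completed jobs.
condor_status -schedd -af Name RecentDaemonCoreDutyCycle TotalRunningJobs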

We've never seen this on our other system, so it may be an issue specific to having multiple APs.

JT


On 3 Dec 2025, at 18:48, Jeff Templon <templon@xxxxxxxxx> wrote:

Ps,

Those plots look a bit different now. That's because it is a few hours later and the "top" groups have changed: some of the groups (like offline and unused) that were "top users" this afternoon have now moved down in rank. Rank is determined by usage over the last hour in the left-hand plots and by usage over the last 24 hours in the right-hand plots. The numbers in parentheses are the percentage used of the (possibly wrong) total cycles.

JT


On 3 Dec 2025, at 18:44, Jeff Templon <templon@xxxxxxxxx> wrote:

Hi,

You're right that there is a fudge factor. It was taken over from the Torque version, where there was a known situation in which all the worker nodes "disappeared" from the server but the jobs did not, and we could trust the job counts. So whenever the total capacity computed from the nodes was less than the running jobs, we set the capacity to the total running job count (or more correctly, the summed core count of all running jobs).

This doesn't work on HTCondor (and I will remove it as soon as I get time), since sometimes the node capacity is fine and it's the job count that is wrong. The "other" is calculated by subtracting the displayed running counts (the non-brown colours) from the total capacity, which in some cases is itself computed from the running job counts, so the "other" is not very trustworthy in this plot. You can see what should be in the "other" by looking at the second set of plots - the two don't look much like each other, which is another argument that something is wrong.
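
In shell terms the computation amounts to roughly the following - a sketch only; the attribute choices (TotalSlotCpus on partitionable slots, RequestCpus summed over all running jobs as a stand-in for the per-group breakdown) are assumptions, not our actual plotting code:

# Total capacity from the node ads: summed cores of partitionable slots.
capacity=$(condor_status -constraint 'SlotType == "Partitionable"' -af TotalSlotCpus | awk '{s+=$1} END {print s}')
# Summed core count of all running jobs, over all APs.
running=$(condor_q -global -allusers -constraint 'JobStatus == 2' -af RequestCpus | awk '{s+=$1} END {print s}')
# The Torque-era fudge: if the nodes "disappeared", trust the job counts.
[ "$running" -gt "$capacity" ] && capacity=$running
# "other" is whatever capacity the displayed running counts do not account for.
other=$((capacity - running))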

You can see the full plots here:

https://www.nikhef.nl/pdp/doc/stats/ndpf-prd-grisview-short

Thanks for all the attention,

JT


On 3 Dec 2025, at 17:02, Steffen Grunewald <steffen.grunewald@xxxxxxxxxx> wrote:

On Wed, 2025-12-03 at 13:32:06 +0000, Bockelman, Brian wrote:
Under typical conditions, the time between claimed and activated is less than a second - it can be hard to catch in a busy pool.
Under busy conditions - or if there are persistent failures in activation - the numbers diverge. That leaves a slot claimed but with no job running. In the past, I've found a large discrepancy a fruitful place to dig for bugs or misconfiguration.
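
Such slots are visible directly in the collector; a sketch (Claimed/Idle is the standard state/activity pair for a slot that is claimed but has not activated a job):

# Slots that are claimed but have not activated a job.
condor_status -constraint 'State == "Claimed" && Activity == "Idle"'
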
Is this a possible explanation for what you're seeing?

Hi Brian,

I'd be surprised if a busy cluster resulted in those periods of time spanning
multiples of 10 minutes - AFAICT each of the squares in the graph grid is half an
hour wide! (But I'm willing to learn, and I'm curious about the real cause, thus
I'll be watching this space.)
What really puzzles me is that only the "other" part seems to be affected (set to
0, while the white stuff bubbles up instead). There must be something to that in
particular...

Best,
Steffen



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/