Hi Max. I presume from this command

    condor_q -global atlasprd ...

that one of the users in question is atlasprd. Those condor_q totals show 661+114+62+8 jobs running across all of those schedds.
That is consistent with the negotiator seeing 1400 cores claimed by jobs in the Atlas group. The schedd shows jobs in "running" state as soon as it has matches and is trying to start the jobs on those matches. There should be something in the SchedLog at that point, although possibly not at the default log level. It might be more productive to track this from the execute side, however.
Try running

    condor_status -claimed

and pick out one of the machines that has claimed slots for that user. Then go to that machine and look at the StartLog and StarterLog.* files. You might also try running condor_who on that machine; I would do that first, to see if you catch any slots showing up as claimed by that user.
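If the condor_status output is long, you can pick out just that user's claims. A minimal sketch, filtering on sample output rather than a live pool (the slot names and the atlasprd@... domain below are made-up stand-ins; against a real pool you would feed it something like `condor_status -claimed -af Name RemoteUser`):

```shell
# Sketch: pick out the slots holding claims for one user.
# The sample lines stand in for `condor_status -claimed -af Name RemoteUser`
# output; all names and domains here are fabricated.
sample='slot1_1@wn001.example atlasprd@example.org
slot1_2@wn001.example cmsprd@example.org
slot1_1@wn002.example atlasprd@example.org'
# Keep only slots whose RemoteUser starts with atlasprd@, one per line.
printf '%s\n' "$sample" | awk '$2 ~ /^atlasprd@/ {print $1}' | sort -u
```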
Best guess is that the Schedd is repeatedly trying and failing to start jobs on those
machines, which would show up as activity mostly in the StartLog
and possibly the StarterLog. If you have no messages in the ShadowLog on the schedd
side, then logging on the execute side will be in the StartLog. If the
process of starting a job gets further, then logging moves to the ShadowLog on the AP
and StarterLog on the EP.
In answer to your question: yes, it is possible for specific jobs to have resource requests that match a partitionable slot but not the dynamic slot which is created to satisfy the resource request. This problem will show up in the StartLog. In more recent versions of HTCondor the logging for this sort of failure is better; in older versions the logging does exist, it's just not as directly helpful.
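As a toy illustration of that failure mode (all numbers below are assumptions, not taken from your pool): a job with RequestMemory = 2048 whose Requirements expression demands Memory >= 4096 will match the partitionable slot, which advertises all of its unclaimed memory, yet fail against the 2048 MB dynamic slot carved out to serve it:

```shell
# Toy sketch of a pslot/dslot mismatch; every number here is made up.
job_requirements() {          # $1 = Memory the slot advertises
  [ "$1" -ge 4096 ]           # job's Requirements: Memory >= 4096
}
pslot_memory=64000            # partitionable slot: all unclaimed memory
dslot_memory=2048             # dynamic slot: exactly RequestMemory
job_requirements "$pslot_memory" && echo "matches partitionable slot"
job_requirements "$dslot_memory" || echo "fails dynamic slot"
```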
Hope this helps.
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Fischer, Max (SCC) <max.fischer@xxxxxxx>
Sent: Friday, February 2, 2024 4:52 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Jobs of single group getting "lost" in negotiation-claim progress

Hi all,
our cluster is in a very weird state that I have never seen before. No idea how to reproduce this, but hoping anyone has ideas how to fix it. We're still stuck on the HTCondor 9.X series due to user requirements.

Since about three days, we observe a weird state for only a single group (ATLAS) out of roughly a dozen in total, which all work fine. What we observe is as follows:

1. The group is under its relative quota and the Negotiator is prioritising matching its jobs.
   - Jobs get matched to StartDs, as shown in the Negotiator log.
2. Once a job has been matched, it changes to `NumJobMatches = 0` in the queue *but does not start running*.
   - We see nothing in the Schedd, Shadow, Startd nor Starter logs for this job at this point.
   - The job is then stuck in this state, neither starting nor timing out a claim nor being re-matched.
   - The job also has an attribute `Matched = true` which isn't documented anywhere.
3. The Negotiator slowly reduces the count of "Claimed Cores" and "Requested Cores".
   - Consequently, it stops matching jobs of this group because it doesn't see them anymore.
   - We estimated that this matches the Negotiator plainly ignoring jobs in this stuck state.
   - As an example, at the same time the Negotiator sees jobs equaling 1400 slots [0] but the Schedds see about 3000 jobs total [1] requesting even more slots.

The points 2. and 3. are kinda problematic for us. ^^ What we especially don't get is that things are working perfectly fine for other groups. We have no special provisions (e.g. START, Requirements, GroupSortExpr, etc.) based on specific groups anywhere in the cluster.

Is there anything on *individual* jobs that could lead to such behaviour? Could there be some attributes that can interfere with jobs starting?
Cheers,
Max

[0] NegotiatorLog

02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Group   Computed    Config   Quota  Use     Auto    Claimed Requestd Submters Allocatd
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Name    quota       quota    static surplus Regroup cores   cores    in group cores
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) ----------------------------------------------------------------------------------------------------
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) <none>  7.27596e-12 0        N      Y       Y       0       0        33       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Alice   12323       0.237195 N      Y       Y       16312   17905    4        0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Atlas   15403.7     0.296493 N      Y       Y       1400    1400     4        0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Auger   167.185     0.003218 N      Y       Y       127     606      4        0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Babar   168.899     0.003251 N      Y       Y       0       0        0        0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Belle   2524.03     0.048583 N      Y       Y       593     592      4        0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) CMS     6893.96     0.132696 N      Y       Y       17688   30488    4        0

[1] # condor_q -global atlasprd -allusers -total

-- Schedd: htcondor-ce-4-kit.gridka.de : <[2a00:139c:3:2e5:0:61:6a:7b]:9618?... @ 02/02/24 11:34:26
Total for query: 1539 jobs; 0 completed, 0 removed, 878 idle, 661 running, 0 held, 0 suspended
Total for all users: 6221 jobs; 3 completed, 0 removed, 1525 idle, 4639 running, 54 held, 0 suspended

-- Schedd: htcondor-ce-3-kit.gridka.de : <[2a00:139c:3:2e5:0:61:6a:7d]:9618?... @ 02/02/24 11:34:26
Total for query: 1112 jobs; 0 completed, 0 removed, 998 idle, 114 running, 0 held, 0 suspended
Total for all users: 6057 jobs; 10 completed, 0 removed, 1639 idle, 4353 running, 55 held, 0 suspended

-- Schedd: pps-htcondor-ce.gridka.de : <[2a00:139c:3:2e5:0:61:d2:6c]:9618?... @ 02/02/24 11:34:26
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 33 jobs; 0 completed, 0 removed, 0 idle, 0 running, 33 held, 0 suspended

-- Schedd: htcondor-ce-1-kit.gridka.de : <[2a00:139c:3:2e5:0:61:1:6a]:9618?... @ 02/02/24 11:34:26
Total for query: 214 jobs; 0 completed, 0 removed, 152 idle, 62 running, 0 held, 0 suspended
Total for all users: 5687 jobs; 4 completed, 0 removed, 750 idle, 4933 running, 0 held, 0 suspended

-- Schedd: pps-token-htcondor-ce.gridka.de : <[2a00:139c:3:2e5:0:61:d2:8e]:9618?... @ 02/02/24 11:34:26
Total for query: 12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended
Total for all users: 12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended

-- Schedd: htcondor-ce-2-kit.gridka.de : <[2a00:139c:3:2e5:0:61:1:6c]:9618?... @ 02/02/24 11:34:26
Total for query: 43 jobs; 0 completed, 0 removed, 35 idle, 8 running, 0 held, 0 suspended
Total for all users: 5053 jobs; 12 completed, 0 removed, 546 idle, 4438 running, 57 held, 0 suspended

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at: https://lists.cs.wisc.edu/archive/htcondor-users/