[HTCondor-users] Jobs of single group getting "lost" in negotiation-claim progress
- Date: Fri, 2 Feb 2024 10:52:22 +0000
- From: "Fischer, Max (SCC)" <max.fischer@xxxxxxx>
- Subject: [HTCondor-users] Jobs of single group getting "lost" in negotiation-claim progress
Hi all,
Our cluster is in a very weird state that I have never seen before. I have no idea how to reproduce this, but I am hoping someone has ideas on how to fix it.
We're still stuck on the HTCondor 9.X series due to user requirements.
For about three days, we have been observing a weird state for only a single group (ATLAS) out of roughly a dozen in total, all of which otherwise work fine. What we observe is as follows:
1. The group is under its relative quota and the Negotiator is prioritising matching its jobs.
- Jobs get matched to StartDs, as shown in the Negotiator log.
2. Once a job has been matched, it changes to `NumJobMatches = 0` in the queue *but does not start running*.
- We see nothing in the Schedd, Shadow, Startd, or Starter logs for this job at this point.
- The job is then stuck in this state, neither starting nor timing out a claim nor being re-matched.
- The job also has an attribute `Matched = true`, which isn't documented anywhere.
3. The Negotiator slowly reduces the count of 'Claimed Cores' and 'Requested Cores'.
- Consequently, it stops matching jobs of this group because it doesn't see them anymore.
- We reckon this is consistent with the Negotiator simply ignoring jobs in this stuck state.
- As an example, the Negotiator sees jobs equaling 1400 slots [0], while at the same time the Schedds see about 3000 jobs in total [1], requesting even more slots.
The points 2. and 3. are kinda problematic for us. ^^
What we especially don't get is that things are working perfectly fine for other groups. We have no special provisions (e.g. START, Requirements, GroupSortExpr, etc.) based on specific groups anywhere in the cluster.
Is there anything on *individual* jobs that could lead to such behaviour? Could there be some attributes that can interfere with jobs starting?
Cheers,
Max
[0] NegotiatorLog
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Group   Computed     Config    Quota   Use      Auto     Claimed  Requestd  Submters  Allocatd
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Name    quota        quota     static  surplus  Regroup  cores    cores     in group  cores
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) ----------------------------------------------------------------------------------------------------
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) <none>  7.27596e-12  0         N       Y        Y        0        0         33        0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Alice   12323        0.237195  N       Y        Y        16312    17905     4         0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Atlas   15403.7      0.296493  N       Y        Y        1400     1400      4         0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Auger   167.185      0.003218  N       Y        Y        127      606       4         0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Babar   168.899      0.003251  N       Y        Y        0        0         0         0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Belle   2524.03      0.048583  N       Y        Y        593      592       4         0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) CMS     6893.96      0.132696  N       Y        Y        17688    30488     4         0
[1] # condor_q -global atlasprd -allusers -total
-- Schedd: htcondor-ce-4-kit.gridka.de : <[2a00:139c:3:2e5:0:61:6a:7b]:9618?... @ 02/02/24 11:34:26
Total for query: 1539 jobs; 0 completed, 0 removed, 878 idle, 661 running, 0 held, 0 suspended
Total for all users: 6221 jobs; 3 completed, 0 removed, 1525 idle, 4639 running, 54 held, 0 suspended
-- Schedd: htcondor-ce-3-kit.gridka.de : <[2a00:139c:3:2e5:0:61:6a:7d]:9618?... @ 02/02/24 11:34:26
Total for query: 1112 jobs; 0 completed, 0 removed, 998 idle, 114 running, 0 held, 0 suspended
Total for all users: 6057 jobs; 10 completed, 0 removed, 1639 idle, 4353 running, 55 held, 0 suspended
-- Schedd: pps-htcondor-ce.gridka.de : <[2a00:139c:3:2e5:0:61:d2:6c]:9618?... @ 02/02/24 11:34:26
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 33 jobs; 0 completed, 0 removed, 0 idle, 0 running, 33 held, 0 suspended
-- Schedd: htcondor-ce-1-kit.gridka.de : <[2a00:139c:3:2e5:0:61:1:6a]:9618?... @ 02/02/24 11:34:26
Total for query: 214 jobs; 0 completed, 0 removed, 152 idle, 62 running, 0 held, 0 suspended
Total for all users: 5687 jobs; 4 completed, 0 removed, 750 idle, 4933 running, 0 held, 0 suspended
-- Schedd: pps-token-htcondor-ce.gridka.de : <[2a00:139c:3:2e5:0:61:d2:8e]:9618?... @ 02/02/24 11:34:26
Total for query: 12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended
Total for all users: 12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended
-- Schedd: htcondor-ce-2-kit.gridka.de : <[2a00:139c:3:2e5:0:61:1:6c]:9618?... @ 02/02/24 11:34:26
Total for query: 43 jobs; 0 completed, 0 removed, 35 idle, 8 running, 0 held, 0 suspended
Total for all users: 5053 jobs; 12 completed, 0 removed, 546 idle, 4438 running, 57 held, 0 suspended
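
To make the "about 3000 jobs" figure from [1] easy to check, here is a minimal Python sketch that sums the "Total for query" lines across all Schedds. The lines are copied verbatim from the output above; the helper name `sum_totals` and the regular expression are illustrative assumptions and not part of any HTCondor tooling.

```python
import re

# "Total for query" lines copied from the condor_q output in [1].
CONDOR_Q_TOTALS = """\
Total for query: 1539 jobs; 0 completed, 0 removed, 878 idle, 661 running, 0 held, 0 suspended
Total for query: 1112 jobs; 0 completed, 0 removed, 998 idle, 114 running, 0 held, 0 suspended
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for query: 214 jobs; 0 completed, 0 removed, 152 idle, 62 running, 0 held, 0 suspended
Total for query: 12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended
Total for query: 43 jobs; 0 completed, 0 removed, 35 idle, 8 running, 0 held, 0 suspended
"""

# Captures the total, idle and running job counts of one summary line.
PATTERN = re.compile(r"Total for query: (\d+) jobs;.*?(\d+) idle, (\d+) running")


def sum_totals(text):
    """Sum job counts over all 'Total for query' lines (hypothetical helper)."""
    totals = {"jobs": 0, "idle": 0, "running": 0}
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match is None:
            continue  # skip anything that is not a per-Schedd summary line
        jobs, idle, running = (int(value) for value in match.groups())
        totals["jobs"] += jobs
        totals["idle"] += idle
        totals["running"] += running
    return totals


totals = sum_totals(CONDOR_Q_TOTALS)
print(totals)
```

This gives 2920 jobs in total (2075 idle, 845 running) across the six Schedds, compared to the 1400 requested cores the Negotiator reports for Atlas in [0]. Note that the two numbers are in different units (jobs versus cores), so the real gap in requested slots is larger than the job count alone suggests.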