Hello, Greg.
Here is the output of condor_q -better -machine:
[kiaf@kiaf-ui ~]$ condor_q -better 525312 -machine cms-gpu01.sdfarm.kr
TARGET.Arch = "X86_64"
TARGET.Disk = 2913788
TARGET.HasFileTransfer = true
TARGET.Memory = 1024
TARGET.OpSys = "LINUX"
The Requirements expression for job 525312.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          26  TARGET.Arch == "X86_64"
[1]          26  TARGET.OpSys == "LINUX"
[3]          26  TARGET.Disk >= RequestDisk
[5]          26  TARGET.Memory >= RequestMemory
[7]          26  TARGET.HasFileTransfer
525312.000: Run analysis summary ignoring user priority. Of 4 machines,
0 are rejected by your job's requirements
2 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
2 are able to run your job
Hello, Greg.
That slot was also matched and running at 9:30pm last night.
[geonmo2@ifarm-ui condor_log]$ cat failed_MatchLog | grep group_alice.kiaf | tail -n 5
07/23/24 18:30:33 Matched 454744.0 group_alice.kiaf@xxxxxxxxx <134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76> preempting none <134.75.124.201:9618?addrs=134.75.124.201-9618&alias=cms-gpu01.sdfarm.kr&noUDP&sock=startd_21740_6904> slot4@xxxxxxxxxxxxxxxxxxx
07/23/24 19:00:32 Matched 454745.0 group_alice.kiaf@xxxxxxxxx <134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76> preempting none <134.75.124.201:9618?addrs=134.75.124.201-9618&alias=cms-gpu01.sdfarm.kr&noUDP&sock=startd_21740_6904> slot4@xxxxxxxxxxxxxxxxxxx
07/23/24 19:30:36 Matched 454746.0 group_alice.kiaf@xxxxxxxxx <134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76> preempting none <134.75.124.201:9618?addrs=134.75.124.201-9618&alias=cms-gpu01.sdfarm.kr&noUDP&sock=startd_21740_6904> slot4@xxxxxxxxxxxxxxxxxxx
07/23/24 20:00:40 Matched 454747.0 group_alice.kiaf@xxxxxxxxx <134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76> preempting none <134.75.124.201:9618?addrs=134.75.124.201-9618&alias=cms-gpu01.sdfarm.kr&noUDP&sock=startd_21740_6904> slot4@xxxxxxxxxxxxxxxxxxx
07/23/24 21:33:01 Matched 510407.0 group_alice.kiaf@xxxxxxxxx <134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76> preempting none <134.75.124.201:9618?addrs=134.75.124.201-9618&alias=cms-gpu01.sdfarm.kr&noUDP&sock=startd_21740_6904> slot4@xxxxxxxxxxxxxxxxxxx
However, once the schedd in that group accumulates a few tens of thousands of jobs, job matching stops.
After analyzing the logs, it seems that when the Negotiator requests job information from the remote schedd, the job information is not sent back to the Negotiator.
I'll forward you the full day's logs for now.
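To see when matching stopped for a submitter, the "Matched" lines can be tallied per hour. Below is a minimal sketch, assuming the MatchLog line layout shown above. The function name matches_per_hour and the SAMPLE text (with shortened addresses) are only illustrative and not part of HTCondor.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
#   MM/DD/YY HH:MM:SS Matched <job> <submitter> <schedd addr> preempting <x> <startd addr> <slot>
MATCHED = re.compile(
    r"^(\d\d/\d\d/\d\d \d\d):\d\d:\d\d\s+Matched\s+(\S+)\s+(\S+)\s+"
    r"<[^>]*>\s+preempting\s+\S+\s+<[^>]*>\s+(\S+)"
)

def matches_per_hour(lines, submitter_prefix):
    """Count 'Matched' lines per hour for submitters starting with the prefix."""
    counts = Counter()
    for line in lines:
        m = MATCHED.match(line.strip())
        if m and m.group(3).startswith(submitter_prefix):
            counts[m.group(1)] += 1  # key looks like '07/23/24 18'
    return dict(counts)

# Illustrative sample only; the addresses are shortened placeholders.
SAMPLE = """\
07/23/24 18:30:33 Matched 454744.0 group_alice.kiaf@example <10.0.0.1:9618?sock=schedd> preempting none <10.0.0.2:9618?sock=startd> slot4@example
07/23/24 19:00:32 Matched 454745.0 group_alice.kiaf@example <10.0.0.1:9618?sock=schedd> preempting none <10.0.0.2:9618?sock=startd> slot4@example
07/23/24 19:30:36 Matched 454746.0 group_alice.kiaf@example <10.0.0.1:9618?sock=schedd> preempting none <10.0.0.2:9618?sock=startd> slot4@example
"""

if __name__ == "__main__":
    print(matches_per_hour(SAMPLE.splitlines(), "group_alice.kiaf"))
```

An hour missing from the result while jobs were still idle would mark roughly when matching stopped for that group.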
-- Normal case --
07/24/24 07:00:56 Phase 3: Sorting submitter ads by priority ...
07/24/24 07:00:56 Starting prefetch round; 1 potential prefetches to do.
07/24/24 07:00:56 Assigned 1 units of work for prefetching.
07/24/24 07:00:56 Starting prefetch loop.
07/24/24 07:00:56 Starting prefetch negotiation for group_genome.bio.bio@xxxxxxxxxx
07/24/24 07:00:56 Socket to group_genome.bio.bio@xxxxxxxxx (<134.75.127.179:9618?addrs=134.75.127.179-9618&alias=bio-ui7.sdfarm.kr&noUDP&sock=schedd_3395598_5f6c>) already in cache, reusing
07/24/24 07:00:56 Started NEGOTIATE with remote schedd; protocol version 1.
07/24/24 07:00:56 Sending SEND_RESOURCE_REQUEST_LIST/200/eom
07/24/24 07:00:56 Getting reply from schedd ...
07/24/24 07:00:56 Prefetch negotiation would block.
07/24/24 07:00:56 Waiting on the results of 1 negotiation sessions.
07/24/24 07:00:56 Getting reply from schedd ...
07/24/24 07:00:56 Got JOB_INFO command; getting classad/eom
07/24/24 07:00:56 Getting reply from schedd ...
07/24/24 07:00:56 Got NO_MORE_JOBS; schedd has no more requests
07/24/24 07:00:56 Send END_NEGOTIATE to remote schedd
07/24/24 07:00:56 Assigned 0 units of work for prefetching.
07/24/24 07:00:56 Prefetch summary: 1 attempted, 1 successful.
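To find the negotiations where no job information came back, the Negotiator log can be scanned for prefetch sessions that never log "Got JOB_INFO". Below is a minimal sketch, assuming the message wording in the excerpt above and only one prefetch session in flight at a time (interleaved sessions would need to be told apart some other way). The function name summarize_prefetch and the SAMPLE text are only illustrative.

```python
import re

# Timestamped log line: 'MM/DD/YY HH:MM:SS message'
TS = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) (.*)$")
START = re.compile(r"Starting prefetch negotiation for (\S+)")

def summarize_prefetch(lines):
    """Return one dict per prefetch negotiation found in the log lines."""
    sessions = []
    current = None
    for line in lines:
        m = TS.match(line.strip())
        if not m:
            continue
        stamp, msg = m.groups()
        started = START.search(msg)
        if started:
            current = {"time": stamp, "submitter": started.group(1),
                       "job_info": 0, "no_more_jobs": False}
            sessions.append(current)
        elif current is not None:
            if "Got JOB_INFO" in msg:
                current["job_info"] += 1      # schedd sent a job ad
            elif "Got NO_MORE_JOBS" in msg:
                current["no_more_jobs"] = True
    return sessions

# Sample copied from the normal case above (trimmed).
SAMPLE = """\
07/24/24 07:00:56 Starting prefetch negotiation for group_genome.bio.bio@xxxxxxxxxx
07/24/24 07:00:56 Sending SEND_RESOURCE_REQUEST_LIST/200/eom
07/24/24 07:00:56 Getting reply from schedd ...
07/24/24 07:00:56 Got JOB_INFO command; getting classad/eom
07/24/24 07:00:56 Getting reply from schedd ...
07/24/24 07:00:56 Got NO_MORE_JOBS; schedd has no more requests
07/24/24 07:00:56 Send END_NEGOTIATE to remote schedd
"""

if __name__ == "__main__":
    for s in summarize_prefetch(SAMPLE.splitlines()):
        flag = "no JOB_INFO received" if s["job_info"] == 0 else "ok"
        print(s["time"], s["submitter"], s["job_info"], flag)
```

Sessions reported with a job_info count of 0 would be the ones to compare against the normal case.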
Hello, everyone.
I'm running a cluster with multiple accounting groups.
Here is the information on the cluster, including quota, etc.
Hi Genomo:
Are you sure the job matches the dedicated slot? You can use
condor_q -better <job.id> -machine name_of_dedicated_machine
-greg