
[HTCondor-users] [RE][RE]Re: About negotiation opportunity conditions in a multi-accounting group environment



Hello, Greg.


Here is the output of condor_q -better -machine:


[kiaf@kiaf-ui ~]$ condor_q -better 525312 -machine cms-gpu01.sdfarm.kr

....
slot4@xxxxxxxxxxxxxxxxxxx has the following attributes:


    TARGET.Arch = "X86_64"

    TARGET.Disk = 2913788

    TARGET.HasFileTransfer = true

    TARGET.Memory = 1024

    TARGET.OpSys = "LINUX"


The Requirements expression for job 525312.000 reduces to these conditions:


         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          26  TARGET.Arch == "X86_64"
[1]          26  TARGET.OpSys == "LINUX"
[3]          26  TARGET.Disk >= RequestDisk
[5]          26  TARGET.Memory >= RequestMemory
[7]          26  TARGET.HasFileTransfer



525312.000:  Run analysis summary ignoring user priority.  Of 4 machines,

      0 are rejected by your job's requirements

      2 reject your job because of their own requirements

      0 match and are already running your jobs

      0 match but are serving other users

      2 are able to run your job


Slots 2 and 3 of cms-gpu01 reject the job because their START expressions are for other accounting groups.



----- Original Message -----
From : Geonmo Ryu <geonmo@xxxxxxxxxxx>
To : HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc : Greg Thain <gthain@xxxxxxxxxxx>
Sent : 2024-07-24 09:56:30
Subject : [RE]Re: [HTCondor-users] About negotiation opportunity conditions in a multi-accounting group environment


Hello, Greg.


That slot was also matched and running at 9:30pm last night.


[geonmo2@ifarm-ui condor_log]$ cat failed_MatchLog | grep group_alice.kiaf | tail -n 5

07/23/24 18:30:33       Matched 454744.0 group_alice.kiaf@xxxxxxxxx <134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76> preempting none <134.75.124.201:9618?addrs=134.75.124.201-9618&alias=cms-gpu01.sdfarm.kr&noUDP&sock=startd_21740_6904> slot4@xxxxxxxxxxxxxxxxxxx

07/23/24 19:00:32       Matched 454745.0 group_alice.kiaf@xxxxxxxxx <134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76> preempting none <134.75.124.201:9618?addrs=134.75.124.201-9618&alias=cms-gpu01.sdfarm.kr&noUDP&sock=startd_21740_6904> slot4@xxxxxxxxxxxxxxxxxxx

07/23/24 19:30:36       Matched 454746.0 group_alice.kiaf@xxxxxxxxx <134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76> preempting none <134.75.124.201:9618?addrs=134.75.124.201-9618&alias=cms-gpu01.sdfarm.kr&noUDP&sock=startd_21740_6904> slot4@xxxxxxxxxxxxxxxxxxx

07/23/24 20:00:40       Matched 454747.0 group_alice.kiaf@xxxxxxxxx <134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76> preempting none <134.75.124.201:9618?addrs=134.75.124.201-9618&alias=cms-gpu01.sdfarm.kr&noUDP&sock=startd_21740_6904> slot4@xxxxxxxxxxxxxxxxxxx

07/23/24 21:33:01       Matched 510407.0 group_alice.kiaf@xxxxxxxxx <134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76> preempting none <134.75.124.201:9618?addrs=134.75.124.201-9618&alias=cms-gpu01.sdfarm.kr&noUDP&sock=startd_21740_6904> slot4@xxxxxxxxxxxxxxxxxxx
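
To see when each submitter was last matched, lines like the ones above can be summarized with a short script. This is only a minimal Python sketch, not an HTCondor tool: the function names are hypothetical, and the regular expression assumes every "Matched" line has exactly the layout shown above (in particular, a single word after "preempting").

```python
import re

# Fields: timestamp, "Matched", job id, submitter, <schedd address>,
# "preempting <word>", <startd address>, slot name.
# Assumption: the layout is exactly that of the MatchLog lines quoted above.
MATCH_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+Matched\s+(\d+\.\d+)\s+(\S+)"
    r"\s+<[^>]*>\s+preempting\s+\S+\s+<[^>]*>\s+(\S+)\s*$"
)


def parse_matches(log_text):
    """Return (timestamp, job_id, submitter, slot) for each Matched line."""
    rows = []
    for line in log_text.splitlines():
        m = MATCH_RE.match(line.strip())
        if m:
            rows.append(m.groups())
    return rows


def last_match_per_submitter(rows):
    """Map each submitter to its most recent (timestamp, job_id, slot).

    Relies on the log being in chronological order, so later rows win.
    """
    last = {}
    for timestamp, job_id, submitter, slot in rows:
        last[submitter] = (timestamp, job_id, slot)
    return last
```

Feeding the whole MatchLog to parse_matches and then to last_match_per_submitter gives the time after which a group stopped getting matches.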



However, once the schedd for that group accumulates a few tens of thousands of jobs, job matching stops.


After analyzing the logs, it seems that when the negotiator requests job information from the remote schedd, the schedd does not send it.


I'll forward you the full day's logs for now.


-- Normal case --

07/24/24 07:00:56 Phase 3:  Sorting submitter ads by priority ...

07/24/24 07:00:56 Starting prefetch round; 1 potential prefetches to do.

07/24/24 07:00:56 Assigned 1 units of work for prefetching.

07/24/24 07:00:56 Starting prefetch loop.

07/24/24 07:00:56 Starting prefetch negotiation for group_genome.bio.bio@xxxxxxxxxx

07/24/24 07:00:56 Socket to group_genome.bio.bio@xxxxxxxxx (<134.75.127.179:9618?addrs=134.75.127.179-9618&alias=bio-ui7.sdfarm.kr&noUDP&sock=schedd_3395598_5f6c>) already in cache, reusing

07/24/24 07:00:56 Started NEGOTIATE with remote schedd; protocol version 1.

07/24/24 07:00:56     Sending SEND_RESOURCE_REQUEST_LIST/200/eom

07/24/24 07:00:56     Getting reply from schedd ...

07/24/24 07:00:56 Prefetch negotiation would block.

07/24/24 07:00:56 Waiting on the results of 1 negotiation sessions.

07/24/24 07:00:56     Getting reply from schedd ...

07/24/24 07:00:56     Got JOB_INFO command; getting classad/eom

07/24/24 07:00:56     Getting reply from schedd ...

07/24/24 07:00:56     Got NO_MORE_JOBS;  schedd has no more requests

07/24/24 07:00:56     Send END_NEGOTIATE to remote schedd

07/24/24 07:00:56 Assigned 0 units of work for prefetching.

07/24/24 07:00:56 Prefetch summary: 1 attempted, 1 successful.


-- Abnormal case --
07/24/24 07:00:56 Phase 3:  Sorting submitter ads by priority ...
07/24/24 07:00:56 Starting prefetch round; 2 potential prefetches to do.
07/24/24 07:00:56 Assigned 1 units of work for prefetching.
07/24/24 07:00:56 Starting prefetch loop.
07/24/24 07:00:56 Starting prefetch negotiation for group_alice.kiaf@xxxxxxxxxx
07/24/24 07:00:56 Socket to group_alice.kiaf@xxxxxxxxx (<134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76>) already in cache, reusing
07/24/24 07:00:56 Started NEGOTIATE with remote schedd; protocol version 1.
07/24/24 07:00:56     Sending SEND_RESOURCE_REQUEST_LIST/200/eom
07/24/24 07:00:56     Getting reply from schedd ...
07/24/24 07:00:56 Prefetch negotiation would block.
07/24/24 07:00:56 Waiting on the results of 1 negotiation sessions.
07/24/24 07:00:57     Getting reply from schedd ...
07/24/24 07:00:57     Got NO_MORE_JOBS;  schedd has no more requests
07/24/24 07:00:57 Assigned 1 units of work for prefetching.
07/24/24 07:00:57 Starting prefetch loop.
07/24/24 07:00:57 Starting prefetch negotiation for group_alice.xhaxha@xxxxxxxxxx
07/24/24 07:00:57 Socket to group_alice.xhaxha@xxxxxxxxx (<134.75.125.41:9618?addrs=134.75.125.41-9618+[2001-320-15-125-ca1f-66ff-fedb-5d65]-9618&alias=kiaf-ui.sdfarm.kr&noUDP&sock=schedd_79164_7f76>) already in cache, reusing
07/24/24 07:00:57 Started NEGOTIATE with remote schedd; protocol version 1.
07/24/24 07:00:57     Sending SEND_RESOURCE_REQUEST_LIST/200/eom
07/24/24 07:00:57     Getting reply from schedd ...
07/24/24 07:00:57 Prefetch negotiation would block.
07/24/24 07:00:57 Waiting on the results of 1 negotiation sessions.
07/24/24 07:00:57     Getting reply from schedd ...
07/24/24 07:00:57     Got NO_MORE_JOBS;  schedd has no more requests
07/24/24 07:00:57 Assigned 0 units of work for prefetching.
07/24/24 07:00:57 Prefetch summary: 2 attempted, 2 successful.
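
The difference between the two cases (a JOB_INFO reply before NO_MORE_JOBS, or none at all) can be checked mechanically over a full day's log. The following is a minimal Python sketch, not an HTCondor tool: the function name is hypothetical, and it assumes the replies for one submitter are logged before the next "Starting prefetch negotiation" line, which may not hold if several prefetch sessions run at once.

```python
import re

# Matches lines such as
# "07/24/24 07:00:56 Starting prefetch negotiation for group_alice.kiaf@..."
START_RE = re.compile(r"Starting prefetch negotiation for (\S+)")


def classify_prefetches(log_text):
    """Return a list of (submitter, job_info_count) in log order.

    A count of 0 means the schedd answered NO_MORE_JOBS without sending
    any JOB_INFO, as in the abnormal case above.
    Assumption: sessions do not interleave in the log.
    """
    results = []
    current = None  # [submitter, job_info_count] for the open session
    for line in log_text.splitlines():
        m = START_RE.search(line)
        if m:
            # A new session starts; close any session left open.
            if current is not None:
                results.append(tuple(current))
            current = [m.group(1), 0]
        elif current is not None and "Got JOB_INFO" in line:
            current[1] += 1
        elif current is not None and "Got NO_MORE_JOBS" in line:
            results.append(tuple(current))
            current = None
    if current is not None:
        results.append(tuple(current))
    return results
```

Submitters that show a count of 0 in every negotiation cycle while they still have idle jobs are the ones worth looking at on the schedd side.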





----- Original Message -----
From : Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx>
To : <htcondor-users@xxxxxxxxxxx>
Cc : "Greg Thain" <gthain@xxxxxxxxxxx>
Sent : 2024-07-24 07:22:36
Subject : Re: [HTCondor-users] About negotiation opportunity conditions in a multi-accounting group environment


On 7/18/24 21:03, Geonmo Ryu wrote:

Hello, everyone.


I'm running a cluster with multiple accounting groups.


Here is the information on the cluster, including quota, etc.


Hi Geonmo:


Are you sure the job matches the dedicated slot?  You can use


condor_q -better <job.id> -machine name_of_dedicated_machine


-greg