You could try using NEGOTIATOR_PRE_JOB_RANK to sort the machines by memory so that higher-ranked machines are matched first.
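A minimal sketch of that idea for the negotiator configuration on the central manager (the exact expression is just an assumption; Memory and Cpus here are the unclaimed resources advertised by each partitionable slot):

# prefer the machines with the most free memory first
NEGOTIATOR_PRE_JOB_RANK = MY.Memory
# keep a breadth-first tie-breaker on free CPUs
NEGOTIATOR_POST_JOB_RANK = MY.Cpus

followed by a condor_reconfig so the negotiator picks it up.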
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Guilherme De Sousa <garanha@xxxxxxxxxxxx>
Sent: Thursday, March 25, 2021 1:49 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to distribute jobs round robin

Nevermind!
My current pool is very heterogeneous, and the hosts being chosen are the ones with far more CPUs, which is why it keeps looping around these two!
Any suggestions on how I could also take memory availability into account, since I'm using partitionable slots?
Best regards,
Guilherme de Sousa Aranha
From: Guilherme De Sousa
OK, so after searching a bit more and changing my terminology from round robin to breadth-first (probably more accurate and correct), I found this:
https://www-auth.cs.wisc.edu/lists/htcondor-users/2016-November/msg00032.shtml
which suggests:

NEGOTIATOR_PRE_JOB_RANK = 0
NEGOTIATOR_POST_JOB_RANK = +MY.Cpus
After applying this on my central manager and running condor_reconfig, the jobs are starting on new hosts, although they tend to loop between only 3 of them instead of spreading across all 9.
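A quick way to double-check that the negotiator actually picked up the new values would be something along these lines, run against the central manager:

condor_reconfig -daemon negotiator
condor_config_val -negotiator NEGOTIATOR_PRE_JOB_RANK NEGOTIATOR_POST_JOB_RANK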
Can someone tell me if this is an acceptable approach? :)
Best regards,
Guilherme de Sousa Aranha
I had a typo when I copy-pasted it: CLAIM_PARTITIONABLE_LEFTOVERSE (extra *E* at the end) instead of CLAIM_PARTITIONABLE_LEFTOVERS. I also ran condor_reconfig, but the jobs keep starting on wrk03.
I’m pretty sure they all match the jobs; example of a better-analyze:
1107.000: Job is running.
Last successful match: Wed Mar 24 19:04:01 2021
1107.000: Run analysis summary ignoring user priority. Of 9 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      9 are able to run your job
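For context, that summary comes from something like:

condor_q -better-analyze 1107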
I’ve also started a few big jobs to fill up wrk03, and the last job started on a new host.
Best regards,
Guilherme de Sousa Aranha
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of John M Knoeller
Did you condor_reconfig after making the change?
Are you sure that all of the machine can match the jobs?
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
on behalf of Guilherme De Sousa <garanha@xxxxxxxxxxxx>
Didn’t work either… It still only starts jobs on wrk03, as you can see from condor_status:
[root@srv-sub01 ~]# condor_status

Name                              OpSys  Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000 1031967 112+02:41:53
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  193385 114+04:00:43
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  131945 114+04:05:09
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000   32768   2+06:59:04
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   2+05:48:02
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   1+07:08:47
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   1+02:30:42
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   0+06:38:09
slot1_6@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   0+01:13:57
slot1_7@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   0+00:00:03
slot1_8@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   0+00:00:03
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  515937 117+02:51:10
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  515953 117+02:34:03
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  515953 112+02:43:40
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  515953 117+03:00:16
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  128617 112+02:42:23
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000 1031959 112+02:47:07
Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX 17 0 8 9 0 0 0
Total 17 0 8 9 0 0 0
[root@srv-sub01 ~]#
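For a more compact view of what is still unclaimed on each partitionable slot, something like this should work (on a partitionable slot, Cpus and Memory report the remaining free resources):

condor_status -constraint 'PartitionableSlot' -af:h Name Cpus Memory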
Best regards,
Guilherme de Sousa Aranha

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of John M Knoeller
When using partitionable slots, the schedd can start more than one job on a single partitionable slot for each match that it gets from the negotiator. This leads to something that looks like depth-first matching.
If you configure
CLAIM_PARTITIONABLE_LEFTOVERSE = false
in the schedd, then it will start only one job for each match it gets from the negotiator, and your negotiator matching policy will have more traction.
The downside is that it will take many more negotiation cycles for a schedd to fill up a partitionable slot, and if your machines are going to end up completely full anyway, that is wasted effort.
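As a minimal sketch, that would be something like the following in the submit host's (schedd's) configuration, followed by a condor_reconfig of the schedd:

# start only one job per negotiator match instead of claiming the leftovers
CLAIM_PARTITIONABLE_LEFTOVERS = False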
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
on behalf of Guilherme De Sousa <garanha@xxxxxxxxxxxx>
Hi Michael,