You could try using NEGOTIATOR_PRE_JOB_RANK to sort the machines by memory so that higher-ranked machines are matched first.
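A minimal sketch of that idea for the negotiator configuration on the central manager (the exact expression is just an assumption; Memory and Cpus here are the unclaimed resources advertised by each partitionable slot):

# prefer the machines with the most free memory first
NEGOTIATOR_PRE_JOB_RANK = MY.Memory
# keep a breadth-first tie-breaker on free CPUs
NEGOTIATOR_POST_JOB_RANK = MY.Cpus

followed by a condor_reconfig so the negotiator picks it up.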
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Guilherme De Sousa <garanha@xxxxxxxxxxxx>
Sent: Thursday, March 25, 2021 1:49 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to distribute jobs round robin

Nevermind!
My current pool is very heterogeneous, and the hosts being chosen are the ones with far more CPUs, which is why it keeps looping around these two!
Any suggestions on how I could also take memory availability into account, since I'm using partitionable slots?
Best regards,
Guilherme de Sousa Aranha
From: Guilherme De Sousa
OK, so after searching a bit more and changing my terminology from round robin to breadth-first (probably more accurate and correct), I found this:
https://www-auth.cs.wisc.edu/lists/htcondor-users/2016-November/msg00032.shtml
which suggests:

NEGOTIATOR_PRE_JOB_RANK = 0
NEGOTIATOR_POST_JOB_RANK = +MY.Cpus
After applying this on my central manager and running condor_reconfig, the jobs are starting on new hosts, although they tend to loop between only 3 of them instead of spreading across all 9.
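A quick way to double-check that the negotiator actually picked up the new values would be something along these lines, run against the central manager:

condor_reconfig -daemon negotiator
condor_config_val -negotiator NEGOTIATOR_PRE_JOB_RANK NEGOTIATOR_POST_JOB_RANK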
Can someone tell me if this is an acceptable approach? :)
Best regards,
Guilherme de Sousa Aranha
I had a typo when I copy-pasted it: CLAIM_PARTITIONABLE_LEFTOVERSE (extra *E* at the end) instead of CLAIM_PARTITIONABLE_LEFTOVERS. I also ran condor_reconfig, but the jobs keep starting on wrk03.
I’m pretty sure they all match the jobs; example of a better-analyze:
1107.000: Job is running.
Last successful match: Wed Mar 24 19:04:01 2021
1107.000: Run analysis summary ignoring user priority. Of 9 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      9 are able to run your job
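For context, that summary comes from something like:

condor_q -better-analyze 1107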
I’ve also started a few big jobs to fill up wrk03, and the last job started on a new host.
Best regards,
Guilherme de Sousa Aranha
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of John M Knoeller
Did you condor_reconfig after making the change?
Are you sure that all of the machine can match the jobs?
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
on behalf of Guilherme De Sousa <garanha@xxxxxxxxxxxx>
Didn’t work either… It still only starts jobs on wrk03, as you can see from condor_status:
[root@srv-sub01 ~]# condor_status

Name                              OpSys  Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000 1031967 112+02:41:53
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  193385 114+04:00:43
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  131945 114+04:05:09
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000   32768   2+06:59:04
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   2+05:48:02
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   1+07:08:47
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   1+02:30:42
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   0+06:38:09
slot1_6@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   0+01:13:57
slot1_7@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   0+00:00:03
slot1_8@xxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64 Claimed   Busy      0.000    4096   0+00:00:03
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  515937 117+02:51:10
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  515953 117+02:34:03
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  515953 112+02:43:40
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  515953 117+03:00:16
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000  128617 112+02:42:23
slot1@xxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle      0.000 1031959 112+02:47:07
Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX 17 0 8 9 0 0 0
Total 17 0 8 9 0 0 0
[root@srv-sub01 ~]#
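For a more compact view of what is still unclaimed on each partitionable slot, something like this should work (on a partitionable slot, Cpus and Memory report the remaining free resources):

condor_status -constraint 'PartitionableSlot' -af:h Name Cpus Memory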
Best regards,
Guilherme de Sousa Aranha

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of John M Knoeller
When using partitionable slots, the schedd can start more than one job on a single partitionable slot for each match that it gets from the negotiator. This leads to something that looks like depth-first matching.
If you configure
CLAIM_PARTITIONABLE_LEFTOVERSE = false
in the schedd, then it will start only one job for each match it gets from the negotiator, and your negotiator matching policy will have more traction.
The downside is that it will take many more negotiation cycles for a schedd to fill up a partitionable slot, and if your machines are going to end up completely full anyway, that is wasted effort.
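As a minimal sketch, that would be something like the following in the submit host's (schedd's) configuration, followed by a condor_reconfig of the schedd:

# start only one job per negotiator match instead of claiming the leftovers
CLAIM_PARTITIONABLE_LEFTOVERS = False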
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
on behalf of Guilherme De Sousa <garanha@xxxxxxxxxxxx>
Hi Michael,