
Re: [HTCondor-users] Fix LoadAvg values, so that they better reflect the number of CPUs of a slot



I dug into your PR and the changes from October, and I think I see a way to square them.

The October PR changes the overall loop where load is assigned, because the older code was sorting the slots
and would end up moving the load assignment from slot to slot as the slots changed state, which is not ideal. 
The older code also didn't work correctly for machines with both p-slots and static slots, or machines with multiple p-slots.

I think the issue you are having comes from this bit of code, which propagates the p-slot load into all of the d-slots:

    if (parent) {
        // d-slots inherit owner load from the parent, clamped to the cpu count of the d-slot
        double parent_load = parent->owner_load();
        double dslot_load = MIN(parent_load, rip->r_attr->total_cpus());
        rip->set_owner_load(dslot_load);

With this code, all of the d-slots will be evicted at once if the load hits the given threshold.
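
To make the effect concrete, here is a small standalone sketch of what the clamp does. It is not the startd code: `inherited_load` is a name made up for illustration, and plain vectors stand in for the real slot objects.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the snippet above: every d-slot receives the
// parent's owner load, clamped to that d-slot's own cpu count.
std::vector<double> inherited_load(double parent_load,
                                   const std::vector<int>& dslot_cpus) {
    std::vector<double> loads;
    for (int cpus : dslot_cpus) {
        loads.push_back(std::min(parent_load, static_cast<double>(cpus)));
    }
    return loads;
}
```

With a parent owner load of 4.0 and d-slots of 1, 2, and 8 cpus, this yields 1.0, 2.0, and 4.0. Every d-slot is charged at the same moment, so the total charged (7.0) exceeds the 4.0 that actually exists on the machine.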

It would be more like the behavior of your PR to assign some of the load to the idle cores on the p-slot, 
and the remaining load to the d-slots underneath the p-slot until there is no load remaining.  This would
give the same effect as your PR for machines that have a single p-slot, but also give stable behavior
for machines with more than one p-slot or for those with both static slots and p-slots. 
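
A rough sketch of that idea, under stated assumptions: `LoadSplit` and `distribute_owner_load` are hypothetical names, the d-slots are visited in whatever fixed order the caller passes them, and none of this is taken from the startd source.

```cpp
#include <algorithm>
#include <vector>

struct LoadSplit {
    double pslot_load;                // load charged to the p-slot's idle cores
    std::vector<double> dslot_loads;  // load charged to each d-slot, in input order
};

// Hypothetical sketch: charge owner load to the p-slot's idle cpus first,
// then spill what is left onto the d-slots one at a time, each capped at
// its own cpu count, until no load remains.
LoadSplit distribute_owner_load(double owner_load, int pslot_idle_cpus,
                                const std::vector<int>& dslot_cpus) {
    LoadSplit split;
    double remaining = std::max(owner_load, 0.0);

    split.pslot_load = std::min(remaining, static_cast<double>(pslot_idle_cpus));
    remaining -= split.pslot_load;

    for (int cpus : dslot_cpus) {
        double share = std::min(remaining, static_cast<double>(cpus));
        split.dslot_loads.push_back(share);
        remaining -= share;
    }
    return split;
}
```

For an owner load of 4.0, a p-slot with 2 idle cpus, and d-slots of 1, 2, and 8 cpus, the p-slot absorbs 2.0, the first d-slot 1.0, the second 1.0, and the third 0.0. Only the d-slots that actually receive load would be candidates for eviction, and the total charged never exceeds the measured load.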

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Angel de Vicente <angel.vicente.garrido@xxxxxxxxx>
Sent: Monday, February 3, 2025 5:56 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Fix LoadAvg values, so that they better reflect the number of CPUs of a slot
 
Hello,

some months ago (March'24) I submitted a PR to modify the way LoadAvg
was calculated for dynamic slots. This was accepted, but later (Oct'24)
the relevant code was changed. While upgrading HTCondor to version
24.0.3 today, I realized that the new LoadAvg calculation doesn't
work "properly" (at least not for our use case).

I just wrote a comment about it in the original PR
(https://github.com/htcondor/htcondor/pull/2317), but I'm not sure
comments in an already merged pull request will get any attention, so I
thought of sending it here as well. The comment is
https://github.com/htcondor/htcondor/pull/2317#issuecomment-2630656894

Hopefully John Knoeller reads this and he can explain the rationale
behind his Oct'24 commit?

Cheers,
--
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Instituto de Astrofísica de Canarias (https://www.iac.es/en)

