Re: [HTCondor-users] LoadAvg values in PartitionableSlots expected?
- Date: Thu, 14 Mar 2024 08:56:49 +0000
- From: Angel de Vicente <angel.vicente.garrido@xxxxxxxxx>
- Subject: Re: [HTCondor-users] LoadAvg values in PartitionableSlots expected?
Hello,
Angel de Vicente
<angel.vicente.garrido@xxxxxxxxx> writes:
> # condor_status xxxx.xx.xxx.xx -af:h Name Totalcpus Cpus LoadAvg condorloadavg totalloadavg totalcondorloadavg
> Name                   Totalcpus  Cpus  LoadAvg  condorloadavg  totalloadavg  totalcondorloadavg
> slot1@xxxxxxxxxxxxxx   32.0       16    1.0      0.0            32.03         16.01
> slot1_1@xxxxxxxxxxxxxx 32.0       16    17.01    16.01          32.03         16.01
>
> It looks as if the non-condor load in the LoadAvg variable is always
> capped at 1.0. Not sure if this is a bug or it is by design. If it is by
> design, what is the reasoning behind it?
Looking at the source code I can see where this is coming from; as the
comment there says, it doesn't make much sense for multi-core slots,
which is what I'm trying to configure.
file: src/condor_startd.V6/ResMgr.cpp
,----
| // Distribute the owner load over the slots, assign an owner load of 1.0
| // to each slot until the remainder is less than 1.0, then assign the remainder
| // to the next slot, and 0 to all of the remaining slots.
| // Note that before HTCondor 10.x we would assign *all* of the remainder to the last slot
| // even if the value was greater than 1.0, but other than that this algorithm is
| // the same as before. This algorithm doesn't make a lot of sense for multi-core slots
| // but it's the way it has always worked so...
| for (Resource* rip : active) {
| if (total_owner_load < 1.0) {
| rip->set_owner_load(total_owner_load);
| total_owner_load = 0;
| } else {
| rip->set_owner_load(1.0);
| total_owner_load -= 1.0;
| }
| }
`----
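Just to convince myself that this loop explains the numbers above, here
is a quick standalone sketch (not HTCondor code, just redoing the loop
by hand with the values from the condor_status output, and assuming
both slots are what ends up in the 'active' list):

,----
| // Standalone sketch: redo the loop above with the numbers from the
| // condor_status output, i.e. totalloadavg 32.03 and totalcondorloadavg 16.01,
| // so roughly 16.02 of non-condor ("owner") load to spread over two slots.
| #include <cstdio>
|
| int main() {
|     const double condor_load[2] = {0.0, 16.01};    // slot1, slot1_1
|     double owner_load[2] = {0.0, 0.0};
|     double total_owner_load = 32.03 - 16.01;       // ~16.02 of non-condor load
|
|     for (int i = 0; i < 2; ++i) {
|         if (total_owner_load < 1.0) {
|             owner_load[i] = total_owner_load;
|             total_owner_load = 0;
|         } else {
|             owner_load[i] = 1.0;                   // capped at 1.0 per slot
|             total_owner_load -= 1.0;
|         }
|     }
|     // the leftover (~14.02 here) is simply never advertised on any slot
|
|     for (int i = 0; i < 2; ++i)
|         std::printf("LoadAvg = %.2f\n", condor_load[i] + owner_load[i]);
|     // prints 1.00 and 17.01, matching the condor_status output above
|     return 0;
| }
`----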
The problem I'm facing is that, for a given dynamic slot, the default
POLICY:DESKTOP configuration defines CpuBusy as
(LoadAvg - CondorLoadAvg) > 0.5
but since NonCondorLoadAvg is capped at 1.0, a LoadAvg of 17.01 on a
16-Cpu slot (as in the example above) is not very informative: I will
get that same LoadAvg whether I'm running two extra non-condor
processes or any other number of them.
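To make that concrete, here is the same kind of back-of-the-envelope
sketch (again not HTCondor code, simplified to just the one dynamic
slot, with the per-slot cap of 1.0 taken from the loop quoted above):
whatever the real non-condor load is, the slot advertises at most 1.0
of it, so the CpuBusy test comes out the same.

,----
| // Sketch: the CpuBusy test on a 16-Cpu dynamic slot with 16.01 of condor load,
| // for different amounts of real non-condor load on the machine.
| #include <algorithm>
| #include <cstdio>
|
| int main() {
|     const double condor_load = 16.01;                         // 16 busy condor cores
|     const double real_owner_loads[] = {0.2, 2.0, 8.0, 20.0};  // real non-condor load
|     for (double real : real_owner_loads) {
|         double advertised = std::min(real, 1.0);              // per-slot cap of 1.0
|         double load_avg = condor_load + advertised;
|         bool cpu_busy = (load_avg - condor_load) > 0.5;       // the POLICY:DESKTOP test
|         std::printf("real owner load %5.2f -> LoadAvg %5.2f, CpuBusy = %d\n",
|                     real, load_avg, cpu_busy);
|     }
|     // anything above 1.0 of real non-condor load looks identical: LoadAvg 17.01
|     return 0;
| }
`----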
I thought of redefining CpuBusy, but before I go down that path (which
may break some other features) I was wondering if there is any advice
regarding this for multi-core slots?
Many thanks,
--
Ángel de Vicente -- (GPG: 0x64D9FDAE7CD5E939)
Research Software Engineer (Supercomputing and BigData)
Instituto de Astrofísica de Canarias (https://www.iac.es/en)