
Re: [HTCondor-users] LoadAvg values in PartitionableSlots expected?



Hello,

Angel de Vicente
<angel.vicente.garrido@xxxxxxxxx> writes:

> # condor_status  xxxx.xx.xxx.xx -af:h Name Totalcpus Cpus LoadAvg condorloadavg totalloadavg totalcondorloadavg
> Name                   Totalcpus             Cpus LoadAvg               condorloadavg         totalloadavg          totalcondorloadavg
> slot1@xxxxxxxxxxxxxx   32.0                  16   1.0                   0.0                   32.03                 16.01
> slot1_1@xxxxxxxxxxxxxx 32.0                  16   17.01                 16.01                 32.03                 16.01
>
> It looks as if the non-condor load in the LoadAvg variable is always
> capped at 1.0. Not sure if this is a bug or it is by design. If it is by
> design, what is the reasoning behind it?

Looking at the source code, I can see where this is coming from; as
the comment itself says, it doesn't make much sense for multi-core
slots, which are what I'm trying to configure.

 file: src/condor_startd.V6/ResMgr.cpp
,----
| // Distribute the owner load over the slots, assign an owner load of 1.0
| // to each slot until the remainder is less than 1.0.  then assign the remainder
| // to the next slot, and 0 to all of the remaining slots.
| // Note that before HTCondor 10.x we would assign *all* of the remainder to the last slot
| // even if the value was greater than 1.0, but other than that this algorithm is
| // the same as before.  This algorithm doesn't make a lot of sense for multi-core slots
| // but it's the way it has always worked so...
| for (Resource* rip : active) {
|         if (total_owner_load < 1.0) {
|                 rip->set_owner_load(total_owner_load);
|                 total_owner_load = 0;
|         } else {
|                 rip->set_owner_load(1.0);
|                 total_owner_load -= 1.0;
|         }
| }
`----


The problem I'm facing is that, for a given dynamic slot, the default
POLICY:DESKTOP configuration defines CpuBusy as

   (LoadAvg - CondorLoadAvg) > 0.5

but since the non-condor load is capped at 1.0, if my slot consists (as
in the example above) of 16 CPUs, a LoadAvg of 17.01 is not very
informative: I will see that same value whether I'm running two extra
non-condor processes or any other number of them.
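
For concreteness, the kind of redefinition I had in mind would be
something like the following in the startd configuration (a
hypothetical, untested sketch; and given the 1.0 cap shown above, the
scaled value could never exceed 1/Cpus on a multi-core slot, so by
itself this would not make CpuBusy meaningful either):

,----
| # Hypothetical sketch: scale the non-condor load by the slot size.
| # Untested; other policy expressions may also reference CpuBusy.
| CpuBusy = ( (LoadAvg - CondorLoadAvg) / Cpus > 0.5 )
`----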

I thought of redefining CpuBusy, but before I go down that path (which
might break some other features) I was wondering if there is any advice
regarding this for multi-core slots?

Many thanks,
-- 
Ángel de Vicente                 -- (GPG: 0x64D9FDAE7CD5E939)
 Research Software Engineer (Supercomputing and BigData)
 Instituto de Astrofísica de Canarias (https://www.iac.es/en)