Re: [HTCondor-users] automatically distributing GPUs into NUMA slots

As far as I can tell, there is no simple configuration option to help here. Looking into the code, it appears the constraint is not applied in either the auto or the percentage case of slot type configuration. I am not sure whether this is technically the correct behavior, because it is unclear what should happen when the constraints do not use all of the discovered resources.

I am inclined to think this should be allowed, with the EP advertising the resources (GPUs) left unused because of the constraints. However, that also means a misconfiguration could leave resources unused, and in today's climate I assume people don't want GPUs sitting around idle. I will have to discuss with the rest of the team whether the current behavior (auto/percentage not applying constraints) is correct.
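In the meantime, one possible stopgap is to generate the per-slot GPUS lines from the GPUs that are actually present, so that the requested count always matches what the constraint can bind. The following is only a minimal Python sketch under stated assumptions: you obtain the list of PCI bus IDs yourself, the names `SLOT_PATTERNS` and `render_gpu_lines` are hypothetical, and Python's `re.search` only approximates how the startd evaluates `Regexp`.

```python
import re

# Hypothetical mapping of slot type -> PCI bus ID pattern, copied from the
# configuration quoted below.
SLOT_PATTERNS = {
    1: r"0000:[04][89CD]:00.0",
    2: r"0000:[8C][89CD]:00.0",
}


def render_gpu_lines(bus_ids, patterns=SLOT_PATTERNS):
    """Return one 'GPUS = <n> : Regexp(...)' line per slot type.

    bus_ids is the list of PCI bus IDs of the GPUs currently present;
    how that list is obtained is left to the caller (assumption).
    """
    lines = {}
    for slot, pattern in patterns.items():
        # Count only the GPUs that are present and match this slot's pattern.
        count = sum(1 for bus_id in bus_ids if re.search(pattern, bus_id))
        lines[slot] = f'GPUS = {count} : Regexp("{pattern}", DevicePciBusId)'
    return lines
```

The returned lines would replace the fixed `GPUS = 8 : ...` entries in each SLOT_TYPE_<n> block; writing them into the configuration and reconfiguring the startd is left to your configuration management.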

Hi,

we just encountered a problem on our GPU execute points, where we create
one partitionable slot per NUMA node and want to put all GPUs from this
node into the respective slots.

Easy to do, as long as all machines are the same:

NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 @=slot
cpus = 16
ram = 45%
swap = 0%
GPUS = 8 : Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
@slot
SLOT_TYPE_1_PARTITIONABLE = True
SLOT1_CPU_AFFINITY = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

##### SLOT 2
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 @=slot
cpus = 16
ram = 45%
swap = 0%
GPUS = 8 : Regexp("0000:[8C][89CD]:00.0", DevicePciBusId)
@slot
SLOT_TYPE_2_PARTITIONABLE = True
SLOT2_CPU_AFFINITY = 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31

However, when a machine is (temporarily) missing GPUs because we have run
out of spares, condor does not start, as the configured number of GPUs no
longer matches.

I thought of fixing this by using 'auto', but unfortunately, this seems
to first divide the available GPUs evenly between both slots before taking
the Regexp into account, e.g. 6 each instead of 4 and 8, which then leads
to an error with backtrace:

05/05/26 08:23:37 Local machine resource GPUs = 12
05/05/26 08:23:37 Allocating auto shares for slot type 1: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: auto, GPUs: auto
05/05/26 08:23:37   slot type 1: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: 50.00%, GPUs: 6
05/05/26 08:23:37 Allocating auto shares for slot type 2: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: auto, GPUs: auto
05/05/26 08:23:37   slot type 2: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: 50.00%, GPUs: 6
05/05/26 08:23:37 bind slot DevIds tag=GPUs contraint=Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
05/05/26 08:23:37 (bt:7c75:8) slot Failed to bind local resource 'GPUs'
         Backtrace bt:7c75:8 is
         condor_startd(_ZN13CpuAttributes11bind_DevIdsEP14MachAttributesiibb+0xa3e) [0x561e1689eabe]
         condor_startd(_Z13buildCpuAttrsP14MachAttributesiRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EEiPiPbb+0xa80) [0x561e168d1fa0]
         condor_startd(_ZN6ResMgr14init_resourcesEv+0x27e) [0x561e168ae67e]
         condor_startd(_Z9main_initiPPc+0x3a6) [0x561e168dd006]
         /lib/libcondor_utils_24_0_3.so(_Z7dc_mainiPPc+0x17ee) [0x7ff7c4e8e60e]
         /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7ff7c444524a]
         /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7ff7c4445305]
         condor_startd(_start+0x21) [0x561e168743a1]
05/05/26 08:23:37 ERROR "Failed to bind local resource 'GPUs'" at line 1913 in file ./src/condor_startd.V6/ResAttributes.cpp
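
The mismatch in the log above comes down to simple arithmetic, which the following Python sketch reproduces. It is only an illustration, not HTCondor's actual allocation code: the even split mirrors what the log reports (12 GPUs, 6 per slot type), the function name `auto_split_vs_matches` is hypothetical, and Python's `re.search` stands in for the startd's Regexp evaluation.

```python
import re


def auto_split_vs_matches(bus_ids, patterns):
    """Compare an even split of all GPUs with the per-slot regex matches.

    Returns, per slot type, the evenly split share, the number of GPUs
    matching that slot's pattern, and whether the share could be bound.
    """
    even_share = len(bus_ids) // len(patterns)
    report = {}
    for slot, pattern in patterns.items():
        matching = sum(1 for bus_id in bus_ids if re.search(pattern, bus_id))
        report[slot] = {
            "requested": even_share,
            "matching": matching,
            # Binding can only work if enough GPUs satisfy the constraint.
            "bindable": matching >= even_share,
        }
    return report
```

With 12 GPUs of which only 4 match the first pattern, slot type 1 is asked for 6 but can bind at most 4, which corresponds to the "Failed to bind local resource 'GPUs'" error.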

(HTCondor 24.0.3-1+deb12 if relevant)

I can certainly fix this with some configuration management scripting
around it, but I wondered whether there is a knob for that I am overlooking.

Is there?

Cheers and thanks a lot in advance!

Carsten
--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185