
Re: [HTCondor-users] automatically distributing GPUs into NUMA slots



Hi Dr. Carsten,

As far as I can tell, there is no simple configuration option to assist here. Looking at the code, the constraint does not appear to be applied in either the auto or the percentage case of slot type configuration. I am not sure whether this is technically the correct behavior, because it is unclear what should happen when the constraints don't use all of the discovered resources. I am inclined to think this should be allowed, with the EP advertising the resources (GPUs) left unused due to constraints, but that also opens the door to resources going unused because of a misconfiguration, and in today's climate I assume people don't want GPUs sitting around unused. I will have to discuss with the rest of the team whether the current behavior (auto/percentage not applying constraints) is correct.

As for potential solutions, I think the only way is a script, executed via configuration, that sets referable configuration macros with the exact GPU counts. Something like:

#### Config sample (not fully copy and pastable)
# Run condor_gpu_discovery at configuration time
if $(IsStartd)
   include command : numa_gpus.sh
endif

NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 @=slot
  ...
  GPUS = $(NUMA_GPU_REGEX_1) : Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
@slot

##### SLOT 2
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 @=slot
  ...
  GPUS = $(NUMA_GPU_REGEX_2) : Regexp("0000:[8C][89CD]:00.0", DevicePciBusId)
@slot
#### End sample

Where numa_gpus.sh runs condor_gpu_discovery -extra and applies the same regexes to the DevicePciBusId sub-attribute to get the per-slot counts, which it returns as NUMA_GPU_REGEX_[1/2] for injection into the configuration.
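
For illustration, here is a rough Python sketch of what such a helper could look like. It is untested against a real EP and rests on assumptions: the file name numa_gpus.py, the function names, and above all the idea that every GPU's bus id appears in the condor_gpu_discovery -extra output as DevicePciBusId = "..." are mine, so check the actual output on your machines first. Python's re.search is only used as a stand-in for the ClassAd Regexp() call; for these simple patterns I would expect the two to agree.

```python
#!/usr/bin/env python3
"""Hypothetical stand-in for numa_gpus.sh: print per-NUMA-node GPU counts
as configuration macros for use with 'include command'."""
import re
import subprocess
import sys

# Same patterns as in the SLOT_TYPE_<N> definitions; keep the two in sync.
NUMA_PATTERNS = {
    "NUMA_GPU_REGEX_1": r"0000:[04][89CD]:00.0",
    "NUMA_GPU_REGEX_2": r"0000:[8C][89CD]:00.0",
}

# ASSUMPTION: each GPU's bus id shows up in the discovery output as
# DevicePciBusId = "<id>" (whitespace around '=' optional).
BUS_ID_RE = re.compile(r'DevicePciBusId\s*=\s*"([^"]+)"')


def bus_ids(discovery_output):
    """Return every PCI bus id found in the discovery output, in order."""
    return BUS_ID_RE.findall(discovery_output)


def render_config(discovery_output, patterns=NUMA_PATTERNS):
    """Build one 'MACRO = count' line per NUMA pattern."""
    ids = bus_ids(discovery_output)
    lines = []
    for macro, pattern in patterns.items():
        count = sum(1 for bus_id in ids if re.search(pattern, bus_id))
        lines.append(f"{macro} = {count}")
    return "\n".join(lines)


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["condor_gpu_discovery", "-extra"],
            capture_output=True, text=True, check=True,
        )
        output = result.stdout
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # Design choice in this sketch: report zero GPUs instead of
        # aborting configuration when discovery cannot be run.
        print(f"numa_gpus: GPU discovery failed: {exc}", file=sys.stderr)
        output = ""
    print(render_config(output))
```

The printed lines (one "NUMA_GPU_REGEX_<N> = <count>" per slot type) are what the include command would pull into the configuration. Falling back to zero GPUs when discovery fails is just one option; failing loudly may be preferable on your side.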

Hopefully this makes sense,
Cole Bollig


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Carsten Aulbert via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, May 5, 2026 3:52 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
Subject: [HTCondor-users] automatically distributing GPUs into NUMA slots
 
Hi,

we just encountered a problem on our GPU execute points, where we create
one partitionable slot per NUMA node and want to put all GPUs from this
node into the respective slots.

Easy to do, as long as all machines are the same:

NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 @=slot
  cpus = 16
  ram = 45%
  swap = 0%
  GPUS = 8 : Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
@slot
SLOT_TYPE_1_PARTITIONABLE = True
SLOT1_CPU_AFFINITY = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

##### SLOT 2
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 @=slot
  cpus = 16
  ram = 45%
  swap = 0%
  GPUS = 8 : Regexp("0000:[8C][89CD]:00.0", DevicePciBusId)
@slot
SLOT_TYPE_2_PARTITIONABLE = True
SLOT2_CPU_AFFINITY = 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31

However, when (temporarily) running out of GPU spares, this leads to
condor not starting as the number of GPUs is wrong.

I thought of fixing this by using 'auto', but unfortunately, this seems
to first divide the available GPUs into both slots, before taking the
Regexp into account, e.g. 6 each instead of 4 and 8, which then leads to
an error with backtrace:

05/05/26 08:23:37 Local machine resource GPUs = 12
05/05/26 08:23:37 Allocating auto shares for slot type 1: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: auto, GPUs: auto
05/05/26 08:23:37   slot type 1: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: 50.00%, GPUs: 6
05/05/26 08:23:37 Allocating auto shares for slot type 2: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: auto, GPUs: auto
05/05/26 08:23:37   slot type 2: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: 50.00%, GPUs: 6
05/05/26 08:23:37 bind slot DevIds tag=GPUs contraint=Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
05/05/26 08:23:37 (bt:7c75:8) slot Failed to bind local resource 'GPUs'
         Backtrace bt:7c75:8 is
         condor_startd(_ZN13CpuAttributes11bind_DevIdsEP14MachAttributesiibb+0xa3e) [0x561e1689eabe]
         condor_startd(_Z13buildCpuAttrsP14MachAttributesiRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EEiPiPbb+0xa80) [0x561e168d1fa0]
         condor_startd(_ZN6ResMgr14init_resourcesEv+0x27e) [0x561e168ae67e]
         condor_startd(_Z9main_initiPPc+0x3a6) [0x561e168dd006]
         /lib/libcondor_utils_24_0_3.so(_Z7dc_mainiPPc+0x17ee) [0x7ff7c4e8e60e]
         /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7ff7c444524a]
         /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7ff7c4445305]
         condor_startd(_start+0x21) [0x561e168743a1]
05/05/26 08:23:37 ERROR "Failed to bind local resource 'GPUs'" at line 1913 in file ./src/condor_startd.V6/ResAttributes.cpp

(HTCondor 24.0.3-1+deb12 if relevant)

I can certainly fix this with some configuration management scripting
around it, but I wondered whether there is a knob for that I am overlooking.

Is there?

Cheers and thanks a lot in advance!

Carsten
--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185