Hi Dr. Carsten,
As far as I can tell, there is no simple configuration option to assist here. Looking into the code, it appears the constraint is not applied in either the auto or the percentage case for slot type configuration. I am not sure whether this is technically the correct behavior, because it is unclear what the correct behavior is when the constraints don't use all of the discovered resources. I am inclined to think that this should be allowed, with the EP advertising the resources (GPUs) left unused because of the constraints, but that also means a misconfiguration could leave resources unused, and in today's climate I assume people don't want GPUs sitting around unused. I will have to discuss with the rest of the team to determine whether this (auto/percentage not applying constraints) is the correct behavior.
As for potential solutions, I think the only way is a script, executed at configuration time, that sets referenceable configuration macros with the exact GPU counts. Something like:
#### Config sample (not fully copy and pastable)
# Run condor_gpu_discovery at configuration time
if $(IsStartd)
include command : numa_gpus.sh
endif
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 @=slot
...
GPUS = $(NUMA_GPU_REGEX_1) : Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
@slot
##### SLOT 2
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 @=slot
...
GPUS = $(NUMA_GPU_REGEX_2) : Regexp("0000:[8C][89CD]:00.0", DevicePciBusId)
@slot
#### End sample
Where numa_gpus.sh runs condor_gpu_discovery -extra and applies the same regexes to the DevicePciBusId sub-attribute to get the counts for NUMA_GPU_REGEX_1 and NUMA_GPU_REGEX_2, which it prints so they are injected into the configuration.
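To make the idea concrete, here is a rough sketch of what such a helper could look like, written in Python rather than shell (the include command line would then need to point at it instead of numa_gpus.sh). It rests on assumptions I have not verified against a real machine: that condor_gpu_discovery -extra prints each GPU's bus ID as DevicePciBusId="..." somewhere in its output, and that a plain regex search behaves the same as the Regexp() constraint for these patterns. The function names count_gpus and render_config are made up for illustration.

```python
import re
import shutil
import subprocess

# Patterns copied from the slot type constraints; the macro names are the
# ones referenced in the config sample.
SLOT_PATTERNS = {
    "NUMA_GPU_REGEX_1": r"0000:[04][89CD]:00.0",
    "NUMA_GPU_REGEX_2": r"0000:[8C][89CD]:00.0",
}

# Assumption: every GPU's bus ID shows up as DevicePciBusId="..." in the
# discovery output (optionally with spaces around the '=').
BUS_ID = re.compile(r'DevicePciBusId\s*=\s*"([^"]+)"')


def count_gpus(discovery_output, patterns=SLOT_PATTERNS):
    """Count how many discovered bus IDs match each slot's pattern."""
    bus_ids = BUS_ID.findall(discovery_output)
    return {
        name: sum(1 for bus_id in bus_ids if re.search(pattern, bus_id))
        for name, pattern in patterns.items()
    }


def render_config(counts):
    """Format the counts as 'MACRO = value' configuration lines."""
    return "\n".join(f"{name} = {count}" for name, count in counts.items())


def main():
    # Run discovery and print the macros on stdout, where the
    # 'include command' line picks them up as configuration.
    result = subprocess.run(
        ["condor_gpu_discovery", "-extra"],
        capture_output=True, text=True, check=True,
    )
    print(render_config(count_gpus(result.stdout)))


# Only run discovery when the tool is actually on PATH, so the functions
# above can also be imported and tested on a machine without HTCondor.
if __name__ == "__main__" and shutil.which("condor_gpu_discovery"):
    main()
```

On a node where four GPUs match the first pattern and eight match the second, this would print NUMA_GPU_REGEX_1 = 4 and NUMA_GPU_REGEX_2 = 8, so each slot type requests exactly the number of GPUs its constraint can bind.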
Hopefully this makes sense,
Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Carsten Aulbert via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, May 5, 2026 3:52 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
Subject: [HTCondor-users] automatically distributing GPUs into NUMA slots
Hi,
we just encountered a problem on our GPU execute points, where we create one partitionable slot per NUMA node and want to put all GPUs from this node into the respective slots. Easy to do, as long as all machines are the same:
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 @=slot
cpus = 16
ram = 45%
swap = 0%
GPUS = 8 : Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
@slot
SLOT_TYPE_1_PARTITIONABLE = True
SLOT1_CPU_AFFINITY = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
##### SLOT 2
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 @=slot
cpus = 16
ram = 45%
swap = 0%
GPUS = 8 : Regexp("0000:[8C][89CD]:00.0", DevicePciBusId)
@slot
SLOT_TYPE_2_PARTITIONABLE = True
SLOT2_CPU_AFFINITY = 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
However, when (temporarily) running out of GPU spares, this leads to condor not starting as the number of GPUs is wrong. I thought of fixing this by using 'auto', but unfortunately, this seems to first divide the available GPUs into both slots, before taking the Regexp into account, e.g. 6 each instead of 4 and 8, which then leads to an error with backtrace:
05/05/26 08:23:37 Local machine resource GPUs = 12
05/05/26 08:23:37 Allocating auto shares for slot type 1: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: auto, GPUs: auto
05/05/26 08:23:37 slot type 1: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: 50.00%, GPUs: 6
05/05/26 08:23:37 Allocating auto shares for slot type 2: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: auto, GPUs: auto
05/05/26 08:23:37 slot type 2: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: 50.00%, GPUs: 6
05/05/26 08:23:37 bind slot DevIds tag=GPUs contraint=Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
05/05/26 08:23:37 (bt:7c75:8) slot Failed to bind local resource 'GPUs'
Backtrace bt:7c75:8 is
condor_startd(_ZN13CpuAttributes11bind_DevIdsEP14MachAttributesiibb+0xa3e) [0x561e1689eabe]
condor_startd(_Z13buildCpuAttrsP14MachAttributesiRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EEiPiPbb+0xa80) [0x561e168d1fa0]
condor_startd(_ZN6ResMgr14init_resourcesEv+0x27e) [0x561e168ae67e]
condor_startd(_Z9main_initiPPc+0x3a6) [0x561e168dd006]
/lib/libcondor_utils_24_0_3.so(_Z7dc_mainiPPc+0x17ee) [0x7ff7c4e8e60e]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7ff7c444524a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7ff7c4445305]
condor_startd(_start+0x21) [0x561e168743a1]
05/05/26 08:23:37 ERROR "Failed to bind local resource 'GPUs'" at line 1913 in file ./src/condor_startd.V6/ResAttributes.cpp
(HTCondor 24.0.3-1+deb12 if relevant)
I can certainly fix this with some configuration management scripting around it, but I wondered whether there is a knob for that I am overlooking. Is there?
Cheers and thanks a lot in advance!
Carsten
--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics, Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185