Hi,

we just encountered a problem on our GPU execute points, where we create one partitionable slot per NUMA node and want to put all GPUs attached to that node into the respective slot.
This is easy to do as long as all machines are the same:
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 @=slot
cpus = 16
ram = 45%
swap = 0%
GPUS = 8 : Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
@slot
SLOT_TYPE_1_PARTITIONABLE = True
SLOT1_CPU_AFFINITY = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
##### SLOT 2
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 @=slot
cpus = 16
ram = 45%
swap = 0%
GPUS = 8 : Regexp("0000:[8C][89CD]:00.0", DevicePciBusId)
@slot
SLOT_TYPE_2_PARTITIONABLE = True
SLOT2_CPU_AFFINITY = 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
However, when we (temporarily) run out of spare GPUs and a machine has fewer GPUs than configured, condor fails to start because the number of GPUs is wrong.
I thought of fixing this by using 'auto', but unfortunately this seems to divide the available GPUs evenly between both slots before taking the Regexp into account (e.g. 6 each instead of 4 and 8), which then leads to an error with a backtrace:
05/05/26 08:23:37 Local machine resource GPUs = 12
05/05/26 08:23:37 Allocating auto shares for slot type 1: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: auto, GPUs: auto
05/05/26 08:23:37 slot type 1: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: 50.00%, GPUs: 6
05/05/26 08:23:37 Allocating auto shares for slot type 2: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: auto, GPUs: auto
05/05/26 08:23:37 slot type 2: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: 50.00%, GPUs: 6
05/05/26 08:23:37 bind slot DevIds tag=GPUs contraint=Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
05/05/26 08:23:37 (bt:7c75:8) slot Failed to bind local resource 'GPUs'
Backtrace bt:7c75:8 is
condor_startd(_ZN13CpuAttributes11bind_DevIdsEP14MachAttributesiibb+0xa3e) [0x561e1689eabe]
condor_startd(_Z13buildCpuAttrsP14MachAttributesiRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EEiPiPbb+0xa80) [0x561e168d1fa0]
condor_startd(_ZN6ResMgr14init_resourcesEv+0x27e) [0x561e168ae67e]
condor_startd(_Z9main_initiPPc+0x3a6) [0x561e168dd006]
/lib/libcondor_utils_24_0_3.so(_Z7dc_mainiPPc+0x17ee) [0x7ff7c4e8e60e]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7ff7c444524a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7ff7c4445305]
condor_startd(_start+0x21) [0x561e168743a1]
05/05/26 08:23:37 ERROR "Failed to bind local resource 'GPUs'" at line 1913 in file ./src/condor_startd.V6/ResAttributes.cpp
(HTCondor 24.0.3-1+deb12, if relevant)

I can certainly fix this with some configuration management scripting around it, but I wondered whether there is a knob for this that I am overlooking.
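
To make the scripted workaround concrete, here is a minimal sketch of what it might look like. It only counts how many of the GPUs actually present match each slot's pattern and renders the matching `GPUS = N : Regexp(...)` line. The function names, the `SLOT_PATTERNS` table and the example bus IDs are hypothetical; how the list of bus IDs is obtained on the node is left out, and Python's `re.search` is assumed to behave closely enough to the startd's `Regexp` for these simple patterns.

```python
import re

# Patterns copied from the slot type definitions above (one per NUMA node).
SLOT_PATTERNS = {
    1: r"0000:[04][89CD]:00.0",
    2: r"0000:[8C][89CD]:00.0",
}

# Hypothetical bus IDs for a node with 12 GPUs: 4 on the first NUMA node,
# 8 on the second. Real values would come from the machine itself.
EXAMPLE_BUS_IDS = [
    "0000:08:00.0", "0000:09:00.0", "0000:48:00.0", "0000:49:00.0",
    "0000:88:00.0", "0000:89:00.0", "0000:8C:00.0", "0000:8D:00.0",
    "0000:C8:00.0", "0000:C9:00.0", "0000:CC:00.0", "0000:CD:00.0",
]


def count_gpus_per_slot(bus_ids, patterns=SLOT_PATTERNS):
    """Return {slot_type: number of bus IDs matching that slot's pattern}."""
    counts = {}
    for slot_type, pattern in patterns.items():
        rx = re.compile(pattern)
        counts[slot_type] = sum(1 for bus_id in bus_ids if rx.search(bus_id))
    return counts


def render_gpu_lines(bus_ids, patterns=SLOT_PATTERNS):
    """Render the GPUS line of each slot type with the counted number."""
    counts = count_gpus_per_slot(bus_ids, patterns)
    return {
        slot_type: 'GPUS = {} : Regexp("{}", DevicePciBusId)'.format(
            counts[slot_type], pattern
        )
        for slot_type, pattern in patterns.items()
    }
```

With the example list, `count_gpus_per_slot(EXAMPLE_BUS_IDS)` gives 4 GPUs for slot type 1 and 8 for slot type 2, i.e. the 4 and 8 split that 'auto' does not produce. The rendered lines would then be templated into the `SLOT_TYPE_n` blocks by whatever configuration management is in use.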
Is there such a knob?

Cheers and thanks a lot in advance!

Carsten

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185