
[HTCondor-users] automatically distributing GPUs into NUMA slots



Hi,

we just encountered a problem on our GPU execute points, where we create one partitionable slot per NUMA node and want to put all GPUs from this node into the respective slots.

Easy to do, as long as all machines are the same:

NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 @=slot
 cpus = 16
 ram = 45%
 swap = 0%
 GPUS = 8 : Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
@slot
SLOT_TYPE_1_PARTITIONABLE = True
SLOT1_CPU_AFFINITY = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

##### SLOT 2
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 @=slot
 cpus = 16
 ram = 45%
 swap = 0%
 GPUS = 8 : Regexp("0000:[8C][89CD]:00.0", DevicePciBusId)
@slot
SLOT_TYPE_2_PARTITIONABLE = True
SLOT2_CPU_AFFINITY = 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31

However, when we (temporarily) run out of spare GPUs and a machine ends up with fewer GPUs than configured, condor does not start because the number of GPUs is wrong.

I thought of fixing this by using 'auto', but unfortunately this seems to first divide the available GPUs evenly between both slots before taking the Regexp into account (e.g. 6 each instead of 4 and 8), which then leads to an error with a backtrace:

05/05/26 08:23:37 Local machine resource GPUs = 12
05/05/26 08:23:37 Allocating auto shares for slot type 1: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: auto, GPUs: auto
05/05/26 08:23:37 slot type 1: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: 50.00%, GPUs: 6
05/05/26 08:23:37 Allocating auto shares for slot type 2: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: auto, GPUs: auto
05/05/26 08:23:37 slot type 2: Cpus: 16.000000, Memory: 232096, Swap: 0.00%, Disk: 50.00%, GPUs: 6
05/05/26 08:23:37 bind slot DevIds tag=GPUs contraint=Regexp("0000:[04][89CD]:00.0", DevicePciBusId)
05/05/26 08:23:37 (bt:7c75:8) slot Failed to bind local resource 'GPUs'
        Backtrace bt:7c75:8 is
        condor_startd(_ZN13CpuAttributes11bind_DevIdsEP14MachAttributesiibb+0xa3e) [0x561e1689eabe]
        condor_startd(_Z13buildCpuAttrsP14MachAttributesiRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EEiPiPbb+0xa80) [0x561e168d1fa0]
        condor_startd(_ZN6ResMgr14init_resourcesEv+0x27e) [0x561e168ae67e]
        condor_startd(_Z9main_initiPPc+0x3a6) [0x561e168dd006]
        /lib/libcondor_utils_24_0_3.so(_Z7dc_mainiPPc+0x17ee) [0x7ff7c4e8e60e]
        /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7ff7c444524a]
        /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7ff7c4445305]
        condor_startd(_start+0x21) [0x561e168743a1]
05/05/26 08:23:37 ERROR "Failed to bind local resource 'GPUs'" at line 1913 in file ./src/condor_startd.V6/ResAttributes.cpp

(HTCondor 24.0.3-1+deb12 if relevant)

I can certainly work around this with some configuration management scripting, but I wondered whether there is a knob for this that I am overlooking.

Is there?
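
For illustration, the scripted workaround mentioned above could look roughly like the following minimal sketch. It is not an HTCondor feature: the function names and the example bus ID list are hypothetical, the list of PCI bus IDs is assumed to have been collected beforehand by whatever discovery tooling is in use, and Python's re.search is only assumed to match the same IDs as the ClassAd Regexp() call. It counts the GPUs actually present per NUMA pattern and emits the matching GPUS line for each slot type.

```python
import re

# Patterns copied from the two SLOT_TYPE definitions; slot type number -> regex.
NODE_PATTERNS = {
    1: r"0000:[04][89CD]:00.0",
    2: r"0000:[8C][89CD]:00.0",
}


def count_gpus_per_slot(bus_ids, patterns=NODE_PATTERNS):
    """Count how many of the given PCI bus IDs match each slot's pattern."""
    return {
        slot: sum(1 for bus_id in bus_ids if re.search(pattern, bus_id))
        for slot, pattern in patterns.items()
    }


def render_gpu_lines(bus_ids, patterns=NODE_PATTERNS):
    """Return the ' GPUS = N : Regexp(...)' line for each slot type."""
    counts = count_gpus_per_slot(bus_ids, patterns)
    return {
        slot: f' GPUS = {counts[slot]} : Regexp("{pattern}", DevicePciBusId)'
        for slot, pattern in patterns.items()
    }


# Hypothetical machine: NUMA node 1 has only four GPUs left, node 2 is
# complete, giving 12 GPUs in total.
EXAMPLE_BUS_IDS = [
    "0000:08:00.0", "0000:09:00.0", "0000:48:00.0", "0000:49:00.0",
    "0000:88:00.0", "0000:89:00.0", "0000:8C:00.0", "0000:8D:00.0",
    "0000:C8:00.0", "0000:C9:00.0", "0000:CC:00.0", "0000:CD:00.0",
]

if __name__ == "__main__":
    for slot, line in sorted(render_gpu_lines(EXAMPLE_BUS_IDS).items()):
        print(f"SLOT_TYPE_{slot}:{line}")
```

With the example list this yields a count of 4 for slot type 1 and 8 for slot type 2, i.e. the split that 'auto' does not produce. The generated lines would still have to be written into the slot type definitions by the configuration management system.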

Cheers and thanks a lot in advance!

Carsten
--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185
