
Re: [HTCondor-users] automatically distributing GPUs into NUMA slots



Hi Cole,

sorry for the late reply.

On 5/7/26 18:20, Cole Bollig wrote:
> As for potential solutions, I think the only way is a script that is executed via configuration to set referable configuration macros with the exact GPU counts. Something like:

Yeah, I figured as much :).

As a GPU may break after many days of operation, I think I'll have a timer-driven external script query the GPU state, update a config snippet, and then run condor_reconfig. That way, I should be able to ensure that new jobs only arrive on GPUs that are still working.
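
Roughly what I have in mind, as a minimal Python sketch rather than a tested setup. Assumptions: a GPU counts as healthy as long as nvidia-smi still lists it, the GPU-to-NUMA mapping is known in advance and passed in as a dict, and the macro names NUMA<n>_GPU_COUNT are purely hypothetical placeholders for whatever the slot definitions end up referencing. I have also not verified that a condor_reconfig alone makes the startd pick up changed slot definitions.

```python
import os
import subprocess
import tempfile
from collections import Counter


def parse_healthy_gpus(nvidia_smi_output):
    """Parse GPU indices from `nvidia-smi --query-gpu=index --format=csv,noheader`."""
    return [int(line) for line in nvidia_smi_output.splitlines()
            if line.strip().isdigit()]


def render_snippet(healthy, gpu_to_numa):
    """Render one (hypothetical) NUMA<n>_GPU_COUNT macro per NUMA node."""
    counts = Counter(gpu_to_numa[i] for i in healthy if i in gpu_to_numa)
    nodes = sorted(set(gpu_to_numa.values()))
    return "".join(f"NUMA{n}_GPU_COUNT = {counts.get(n, 0)}\n" for n in nodes)


def write_atomically(path, text):
    """Write via a temp file and rename so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp, 0o644)  # mkstemp creates the file with mode 0600
    os.replace(tmp, path)


def refresh(path, gpu_to_numa):
    """Query the GPUs, rewrite the snippet, then ask HTCondor to reconfigure."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True, check=False)
    healthy = parse_healthy_gpus(result.stdout)
    write_atomically(path, render_snippet(healthy, gpu_to_numa))
    subprocess.run(["condor_reconfig"], check=True)
```

The timer (cron or a systemd timer) would then simply call refresh() with the path of a file inside the local config directory.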

Another quick, somewhat related question:

So far, we used SLOT<N>_CPU_AFFINITY to enforce NUMA boundaries between partitionable slots (at least that's what I think this is doing). We do that as all our processing is pretty sensitive to bandwidth and we do not want to move vast amounts of data between CPUs.
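
For context, our current lines are generated roughly like the following Python sketch. Assumptions: one partitionable slot per NUMA node, with slot N mapped to node N-1 (slots count from 1, cores from 0), and the cpulist strings taken from /sys/devices/system/node/node*/cpulist. As far as I understand the manual, SLOT<N>_CPU_AFFINITY only takes effect together with ENFORCE_CPU_AFFINITY = True, so the sketch emits that line as well.

```python
def expand_cpulist(cpulist):
    """Expand a Linux cpulist string such as "0-3,8-11" into core numbers."""
    cores = []
    for part in cpulist.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        # A single core like "5" has no "-", so `last` is empty.
        cores.extend(range(int(first), int(last or first) + 1))
    return cores


def affinity_lines(node_cpulists):
    """Build config lines pinning slot N to the cores of NUMA node N-1."""
    lines = ["ENFORCE_CPU_AFFINITY = True"]
    for slot, cpulist in enumerate(node_cpulists, start=1):
        cores = ",".join(str(core) for core in expand_cpulist(cpulist))
        lines.append(f"SLOT{slot}_CPU_AFFINITY = {cores}")
    return "\n".join(lines) + "\n"
```

For a two-node machine, affinity_lines(["0-3", "4-7"]) yields one SLOT1_CPU_AFFINITY and one SLOT2_CPU_AFFINITY line with the respective core lists.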

However, I just saw
"
This configuration variable is replaced by ASSIGN_CPU_AFFINITY. Do not enable this configuration variable unless using glidein or another unusual setup.
"

but as ASSIGN_CPU_AFFINITY is a simple boolean, how can I ensure that certain CPU cores are only used by a specific slot?

Cheers and thanks a lot in advance.

Carsten


--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185
