
Re: [HTCondor-users] automatically distributing GPUs into NUMA slots

Date: Mon, 11 May 2026 13:57:31 +0200
From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
Subject: Re: [HTCondor-users] automatically distributing GPUs into NUMA slots

Hi Cole,

sorry for the late reply.

On 5/7/26 18:20, Cole Bollig wrote:

As for potential solutions, I think the only way is a script that is executed via configuration to set referable configuration macros with the exact GPU counts. Something like:


Yeah, I figured as much :).
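For the archive, here is a minimal sketch of such a script. It assumes NVIDIA GPUs, `nvidia-smi` on the PATH and the usual Linux sysfs layout. It counts the GPUs that `nvidia-smi` still lists per NUMA node and prints one macro per node. The `NUMA<N>_GPU_COUNT` names are hypothetical and would still have to be referenced from the slot type definitions. The script does not check GPU health beyond the GPU being listed at all.

```python
import subprocess
from collections import Counter
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci/devices")


def parse_bus_ids(csv_text: str) -> list[str]:
    """Turn `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` output
    into sysfs-style PCI addresses (00000000:3B:00.0 -> 0000:3b:00.0)."""
    ids = []
    for line in csv_text.splitlines():
        line = line.strip()
        if line:
            # nvidia-smi prints an 8-digit PCI domain; sysfs uses 4 lowercase digits.
            ids.append(line[-12:].lower())
    return ids


def numa_node(bus_id: str, sysfs_root: Path = SYSFS_PCI) -> int:
    """Read the NUMA node the kernel reports for a PCI device.

    The kernel reports -1 when it does not know (e.g. single-socket boxes);
    this sketch assumes node 0 in that case."""
    node = int((sysfs_root / bus_id / "numa_node").read_text().strip())
    return max(node, 0)


def render_config(counts: Counter) -> str:
    """Emit one (hypothetical) NUMA<N>_GPU_COUNT macro per NUMA node."""
    return "".join(f"NUMA{node}_GPU_COUNT = {counts[node]}\n" for node in sorted(counts))


def main() -> None:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter(numa_node(bus_id) for bus_id in parse_bus_ids(out))
    print(render_config(counts), end="")


if __name__ == "__main__":
    main()
```

If I read the manual correctly, HTCondor can run such a script at configuration time via `include command : /path/to/script`, so the counts would be re-evaluated on every reconfig.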

Since a GPU may fail after many days of operation, I plan to query the GPU state periodically via a timer, have an external script update a config snippet, and then run condor_reconfig. That way, new jobs should only land on GPUs that are still working.
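The timer part could look roughly like this (a sketch, not tested on a real pool). It only rewrites the snippet and calls `condor_reconfig` when the content actually changed, and it replaces the file atomically so the daemons never read a half-written snippet. The snippet path and calling this from a systemd timer or cron job are my assumptions.

```python
import os
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def run_condor_reconfig() -> None:
    # Ask the local HTCondor daemons to re-read their configuration.
    subprocess.run(["condor_reconfig"], check=True)


def update_snippet(path: Path, content: str,
                   reconfig: Callable[[], None] = run_condor_reconfig) -> bool:
    """Write `content` to `path` and trigger a reconfig, but only if it changed.

    Returns True if the file was rewritten (and reconfig was called)."""
    try:
        if path.read_text() == content:
            return False  # nothing changed: skip the reconfig entirely
    except FileNotFoundError:
        pass  # first run: the snippet does not exist yet
    # Write to a temporary file in the same directory, then rename it into place.
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.chmod(tmp, 0o644)  # mkstemp creates 0600; keep the config world-readable
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp)
        raise
    reconfig()
    return True
```

A timer job would then call something like `update_snippet(Path("/etc/condor/config.d/99-gpu-numa.conf"), "NUMA0_GPU_COUNT = 2\n")`, where the path and macro name are only placeholders.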


Another quick kind of related question:

So far, we have used SLOT<N>_CPU_AFFINITY to enforce NUMA boundaries between partitionable slots (at least, that's what I think it does). We do this because all our processing is quite sensitive to bandwidth, and we do not want to move vast amounts of data between CPUs.
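For reference, the per-node core lists we feed into SLOT<N>_CPU_AFFINITY can be derived from sysfs. A minimal sketch, assuming the Linux `/sys/devices/system/node/node<N>/cpulist` files and that slot N+1 is the slot meant for NUMA node N (that mapping is an assumption about our setup, not something HTCondor prescribes):

```python
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel cpulist such as '0-3,8,10-11' into [0, 1, 2, 3, 8, 10, 11]."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def affinity_lines(node_root: Path = NODE_ROOT) -> str:
    """One SLOT<N>_CPU_AFFINITY line per NUMA node (assumed mapping: slot N+1 <- node N)."""
    nodes = sorted(int(p.name[4:]) for p in node_root.glob("node[0-9]*"))
    lines = []
    for node in nodes:
        cpus = parse_cpulist((node_root / f"node{node}" / "cpulist").read_text())
        lines.append(f"SLOT{node + 1}_CPU_AFFINITY = {','.join(map(str, cpus))}\n")
    return "".join(lines)


if __name__ == "__main__":
    print(affinity_lines(), end="")
```

This only reproduces what we do today; it does not answer whether that approach is still the recommended one.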


However, I just saw this in the manual:

"This configuration variable is replaced by ASSIGN_CPU_AFFINITY. Do not enable this configuration variable unless using glidein or another unusual setup."

But as ASSIGN_CPU_AFFINITY is a simple boolean, how can I ensure that certain CPU cores are only used by a specific slot?


Cheers and thanks a lot in advance.

Carsten


--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185

