
Re: [HTCondor-users] GPU sort order, Re: Make only one GPU available to HTCondor?



If a GPU is not working, or you do not want it to be used for some other reason, add its ID to the OFFLINE_GPUS
configuration knob.
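
As a minimal sketch, assuming the GPU to take out of service is reported in DetectedGPUs as GPU-0123abcd (a hypothetical ID; substitute the real one from the execute node), the setting would look like this:

```
# Hypothetical ID: use the value listed in DetectedGPUs on the execute node
OFFLINE_GPUS = GPU-0123abcd
```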

If you want to control which GPUs are bound to which slots at configuration time, the manual describes how in its
section on configuring GPUs. For example:


# Create one slot of type 2, bound to the GPU whose ID is GPU-6a96bd13
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 @=slot
   GPUs = 1 : "GPU-6a96bd13"
   CPUs = 1
   Memory = auto
@slot




From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
Sent: Thursday, February 6, 2025 6:13 AM
To: HTCondor Users Mailinglist <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] GPU sort order, Re: Make only one GPU available to HTCondor?
 
Good morning/afternoon/...,

for current reasons, I have to dig this out again; there were no responses
last year:

On Mon, 2024-07-29 at 16:22:42 +0200, Steffen Grunewald wrote:
> Hi all,
>
> the subject says it: We want to make a single GPU of a particular machine
> available to HTCondor. How do I select a specific ID, or just the last in
> the "DetectedGPUs" list created by condor_gpu_discovery?

It turned out that HTCondor doesn't seem to follow the order in DetectedGPUs
(which matches the order returned by `nvidia-smi`, which in turn seems to
match the `gpustat` output); instead, it orders the GPUs by their UUIDs
(at least if the models are the same?).

This makes a huge difference when assigning, e.g., 7 GPUs to a disabled slot
and the remaining one to an active slot: in our case, GPU #4 was used (out of
#0..#7, not the one users would see as #7), much to the surprise of both the
non-HTCondor and the HTCondor user when the clash occurred.

Can this be avoided, i.e., can I select a GPU (by its UUID, or bus ID, or
anything else) to "put into a slot"?
Is there a means to modify HTCondor's indexing of GPUs, e.g. to just follow
the order provided in DetectedGPUs?

Thanks,
 Steffen

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/