
[HTCondor-users] GPU sort order, Re: Make only one GPU available to HTCondor?



Good morning/afternoon/...,

for current reasons I have to dig this up again, since there were no
responses last year:

On Mon, 2024-07-29 at 16:22:42 +0200, Steffen Grunewald wrote:
> Hi all,
> 
> the subject says it: We want to make a single GPU of a particular machine
> available to HTCondor. How do I select a specific ID, or just the last in
> the "DetectedGPUs" list created by condor_gpu_discovery?

It turns out that HTCondor doesn't seem to obey the order in DetectedGPUs
(which matches the order returned by `nvidia-smi`, which in turn seems to be
the same as the `gpustat` output). Instead, it appears to order the GPUs by
their UUIDs (at least if the model is the same?).

This makes a huge difference when assigning e.g. 7 GPUs to a disabled slot
and the remaining one to an active one: in our case, it was GPU #4 (out of
#0..#7, not the one users would see as #7) that was used, much to the
surprise of both the non-HTCondor user and the HTCondor user when the clash
occurred.
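
To illustrate the effect, here is a minimal Python sketch. The UUIDs are
made up, the helper name `last_by_uuid` is hypothetical, and the assumption
that GPUs are handed out in plain string-sorted UUID order is only my
observation, not confirmed HTCondor behaviour:

```python
# Made-up short UUIDs, listed in detection order (index 0..7), i.e. the
# order one would expect from DetectedGPUs / nvidia-smi.
detected = [
    "GPU-3f2a", "GPU-91bc", "GPU-0d47", "GPU-a5e0",
    "GPU-f8c1", "GPU-6b19", "GPU-27d3", "GPU-c0ee",
]


def last_by_uuid(gpus):
    """Return (detection_index, uuid) of the GPU that comes last when the
    list is sorted by UUID string (assumed ordering, see above)."""
    uuid = sorted(gpus)[-1]
    return gpus.index(uuid), uuid


if __name__ == "__main__":
    idx, uuid = last_by_uuid(detected)
    # Users expect the "remaining" GPU to be the last detected one (#7),
    # but under UUID ordering it is a different device.
    print(f"last detected: #{len(detected) - 1} {detected[-1]}")
    print(f"last by UUID:  #{idx} {uuid}")
```

With these example values, the "last" GPU by UUID is the one users see as
#4, not #7, which is the kind of mismatch we ran into.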

Can this be avoided, i.e., can I select a GPU (by its UUID, or bus ID, or
anything else) to "put into a slot"?
Is there a means to modify HTCondor's indexing of GPUs, e.g. to just follow
the order provided in DetectedGPUs? 

Thanks,
 Steffen

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~