
Re: [HTCondor-users] Change to long GPU UUIDs



You will need to restart, because long UUIDs are enabled by an argument to condor_gpu_discovery, and the STARTD runs that tool only on startup, not on reconfig.

The STARTD also does not track GPUs by the long IDs internally, but even if it did, a restart would still be needed, since discovery only runs at startup.

Our empirical testing shows that the short UUIDs are sufficiently unique to prevent any confusion on a single machine.  Have you found a machine where that is not true?
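
One way to check the uniqueness claim on a given machine is sketched below in Python. It assumes the short form is "GPU-" followed by the first 8 hex digits of the full UUID; the helper names short_uuid and find_collisions are hypothetical and not part of HTCondor, and the UUIDs in the usage lines are made-up examples.

```python
from collections import defaultdict


def short_uuid(long_uuid: str) -> str:
    """Derive the short form from a full GPU UUID.

    Assumption: the short form is the "GPU" prefix plus the first
    8 hex digits of the full UUID.
    """
    prefix, _, rest = long_uuid.partition("-")
    return f"{prefix}-{rest[:8]}"


def find_collisions(long_uuids):
    """Return {short_uuid: [long_uuids...]} for short forms shared by more than one GPU."""
    groups = defaultdict(list)
    for uuid in long_uuids:
        groups[short_uuid(uuid)].append(uuid)
    return {short: uuids for short, uuids in groups.items() if len(uuids) > 1}


# Made-up example UUIDs for two GPUs on one machine.
gpus = [
    "GPU-c4a646d7-aa14-1dd1-f1b0-57288cda864d",
    "GPU-6a96bd13-70bc-6494-6d62-1b77a9a7f29f",
]
collisions = find_collisions(gpus)
```

An empty result means the short UUIDs are distinct on that machine; any entry in the result would be a counterexample.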

-tj



From: HTCondor-users on behalf of Steffen Grunewald
Sent: Thursday, April 17, 2025 7:30 AM
To: HTCondor Users Mailinglist
Subject: [HTCondor-users] Change to long GPU UUIDs

Good afternoon,

for some very specific reason, we need to change our GPU machine setup
to report long, not shortened, GPU UUIDs (condor_gpu_discovery -uuid).
Currently the STARTD only knows about the short UUIDs, and several
dynamic slots have been created with corresponding "AssignedGPUs" set.

May I safely assume that GPU resources are indexed internally *not* by
their (short or long) UUID, so I could just do a "condor_reconfigure"
to switch to long UUIDs - or would this (a) require something more
drastic (e.g., condor_restart -startd) or (b) make the partitionable
slot lose track of resources already scheduled?

In short, would it be better to wait for the machine to become idle?

Thanks,
 Steffen


PS. For the curious: it turns out that "jax" supports long UUIDs in
CUDA_VISIBLE_DEVICES, and a single short UUID, but not multiple
short UUIDs. This isn't documented anywhere, and nobody seems to run
jax code in an HTCondor context.
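
A minimal Python sketch for spotting the problematic case described above, i.e. more than one short UUID in CUDA_VISIBLE_DEVICES. It assumes the long form looks like GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx and the short form like GPU-xxxxxxxx; the helper names are hypothetical, and the jax behaviour is only as reported here, not verified against jax itself.

```python
import re

# Assumed UUID shapes: full NVIDIA-style UUID vs. 8-hex-digit short form.
LONG_UUID = re.compile(
    r"^GPU-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$", re.IGNORECASE
)
SHORT_UUID = re.compile(r"^GPU-[0-9a-f]{8}$", re.IGNORECASE)


def classify(entry: str) -> str:
    """Classify one CUDA_VISIBLE_DEVICES entry as 'long', 'short', or 'other'."""
    if LONG_UUID.match(entry):
        return "long"
    if SHORT_UUID.match(entry):
        return "short"
    return "other"  # e.g. a plain device index such as "0"


def has_multiple_short_uuids(value: str) -> bool:
    """True if the comma-separated value contains more than one short UUID."""
    entries = [e.strip() for e in value.split(",") if e.strip()]
    return sum(classify(e) == "short" for e in entries) > 1
```

A job wrapper could call has_multiple_short_uuids(os.environ.get("CUDA_VISIBLE_DEVICES", "")) after importing os and warn before starting jax.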

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~