You can configure

    OFFLINE_MACHINE_RESOURCE_GPUS = CUDA0

to prevent HTCondor from assigning that GPU to a slot.

-tj
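A minimal sketch of where that knob could be set (the file location and the need for a startd restart are assumptions, not part of the reply above):

    # condor_config.local on the exec node (location assumed)
    # Treat the CUDA0 device as offline: the startd will not
    # assign it to any slot, and the machine then advertises
    # one fewer GPU. A startd restart is likely required for
    # a change to machine resources to take effect.
    OFFLINE_MACHINE_RESOURCE_GPUS = CUDA0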

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Pelletier

Hi folks,

I'm working on getting a new exec node stood up with multiple GPUs, for use by jobs which need dedicated GPU assignment – a first in our pools. Other jobs I've dealt with had an internal lock-and-queue mechanism for sharing all the GPUs on the system, so I didn't need to worry about HTCondor assignments.

I'd like to be able to prevent HTCondor from assigning a GPU that's already in use by a non-HTCondor process to one of its jobs. I wrote a wrapper for nvidia-smi which pulls in an ad like so:

    hostname$ /user/condor/libexec/condor_nvidia_probe
    CUDA0FreeGlobalMemory = 2441
    CUDA0UtilizationPct = 100
    CUDA1FreeGlobalMemory = 4031
    CUDA1UtilizationPct = 0
    CUDA2FreeGlobalMemory = 4031
    CUDA2UtilizationPct = 0
    CUDA3FreeGlobalMemory = 4031
    CUDA3UtilizationPct = 0
    CUDAFreeGlobalMemory = 14534
    CUDAUtilization = 25.0
    hostname$
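A rough sketch of what such a probe could look like (the actual condor_nvidia_probe source was not posted; this assumes an nvidia-smi new enough to support --query-gpu, and copies the attribute names from the output above):

    #!/bin/sh
    # Hypothetical sketch of a condor_nvidia_probe-style wrapper;
    # the real script is not shown. Prints one pair of ClassAd
    # attributes per GPU plus pool-wide totals, matching the
    # output format above.
    nvidia-smi --query-gpu=memory.free,utilization.gpu \
               --format=csv,noheader,nounits |
    awk -F', *' '
        {
            printf "CUDA%dFreeGlobalMemory = %d\n", NR - 1, $1
            printf "CUDA%dUtilizationPct = %d\n",  NR - 1, $2
            free += $1; util += $2
        }
        END {
            printf "CUDAFreeGlobalMemory = %d\n", free
            printf "CUDAUtilization = %.1f\n", util / NR
        }'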
So in the above case, I'd like to prevent any HTCondor job from being assigned the CUDA0 device, since it's 100% utilized, and preferably to advertise one fewer GPU available on the system. Is there any means to do this? I've been mulling over the kinds of expressions I think I might need, and my brain is starting to hurt a bit.

Michael V. Pelletier |
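One direction those expressions might take, sketched under the assumption that the probe output is merged into the machine ad by a startd cron job (the STARTD_CRON knobs are standard HTCondor configuration; the 90% threshold and the use of AssignedGPUs with stringListMember are guesses for illustration, not a tested recipe):

    # Run the probe periodically and merge its attributes
    # into the machine ad:
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) NVIDIA_PROBE
    STARTD_CRON_NVIDIA_PROBE_EXECUTABLE = /user/condor/libexec/condor_nvidia_probe
    STARTD_CRON_NVIDIA_PROBE_PERIOD = 60

    # Refuse new jobs on a slot whose assigned GPU is already
    # busy (90 is an arbitrary threshold for this sketch):
    START = $(START) && \
        ( !stringListMember("CUDA0", AssignedGPUs) || CUDA0UtilizationPct < 90 ) && \
        ( !stringListMember("CUDA1", AssignedGPUs) || CUDA1UtilizationPct < 90 )

Unlike the static OFFLINE_MACHINE_RESOURCE_GPUS approach in the reply above, a START expression like this only keeps jobs off the busy device; the slot and its GPU would still be advertised.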