Hello,
We've noticed an issue with the environment variable CUDA_VISIBLE_DEVICES (and apparently NVIDIA_VISIBLE_DEVICES) set by the starter on slots that have MIG GPUs.
This seems to come from this commit:
    https://github.com/htcondor/htcondor/commit/f2fd7b3fcdfe18beecaa4cc41272bb4fa70e7a72
Which builds upon:
    https://github.com/htcondor/htcondor/commit/d7638bca1da3178aaf4408afa5d8324d35399840
With MIG GPUs "condor_gpu_discovery -extra" returns:
DetectedGPUs="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
Common=[ ComputeUnits=16; DeviceName="NVIDIA H100L-1-12C MIG 1g.12gb"; DeviceUuid="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"; GlobalMemoryMb=10564; ]
MIG_41863e15_8022_51a8_9f75_9490dc788c4d=[ id="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"; ]
Notice that the DeviceUuid carries a "MIG-" prefix, whereas with regular GPUs the DeviceUuid is a plain UUID with no prefix.
This leads to CUDA_VISIBLE_DEVICES being set as: "CUDA_VISIBLE_DEVICES=GPU-MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
which results in the job's executable not seeing any GPUs.
A valid value would have been: "CUDA_VISIBLE_DEVICES=MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
although, based on local testing, "CUDA_VISIBLE_DEVICES=GPU-41863e15-8022-51a8-9f75-9490dc788c4d" also works.
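To make the expected behaviour concrete, here is a rough Python sketch of the prefixing logic we would have expected. It is not HTCondor's actual code: the function names are hypothetical, and it only assumes that IDs already starting with "MIG-" or "GPU-" should pass through unchanged while bare UUIDs get a "GPU-" prefix.

```python
def cuda_visible_id(device_uuid):
    """Hypothetical helper: return the ID to place in CUDA_VISIBLE_DEVICES."""
    # Assumption: an ID that already carries a known prefix must not be
    # prefixed again, otherwise we end up with "GPU-MIG-...".
    if device_uuid.startswith(("MIG-", "GPU-")):
        return device_uuid
    # Bare UUID, as reported for regular (non-MIG) GPUs.
    return "GPU-" + device_uuid


def cuda_visible_devices(device_uuids):
    """Join the per-device IDs into a comma-separated variable value."""
    return ",".join(cuda_visible_id(u) for u in device_uuids)


if __name__ == "__main__":
    print(cuda_visible_devices(["MIG-41863e15-8022-51a8-9f75-9490dc788c4d"]))
    print(cuda_visible_devices(["41863e15-8022-51a8-9f75-9490dc788c4d"]))
```

With the MIG UUID from the discovery output above, this yields the "MIG-..." form rather than "GPU-MIG-...".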
We've temporarily mitigated this by setting:
    AUTO_SET_NVIDIA_VISIBLE_DEVICES = False
and
    ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA// NVIDIA_VISIBLE_DEVICES=/CUDA//
in the starter's config.
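For reference, our understanding is that each NAME=/regex/replacement/ entry in ENVIRONMENT_FOR_AssignedGPUs applies a substitution to every assigned GPU ID before the variable is set. The Python sketch below mimics that under this assumption; apply_rule is a hypothetical helper, it does not reproduce the starter's real parsing, and whether the starter substitutes once or globally is not something we have verified.

```python
import re


def apply_rule(rule, assigned_ids):
    """Hypothetical model of one NAME=/regex/replacement/ entry."""
    name, _, pattern = rule.partition("=")
    # "/CUDA//" splits into ['', 'CUDA', '', ''].
    _, regex, replacement, _ = pattern.split("/")
    # Assumption: the substitution is applied to each ID separately.
    value = ",".join(re.sub(regex, replacement, i) for i in assigned_ids)
    return name, value


if __name__ == "__main__":
    mig_ids = ["MIG-41863e15-8022-51a8-9f75-9490dc788c4d"]
    for rule in (
        "GPU_DEVICE_ORDINAL=/(CUDA|OCL)//",
        "CUDA_VISIBLE_DEVICES=/CUDA//",
        "NVIDIA_VISIBLE_DEVICES=/CUDA//",
    ):
        name, value = apply_rule(rule, mig_ids)
        print(name + "=" + value)
```

Since the MIG ID contains neither "CUDA" nor "OCL", each rule leaves it untouched, which is why this setting gives the job a usable value.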
This looks like a bug in either the starter or condor_gpu_discovery, or perhaps we're missing something?
Best regards,
Panos