
[HTCondor-users] Bad CUDA_VISIBLE_DEVICES with MIG since v24.0.14



Hello,

We've noticed an issue with the environment variable CUDA_VISIBLE_DEVICES (and apparently NVIDIA_VISIBLE_DEVICES) set by the starter on slots that have MIG GPUs.
This seems to come from this commit:
    https://github.com/htcondor/htcondor/commit/f2fd7b3fcdfe18beecaa4cc41272bb4fa70e7a72
which builds upon:
    https://github.com/htcondor/htcondor/commit/d7638bca1da3178aaf4408afa5d8324d35399840

With MIG GPUs "condor_gpu_discovery -extra" returns:

DetectedGPUs="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
Common=[ ComputeUnits=16; DeviceName="NVIDIA H100L-1-12C MIG 1g.12gb"; DeviceUuid="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"; GlobalMemoryMb=10564; ]
MIG_41863e15_8022_51a8_9f75_9490dc788c4d=[ id="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"; ]

Notice that the DeviceUuid has a "MIG-" prefix (with regular GPUs, the DeviceUuid is a plain UUID with no prefix).

This leads to CUDA_VISIBLE_DEVICES being set as: "CUDA_VISIBLE_DEVICES=GPU-MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
which results in the job's executable not seeing any GPUs.

A valid value would have been: "CUDA_VISIBLE_DEVICES=MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
although, based on local testing, "CUDA_VISIBLE_DEVICES=GPU-41863e15-8022-51a8-9f75-9490dc788c4d" also works.
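
To illustrate what we suspect is happening, here is a minimal Python sketch. It is not HTCondor's actual code: the function names are hypothetical, and it assumes the starter simply prepends "GPU-" to whatever DeviceUuid it is given. The second function shows one possible prefix-aware alternative.

```python
# Hypothetical sketch; these helpers are NOT part of HTCondor.
KNOWN_PREFIXES = ("GPU-", "MIG-")


def naive_visible_device(device_uuid: str) -> str:
    """Always prepend "GPU-" (assumed behaviour that yields the bad value)."""
    return "GPU-" + device_uuid


def prefix_aware_visible_device(device_uuid: str) -> str:
    """Prepend "GPU-" only if the UUID carries no known prefix already."""
    if device_uuid.startswith(KNOWN_PREFIXES):
        return device_uuid
    return "GPU-" + device_uuid


mig_uuid = "MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
print("CUDA_VISIBLE_DEVICES=" + naive_visible_device(mig_uuid))
print("CUDA_VISIBLE_DEVICES=" + prefix_aware_visible_device(mig_uuid))
```

The first line printed matches the bad value we observe; the second matches the value we would expect for a MIG device.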

We've temporarily mitigated this by setting:
    AUTO_SET_NVIDIA_VISIBLE_DEVICES = False
and
    ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  CUDA_VISIBLE_DEVICES=/CUDA//  NVIDIA_VISIBLE_DEVICES=/CUDA//
in the starter's config.
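
For reference, this is how we read the effect of that setting, as a rough Python sketch. It assumes each NAME=/regex/replacement/ entry is applied as a regular-expression substitution to every assigned GPU id and the results are joined with commas. The helper name build_env is hypothetical, and this only mimics our understanding of the documented behaviour, not the starter's implementation.

```python
import re

# The value we set for ENVIRONMENT_FOR_AssignedGPUs.
CONFIG_VALUE = (
    "GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  "
    "CUDA_VISIBLE_DEVICES=/CUDA//  "
    "NVIDIA_VISIBLE_DEVICES=/CUDA//"
)


def build_env(config_value: str, assigned_ids: list) -> dict:
    """Hypothetical helper: apply each NAME=/regex/replacement/ rule to the ids."""
    env = {}
    for rule in config_value.split():
        name, _, spec = rule.partition("=")
        # "/pattern/replacement/" splits into ['', pattern, replacement, '']
        _, pattern, replacement, _ = spec.split("/")
        env[name] = ",".join(re.sub(pattern, replacement, i) for i in assigned_ids)
    return env


# A MIG id contains neither "CUDA" nor "OCL", so it passes through unchanged.
for name, value in build_env(
    CONFIG_VALUE, ["MIG-41863e15-8022-51a8-9f75-9490dc788c4d"]
).items():
    print(f"{name}={value}")
```

Under these assumptions no extra "GPU-" prefix is added, so the job sees the plain "MIG-..." id in all three variables.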


This seems like a bug in either the starter or the GPU discovery, or perhaps we're missing something?

Best regards,
Panos