Hi Panos,
Thank you very much for the detailed bug report, detective work,
and the configuration work-around.
We will try to address this for the next release, as we really
want MIG to work out-of-the-box. The development ticket for this
item is:
https://opensciencegrid.atlassian.net/browse/HTCONDOR-3567
Thanks again and apologies for the regression,
regards,
Todd
On 2/25/2026 8:50 AM, Panagiotis Gkonis via HTCondor-users wrote:
Hello,
We've noticed an issue with the environment variable
CUDA_VISIBLE_DEVICES (and apparently NVIDIA_VISIBLE_DEVICES) set
by the starter on slots that have MIG GPUs.
This seems to come from this commit:
Which builds upon:
With MIG GPUs "condor_gpu_discovery -extra" returns:
DetectedGPUs="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
Common=[ ComputeUnits=16; DeviceName="NVIDIA H100L-1-12C MIG 1g.12gb"; DeviceUuid="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"; GlobalMemoryMb=10564; ]
MIG_41863e15_8022_51a8_9f75_9490dc788c4d=[ id="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"; ]
Notice that the DeviceUuid has a "MIG-" prefix (with regular
GPUs, the DeviceUuid is a plain UUID with no prefix).
This leads to CUDA_VISIBLE_DEVICES being set as:
"CUDA_VISIBLE_DEVICES=GPU-MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
which results in the job's executable not seeing any GPUs.
A valid value would have been:
"CUDA_VISIBLE_DEVICES=MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
although, based on local testing,
"CUDA_VISIBLE_DEVICES=GPU-41863e15-8022-51a8-9f75-9490dc788c4d"
also works.
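
To make the prefix problem concrete, here is a minimal Python sketch of
building a visible-devices value without double-prefixing. It is not the
starter's actual code: the function name visible_devices_value is
hypothetical, and the only behaviour assumed is what is described above
(MIG UUIDs already carry a "MIG-" prefix, plain UUIDs carry none).

```python
def visible_devices_value(device_uuid: str) -> str:
    """Return one CUDA_VISIBLE_DEVICES entry for a device UUID.

    Hypothetical helper for illustration; not part of HTCondor.
    """
    # UUIDs that already carry a "MIG-" or "GPU-" prefix pass through unchanged.
    if device_uuid.startswith(("MIG-", "GPU-")):
        return device_uuid
    # Plain UUIDs get the "GPU-" prefix.
    return "GPU-" + device_uuid


if __name__ == "__main__":
    print(visible_devices_value("MIG-41863e15-8022-51a8-9f75-9490dc788c4d"))
    print(visible_devices_value("41863e15-8022-51a8-9f75-9490dc788c4d"))
```

With this check the MIG UUID from the discovery output is left as
"MIG-41863e15-...", rather than becoming "GPU-MIG-41863e15-...".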
We've temporarily mitigated this by setting:
    AUTO_SET_NVIDIA_VISIBLE_DEVICES = False
and
    ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA// NVIDIA_VISIBLE_DEVICES=/CUDA//
in the starter's config.
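
As a rough illustration of why this works, the sketch below mimics the
"/regex/replacement/" rewrite that each entry in
ENVIRONMENT_FOR_AssignedGPUs applies to an assigned GPU ID. It is an
approximation in Python, not HTCondor's implementation: the helper name
apply_rewrite is hypothetical, Python's re module stands in for
Perl-style regular expressions, and the assigned ID is assumed to be the
DetectedGPUs value shown earlier.

```python
import re


def apply_rewrite(spec: str, resource_id: str) -> str:
    """Apply a '/regex/replacement/' style spec to one resource ID.

    Hypothetical helper for illustration; assumes the spec contains
    no escaped slashes.
    """
    # "/CUDA//" splits into ['', 'CUDA', '', ''].
    _, pattern, replacement, _ = spec.split("/")
    return re.sub(pattern, replacement, resource_id)


if __name__ == "__main__":
    # A legacy-style ID has its "CUDA" prefix stripped.
    print(apply_rewrite("/CUDA//", "CUDA0"))
    # A MIG ID does not match the pattern, so it is passed through as-is.
    print(apply_rewrite("/CUDA//", "MIG-41863e15-8022-51a8-9f75-9490dc788c4d"))
```

Under these assumptions the MIG ID reaches CUDA_VISIBLE_DEVICES and
NVIDIA_VISIBLE_DEVICES unchanged, with no extra "GPU-" prefix added.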
This seems like a bug either with the starter or the gpu
discovery; or perhaps we're missing something?
Best regards,
Panos