
Re: [HTCondor-users] Bad CUDA_VISIBLE_DEVICES with MIG since v24.0.14




Hi Panos,

Thank you very much for the detailed bug report, detective work, and the configuration work-around.
We will try to address this in the next release, as we really want MIG to work out of the box. The development ticket for this item is:
   https://opensciencegrid.atlassian.net/browse/HTCONDOR-3567

Thanks again and apologies for the regression,
regards,
Todd


On 2/25/2026 8:50 AM, Panagiotis Gkonis via HTCondor-users wrote:
Hello,

We've noticed an issue with the environment variable CUDA_VISIBLE_DEVICES (and apparently NVIDIA_VISIBLE_DEVICES) set by the starter on slots that have MIG GPUs.
This seems to come from this commit:
    https://github.com/htcondor/htcondor/commit/f2fd7b3fcdfe18beecaa4cc41272bb4fa70e7a72
Which builds upon:
    https://github.com/htcondor/htcondor/commit/d7638bca1da3178aaf4408afa5d8324d35399840

With MIG GPUs "condor_gpu_discovery -extra" returns:

DetectedGPUs="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
Common=[ ComputeUnits=16; DeviceName="NVIDIA H100L-1-12C MIG 1g.12gb"; DeviceUuid="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"; GlobalMemoryMb=10564; ]
MIG_41863e15_8022_51a8_9f75_9490dc788c4d=[ id="MIG-41863e15-8022-51a8-9f75-9490dc788c4d"; ]

Notice that the DeviceUuid has a "MIG-" prefix (with regular GPUs, the DeviceUuid is a plain UUID with no prefix).

This leads to CUDA_VISIBLE_DEVICES being set as: "CUDA_VISIBLE_DEVICES=GPU-MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
which results in the job's executable not seeing any GPUs.

A valid value would have been: "CUDA_VISIBLE_DEVICES=MIG-41863e15-8022-51a8-9f75-9490dc788c4d"
although, based on local testing, "CUDA_VISIBLE_DEVICES=GPU-41863e15-8022-51a8-9f75-9490dc788c4d" also works.
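
The symptom above can be reproduced with a minimal sketch. This is not the starter's actual code; the function names are hypothetical, and the sketch only assumes that "GPU-" is prepended to every DeviceUuid unconditionally, which is what the observed value suggests. The second function shows one possible guard: leave ids alone when they already carry a "GPU-" or "MIG-" prefix.

```python
# Hypothetical illustration only; not taken from the HTCondor source.

def visible_devices_unconditional(uuids):
    """Prepend "GPU-" to every id, mimicking the observed behaviour."""
    return ",".join("GPU-" + u for u in uuids)


def visible_devices_guarded(uuids):
    """Prepend "GPU-" only when the id has no "GPU-" or "MIG-" prefix yet."""
    out = []
    for u in uuids:
        if u.startswith(("GPU-", "MIG-")):
            out.append(u)          # already a usable device id
        else:
            out.append("GPU-" + u)  # plain uuid from a regular GPU
    return ",".join(out)


if __name__ == "__main__":
    mig = ["MIG-41863e15-8022-51a8-9f75-9490dc788c4d"]
    print("CUDA_VISIBLE_DEVICES=" + visible_devices_unconditional(mig))
    print("CUDA_VISIBLE_DEVICES=" + visible_devices_guarded(mig))
```

With the MIG id from the discovery output, the first function yields the broken "GPU-MIG-..." value and the second yields the plain "MIG-..." value.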

We've temporarily mitigated this by setting:
    AUTO_SET_NVIDIA_VISIBLE_DEVICES = False
and
    ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  CUDA_VISIBLE_DEVICES=/CUDA//  NVIDIA_VISIBLE_DEVICES=/CUDA//
in the starter's config.
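
To show what the ENVIRONMENT_FOR_AssignedGPUs line is expected to produce, here is a rough Python sketch. It rests on two assumptions: that each entry has the form NAME=/pattern/replacement/ with the substitution applied to every assigned GPU id, and that the assigned id for the MIG slot is the same string shown in DetectedGPUs. The helper name apply_env_rule is hypothetical, and the real matching rules in HTCondor may differ in detail.

```python
import re

# Hypothetical helper; a sketch of the assumed NAME=/pattern/replacement/ rule.
def apply_env_rule(rule, assigned_ids):
    """Return (variable name, comma-separated value) for one rule entry."""
    name, _, expr = rule.partition("=")
    if expr:
        # "/pattern/replacement/" splits into ['', pattern, replacement, '']
        _, pattern, replacement, _ = expr.split("/")
        assigned_ids = [re.sub(pattern, replacement, i) for i in assigned_ids]
    return name, ",".join(assigned_ids)


if __name__ == "__main__":
    rules = ("GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  "
             "CUDA_VISIBLE_DEVICES=/CUDA//  "
             "NVIDIA_VISIBLE_DEVICES=/CUDA//").split()
    assigned = ["MIG-41863e15-8022-51a8-9f75-9490dc788c4d"]
    for rule in rules:
        name, value = apply_env_rule(rule, assigned)
        print(f"{name}={value}")
```

Under these assumptions none of the patterns match the MIG id, so each variable receives the unmodified "MIG-..." string and no "GPU-" prefix is added.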


This seems like a bug in either the starter or the GPU discovery, or perhaps we're missing something?

Best regards,
Panos



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1205 University Ave.
Phone: (608) 263-7132                   Madison, WI 53706
Personal Zoom Room: https://uwmadison.zoom.us/my/tannenba