You can force condor_gpu_discovery to do OpenCL detection by adding the -opencl argument.

    condor_gpu_discovery -opencl -extra

Otherwise it will prefer CUDA detection over OpenCL detection, and will never do both, so that it doesn’t end up overcounting GPUs that show up both ways.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Chris Brew - STFC UKRI via HTCondor-users

Hi,

This is all with Condor 10.0.7 on Rocky Linux 8.

I’ve got a test node with a couple of AMD Instinct MI GPGPU cards (i.e. not CUDA) in it, but I’m having no luck getting them to show up in the machine ClassAds.

condor_gpu_discovery sees them fine:

    # /usr/libexec/condor/condor_gpu_discovery -extra -properties
    DetectedGPUs="OCL0, OCL1"
    Common=[ ClockMhz=1700; ComputeUnits=104; DeviceName="gfx90a:sramecc+:xnack-"; ECCEnabled=false; GlobalMemoryMb=65520; OpenCLVersion=2.0; ]
    OCL0=[ id="OCL0"; ]
    OCL1=[ id="OCL1"; ]

But the StartD doesn’t:

    # grep -i gpu /var/log/condor/StartLog
    09/29/23 10:39:03 /etc/condor/config.d/19gpu.config
    09/29/23 10:39:03 /etc/condor/config.d/30start_gpu.config
    09/29/23 10:39:06 Local machine resource GPUs = 0
    09/29/23 10:39:06 Allocating auto shares for slot type 1: Cpus: 96.000000, Memory: 257000, Swap: auto, Disk: auto, GPUs: auto
    09/29/23 10:39:06 slot type 1: Cpus: 96.000000, Memory: 257000, Swap: 100.00%, Disk: 100.00%, GPUs: 0
    09/29/23 10:39:06 bind DevIds tag=GPUs contraint=
    09/29/23 10:39:06 CronJobList: Adding job 'GPUs_MONITOR'
    09/29/23 10:39:06 CronJob: Initializing job 'GPUs_MONITOR' (/usr/libexec/condor/condor_gpu_utilization)

19gpu.config only contains:

    use feature : GPUs
    GPU_DISCOVERY_EXTRA = -extra

And 30start_gpu.config only contains:

    START = $(START) && ( (RequestGPUs >= 1) )

I thought it might be because of /usr/libexec/condor/condor_gpu_utilization, which does not seem to work for non-CUDA cards:

    # /usr/libexec/condor/condor_gpu_utilization
    # Unable to load a CUDA library (libcuda.so or libcudart.so). Hanging to prevent process churn.

But I think I managed to disable that by expanding the ‘use feature:GPUs’ and removing the ‘use feature:GpuMonitor’.

I’m now stuck. I have a very vague recollection that when I first got some NVIDIA cards they showed up as OpenCL devices. Did I do something then to make them show up as CUDA devices that’s preventing these devices showing
up? condor_config_val -dump doesn’t show any likely suspects.

It’s entirely possible I haven’t got the drivers and/or software correctly installed, but rocm-smi and rocminfo do see them as expected.

Thanks,

Chris.
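
For reference, the -opencl suggestion at the top of this thread could be carried into the startd configuration roughly as shown here. This is only a sketch: it assumes the startd still passes GPU_DISCOVERY_EXTRA through to condor_gpu_discovery, as the stock 'use feature : GPUs' metaknob is understood to do. If the metaknob has been expanded by hand, the flag would instead need to go wherever the discovery arguments are now set. It has not been tested on the node described above.

```
# 19gpu.config (sketch; assumes GPU_DISCOVERY_EXTRA is still passed to condor_gpu_discovery)
use feature : GPUs
# -extra was already present; -opencl is the addition suggested in the reply
GPU_DISCOVERY_EXTRA = -extra -opencl
```

Once the startd has picked up the change, the "Local machine resource GPUs" line in StartLog should indicate whether the cards are now being counted.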