Hi,
This is all with Condor 10.0.7 on Rocky Linux 8.
I've got a test node with a couple of AMD Instinct MI GPGPU cards (i.e. not CUDA) in it, but I'm having no luck getting them to show up in the machine ClassAds.
condor_gpu_discovery sees them fine:
# /usr/libexec/condor/condor_gpu_discovery -extra -properties
DetectedGPUs="OCL0, OCL1"
Common=[ ClockMhz=1700; ComputeUnits=104; DeviceName="gfx90a:sramecc+:xnack-"; ECCEnabled=false; GlobalMemoryMb=65520; OpenCLVersion=2.0; ]
OCL0=[ id="OCL0"; ]
OCL1=[ id="OCL1"; ]
But the StartD doesn't:
# grep -i gpu /var/log/condor/StartLog
09/29/23 10:39:03 /etc/condor/config.d/19gpu.config
09/29/23 10:39:03 /etc/condor/config.d/30start_gpu.config
09/29/23 10:39:06 Local machine resource GPUs = 0
09/29/23 10:39:06 Allocating auto shares for slot type 1: Cpus: 96.000000, Memory: 257000, Swap: auto, Disk: auto, GPUs: auto
09/29/23 10:39:06 slot type 1: Cpus: 96.000000, Memory: 257000, Swap: 100.00%, Disk: 100.00%, GPUs: 0
09/29/23 10:39:06 bind DevIds tag=GPUs contraint=
09/29/23 10:39:06 CronJobList: Adding job 'GPUs_MONITOR'
09/29/23 10:39:06 CronJob: Initializing job 'GPUs_MONITOR' (/usr/libexec/condor/condor_gpu_utilization)
19gpu.config only contains:
use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra
And 30start_gpu.config only contains:
START = $(START) && ( (RequestGPUs >= 1) )
I thought it might be because of /usr/libexec/condor/condor_gpu_utilization, which does not seem to work for non CUDA cards:
# /usr/libexec/condor/condor_gpu_utilization
# Unable to load a CUDA library (libcuda.so or libcudart.so).
Hanging to prevent process churn.
But I think I managed to disable that by expanding the "use feature:GPUs" and removing the "use feature:GpuMonitor".
I'm now stuck. I have a very vague recollection that when I first got some NVidia cards they showed up as OpenCL devices. Did I do something then to make them show up as CUDA devices that's preventing these devices showing up? condor_config_val -dump doesn't show any likely suspects.
It's entirely possible I haven't got the drivers and/or software correctly installed, but rocm-smi and rocminfo do see them as expected.
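In case it helps anyone reproduce the comparison, here is a minimal Python sketch of the mismatch I'm describing. The helper names (`detected_gpus`, `startd_gpu_count`) are hypothetical and not part of HTCondor; the sketch only assumes the `DetectedGPUs="..."` and `Local machine resource GPUs = N` line formats shown in the output above, and the sample strings are copied from it.

```python
import re
from typing import List, Optional


def detected_gpus(discovery_output: str) -> List[str]:
    """Return the device ids on the DetectedGPUs line, or [] if none are listed."""
    match = re.search(r'^DetectedGPUs="([^"]*)"', discovery_output, re.MULTILINE)
    if not match:
        return []
    return [dev.strip() for dev in match.group(1).split(",") if dev.strip()]


def startd_gpu_count(start_log: str) -> Optional[int]:
    """Return the most recent 'Local machine resource GPUs = N' value, if any."""
    counts = re.findall(r"Local machine resource GPUs = (\d+)", start_log)
    return int(counts[-1]) if counts else None


# Sample text taken from the output quoted in this message.
sample_discovery = 'DetectedGPUs="OCL0, OCL1"\nOCL0=[ id="OCL0"; ]\nOCL1=[ id="OCL1"; ]\n'
sample_log = "09/29/23 10:39:06 Local machine resource GPUs = 0\n"

devices = detected_gpus(sample_discovery)
count = startd_gpu_count(sample_log)
print("discovery lists", len(devices), "device(s):", devices)
print("StartLog reports GPUs =", count)
if count is not None and count != len(devices):
    print("mismatch: the startd is not picking up the discovered devices")
```

On a real node the two strings would come from running condor_gpu_discovery and reading /var/log/condor/StartLog rather than from the hard-coded samples.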
Thanks,
Chris.