Good to hear. And thank you for sharing your solution.

From: Chris Brew - STFC UKRI <chris.brew@xxxxxxxxxx>

Thanks TJ, though condor_gpu_discovery does find the OpenCL GPUs without that. However, testing that might just have led me to stumble on the answer, spot the difference:

$ /usr/libexec/condor/condor_gpu_discovery -opencl -extra
DetectedGPUs=0

$ sudo /usr/libexec/condor/condor_gpu_discovery -opencl -extra
DetectedGPUs="OCL0, OCL1"
Common=[ ClockMhz=1700; ComputeUnits=104; DeviceName="gfx90a:sramecc+:xnack-"; ECCEnabled=false; GlobalMemoryMb=65520; OpenCLVersion=2.0; ]
OCL0=[ id="OCL0"; ]
OCL1=[ id="OCL1"; ]

This works without privileges:

$ rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    32.0c  41.0W   800Mhz  1600Mhz  0%   auto  300.0W  0%     0%
1    32.0c  40.0W   800Mhz  1600Mhz  0%   auto  300.0W  0%     0%
================================================================================
============================= End of ROCm SMI Log ==============================

But not this:

$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
brew is not member of "video" group, the default DRM access group. Users must be a member of the "video" group or another DRM access group in order for ROCm applications to run successfully.

But I want anyone I let onto the host to use the GPUs, that’s sort of the point:

$ ls -l /dev/kfd
crw-rw---- 1 root video 241, 0 Sep 22 06:38 /dev/kfd
[brew@hepacc13 ~]$ sudo chmod o+rw /dev/kfd

And now this works:

$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1

One quick condor restart later:

$ condor_status -l hepacc13 | grep -i gpu
AssignedGPUs = "OCL0,OCL1"
AvailableGPUs = { GPUs_OCL0,GPUs_OCL1 }
ChildGPUs = { }
DetectedGPUs = "OCL0, OCL1"
GPUs = 2
GPUs_ClockMhz = 1700
GPUs_ComputeUnits = 104
GPUs_DeviceName = "gfx90a:sramecc+:xnack-"
…

Not a condor problem, sorry for the noise and thank you for the sounding board.

Yours,
Chris.
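
For anyone who hits the same symptom (GPUs detected under sudo but DetectedGPUs=0 otherwise), the deciding factor in this thread was whether the calling account can open /dev/kfd read-write. A minimal sketch of that check follows. The helper name check_rw is hypothetical and is not part of HTCondor or ROCm; it only tests file permissions and does not prove that the ROCm runtime will initialise.

```shell
#!/bin/sh
# check_rw: hypothetical helper, not an HTCondor or ROCm tool.
# Reports whether the current user has read and write permission on a path.
# Return codes: 0 = read-write OK, 1 = access denied, 2 = path missing.
check_rw() {
    dev="$1"
    if [ ! -e "$dev" ]; then
        echo "$dev: not present"
        return 2
    fi
    if [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "$dev: read-write access OK"
        return 0
    fi
    echo "$dev: no read-write access for $(id -un)"
    return 1
}

# /dev/kfd is the device node rocminfo complained about in this thread.
check_rw /dev/kfd || true
```

Run it as the same unprivileged account that fails to see the GPUs, not as root, since root passes the check regardless of the mode bits. Note also that a chmod on a device node is typically lost on reboot because /dev is recreated at boot; a udev rule or membership of the "video" group named in the rocminfo message is the usual way to make the access persistent.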