Hi all,
After a while, I came back to look into this issue, but I am completely lost. We are running HTCondor 24.0.11 on AlmaLinux 9 machines. I can't see why STARTER_HIDE_GPU_DEVICES=True is not working for our GPUs, while it works fine at other sites.
In the logs I see:
10/06/25 13:11:13 (pid:1) Successfully moved procid 1 to cgroup /sys/fs/cgroup/system.slice/htcondor/condor_home_execute_slot2_1@xxxxxxxxxxxx/cgroup.procs
10/06/25 13:11:13 (pid:1) cgroup v2 successfully installed bpf program to limit access to devices
10/06/25 13:11:13 (pid:10613) Create_Process succeeded, pid=10615
But if I run nvidia-smi (or any other GPU workload) inside an HTCondor job, I get:
No devices were found
On the other hand, the CUDA_VISIBLE_DEVICES variable is correct.
It feels like this is related to our cgroups v2 setup or the GPU driver, but after checking many things, I don't know where else to look. Do you have any suggestions?
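In case it helps to narrow this down, here is a minimal sketch of a check that could be run inside the job. It assumes the NVIDIA device nodes live under /dev/nvidia* (the usual location, but an assumption about this setup), and the helper name probe_devices is hypothetical. It prints CUDA_VISIBLE_DEVICES and tries to open each device node. An error such as EPERM on the assigned GPU would suggest that the device filter is blocking it, while "ok" would point more towards the driver side.

```python
import errno
import glob
import os


def probe_devices(pattern="/dev/nvidia*"):
    """Try to open each matching device node and record the outcome.

    Returns a dict mapping path -> "ok" or the symbolic errno name
    (for example "EPERM" or "EACCES") raised by open().
    """
    results = {}
    for path in sorted(glob.glob(pattern)):
        if os.path.isdir(path):
            # Skip directories such as /dev/nvidia-caps.
            continue
        try:
            fd = os.open(path, os.O_RDWR)
        except OSError as exc:
            results[path] = errno.errorcode.get(exc.errno, str(exc.errno))
        else:
            os.close(fd)
            results[path] = "ok"
    return results


if __name__ == "__main__":
    # What the job environment claims is assigned to this slot.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    # What the job is actually allowed to open.
    for path, status in probe_devices().items():
        print(path, status)
```

Running the same script once inside a job and once directly on the execute node should show whether the difference comes from the job environment.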
Best regards,
Carles