Hi,
We've been upgrading our cluster from a rather old version of Condor to 24.0.13, and we've run into an issue where jobs that request GPUs are unable to run workloads on their assigned GPU.
All the environment variables you would expect are there: _CONDOR_AssignedGPUs, CUDA_VISIBLE_DEVICES, etc. However, inside the job, running things like `condor_gpu_discovery` or `nvidia-smi`, or running a GPU workload, complains about there being no visible devices. Running the same commands directly on the execute point works just fine.
Digging through the starter logs we noticed:
`cgroup v2 could not attach gpu device limiter to cgroup: Operation not permitted`
which appears to come from this code in the starter:
if (attach_result != 0) {
  dprintf(D_ALWAYS, "cgroup v2 could not attach gpu device limiter to cgroup: %s\n", strerror(errno));
  close(cgroup_fd);
  close(bpf_fd);
  return false;
}
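For context, my rough understanding is that the "gpu device limiter" here is a BPF_PROG_TYPE_CGROUP_DEVICE program being attached to the job's cgroup, i.e. something along these lines at the syscall level. This is just my paraphrase, not the actual HTCondor code; cgroup_fd and bpf_fd only mirror the names in the snippet above:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Sketch only: attach an already-loaded BPF_PROG_TYPE_CGROUP_DEVICE
 * program (bpf_fd) to an open cgroup v2 directory fd (cgroup_fd). */
static int attach_device_prog(int cgroup_fd, int bpf_fd)
{
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.target_fd     = cgroup_fd;        /* the job's cgroup */
    attr.attach_bpf_fd = bpf_fd;           /* the device filter program */
    attr.attach_type   = BPF_CGROUP_DEVICE;
    attr.attach_flags  = 0;                /* not sure what flags the starter actually passes */

    if (syscall(SYS_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr)) != 0) {
        /* EPERM here would match the message we see in the starter log */
        fprintf(stderr, "attach failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}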
I'm not particularly familiar with cgroup v2 and BPF together, but poking around with bpftool, it looks like old handles may be kept around from previous cgroups on the slot. At least, doing a full drain of the startd allows the first job on the slot to use its assigned GPU.
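In case it's useful, this is roughly the check we were doing with bpftool, sketched as a small standalone program (my approximation of what `bpftool cgroup list <path>` does; the cgroup path is a placeholder you'd replace with the slot's job cgroup):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int main(void)
{
    /* Placeholder: substitute the slot's actual job cgroup path. */
    const char *cg_path = "/sys/fs/cgroup/path/to/job/cgroup";
    int cgroup_fd = open(cg_path, O_RDONLY);
    if (cgroup_fd < 0) {
        perror("open cgroup");
        return 1;
    }

    __u32 prog_ids[64];
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.query.target_fd   = cgroup_fd;
    attr.query.attach_type = BPF_CGROUP_DEVICE;
    attr.query.prog_ids    = (__u64)(unsigned long)prog_ids;
    attr.query.prog_cnt    = 64;

    if (syscall(SYS_bpf, BPF_PROG_QUERY, &attr, sizeof(attr)) != 0) {
        fprintf(stderr, "BPF_PROG_QUERY failed: %s\n", strerror(errno));
        close(cgroup_fd);
        return 1;
    }

    /* A non-zero count here after the job has exited would suggest a
     * device program is still attached to the old cgroup. */
    printf("attached device programs: %u\n", attr.query.prog_cnt);
    close(cgroup_fd);
    return 0;
}

(You'd likely need to run it as root, same as bpftool.)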
Any help or thoughts on what might be going on here would be appreciated, and I'm happy to provide additional output.
Alec