Re: [HTCondor-users] job cannot access gpu - "cgroup v2 could not attach gpu device limiter to cgroup: Operation not permitted"



Hi Alec,

We found a similar issue, although it doesn't seem to be exactly the same as yours. In our case, it was caused by having the -not-nested option in GPU_DISCOVERY_EXTRA and STARTER_HIDE_GPU_DEVICES set to True. When we removed the -not-nested option, everything worked correctly.

Do you have something similar in your configuration? If you set STARTER_HIDE_GPU_DEVICES to False, do your jobs run and detect the GPU properly?
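
As a minimal sketch, the combination described above would look roughly like the fragment below. The exact contents of GPU_DISCOVERY_EXTRA are an assumption (at a real site it may carry other options as well); only the -not-nested flag and the value of STARTER_HIDE_GPU_DEVICES are taken from our case:

```
# Illustrative only: the combination that caused the problem for us
GPU_DISCOVERY_EXTRA = -not-nested
STARTER_HIDE_GPU_DEVICES = True

# Test suggested above: stop hiding unassigned GPU devices from the job
# STARTER_HIDE_GPU_DEVICES = False
```

Removing -not-nested from the first line is what fixed it for us; switching the second knob to False is only meant as a quick check of whether the device hiding is involved on your side.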

Best regards,

Carles

On Tue, 28 Oct 2025 at 20:49, Alec Sheperd <alec.sheperd@xxxxxxxxxxxxxxxx> wrote:
Hi,

We've been upgrading our cluster from a rather old version of Condor to 24.0.13, and we've run into an issue where jobs that request GPUs are not able to run workloads on the assigned GPU.

All the environment variables you would expect are there: _CONDOR_AssignedGPUs, CUDA_VISIBLE_DEVICES, etc. However, running `condor_gpu_discovery`, `nvidia-smi`, or a GPU workload fails with a complaint that there are no visible devices. Running the same commands locally on the execute point works just fine.

Digging through the starter logs, we noticed:
` cgroup v2 could not attach gpu device limiter to cgroup: Operation not permitted`

The error appears to originate from `install_bpf_gpu_filter` in https://github.com/htcondor/htcondor/blob/25e4e13190bff70d6c89e8d0577a1be5c88fcfe5/src/condor_utils/proc_family_direct_cgroup_v2.cpp#L463

if (attach_result != 0) {
  dprintf(D_ALWAYS, "cgroup v2 could not attach gpu device limiter to cgroup: %s\n", strerror(errno));
  close(cgroup_fd);
  close(bpf_fd);
  return false;
}

I'm not particularly familiar with how cgroup v2 and BPF work together, but while experimenting with bpftool we noticed that old handles may be kept around from previous cgroups on the slot. At least, doing a full drain of the startd allows the first job on the slot to use the assigned GPU.

Any help or thoughts on what might be going on here would be appreciated, and we're happy to provide additional output.

Alec

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es