
Re: [HTCondor-users] job cannot access gpu - "cgroup v2 could not attach gpu device limiter to cgroup: Operation not permitted"



Hi Carles, Greg,

Thanks for the replies and info. Yes, we had stumbled across STARTER_HIDE_GPU_DEVICES, and that seemed to get us most of the way there, but only after setting GPU_DISCOVERY_EXTRA to just -extra and removing -not-nested.
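
For reference, a rough sketch of that change as execute-point configuration. It uses only the knobs named in this thread; the values are what we believe we ended up with, not a verified fix, and the commented-out line is the diagnostic Carles and Greg suggested rather than something we intend to keep.

```
# Sketch only: pass just -extra to GPU discovery, with -not-nested removed.
GPU_DISCOVERY_EXTRA = -extra

# Diagnostic suggested in this thread: uncomment to test whether hiding
# GPU devices from the job is what triggers the cgroup v2 error.
# STARTER_HIDE_GPU_DEVICES = False
```

The startd would need a reconfig or restart after the change before the slot ads reflect it.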

I suppose I have a couple of questions about these two things:

1. The cgroup isolation of GPU devices with BPF does seem useful, and it could be something we'd like to implement on our cluster in the future, so any help getting that configured correctly would be great.

2. I think the main reason we had been trying to use the -not-nested flag was purely for job requests, since the ads look a little different from what we had seen before. Dynamic slots do not display most GPU ads; in fact, they don't even seem to display the GPUs on the slot. The partitionable slot on the execute point does have all these ads, but that might cause problems with some of our monitoring. I assume some of this can be changed via some simple configuration.

Thanks
Alec

On Wed, Oct 29, 2025 at 10:12 AM Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
On 10/29/25 01:10, Carles Acosta wrote:
> Hi Alec,
>
> We found a similar issue, although it doesn't seem to be exactly the
> same as yours. In our case, it was caused by having the -not-nested
> option in GPU_DISCOVERY_EXTRA and STARTER_HIDE_GPU_DEVICES set to
> True. When we removed the -not-nested option, everything worked correctly.
>
> Do you have something similar in your configuration? If you set
> STARTER_HIDE_GPU_DEVICES to False, do your jobs run and detect the GPU
> properly?
>
In addition to what Carles said, htcondor is designed to give each job a
new cgroup, even if the previous job in that slot would have had the
same constraints, so I'm interested to hear if STARTER_HIDE_GPU_DEVICES
= false fixes the immediate problem.

-greg

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/