Hi Joachim,
> - condor_ssh_to_job leads to cgroup errors - which allows anything done here to escape the restrictions (e.g. I can see all GPUs with nvidia-smi here..) - I haven't found a difference here whether I used apptainer-suid or not.
In principle, cgroups are not necessarily handled by Apptainer/Singularity, which deal primarily with namespaces.
Where do you restrict cgroups with respect to GPU(?) resources, i.e., which controller do you use? If you use drop-ins for the condor systemd unit, these do not necessarily get propagated to the job cgroups if you keep them separated. That is, drop-ins affecting cgroup resources act on the condor.service slice, but depending on your `BASE_CGROUP` setting in the Condor config, the jobs live in a separate slice that does not inherit from the systemd service unit's slice.
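One way to check where a shell opened via condor_ssh_to_job actually landed is to compare its cgroup with that of the job's own processes. Below is a minimal Python sketch, assuming the cgroup v2 unified hierarchy (a single `0::/path` line in `/proc/<pid>/cgroup`) and assuming the job cgroups carry the `BASE_CGROUP` name (here `htcondor`, used only as an example) as one path component. The helper names are hypothetical.

```python
def parse_cgroup_v2(text: str) -> str:
    """Extract the cgroup v2 path from the contents of /proc/<pid>/cgroup.

    On the unified hierarchy the relevant line looks like '0::/some/path';
    cgroup v1 lines ('<id>:<controllers>:<path>') are skipped.
    """
    for line in text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return parts[2]
    raise ValueError("no cgroup v2 entry found")


def cgroup_path(pid="self") -> str:
    """Return the cgroup v2 path of a process (default: the calling process)."""
    with open(f"/proc/{pid}/cgroup") as f:
        return parse_cgroup_v2(f.read())


def in_base_cgroup(path: str, base: str = "htcondor") -> bool:
    """True if `base` (hypothetically, your BASE_CGROUP value) is a path component.

    Assumes a single-component base such as 'htcondor'. The comparison is
    component-wise, so '/htcondor2/...' does not match 'htcondor'.
    """
    return base.strip("/") in path.strip("/").split("/")
```

Running `cgroup_path()` inside the ssh_to_job shell and `cgroup_path(<job pid>)` for a job process, then checking both with `in_base_cgroup`, would show whether the shell ended up outside the job's cgroup (and thus outside any GPU restriction applied there).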
Cheers, Thomas