Hi Thomas,
thanks for reaching out!
I meant the cgroup restrictions that HTCondor itself imposes - CPU/memory limits - which usually also include device access restrictions (see STARTER_HIDE_GPU_DEVICES: https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#STARTER_HIDE_GPU_DEVICES).
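If I read the documentation correctly, hiding the unassigned GPU devices comes down to a single knob on the execute node; a minimal sketch, assuming otherwise default cgroup settings:

    # HTCondor execute-node configuration (sketch, not our full config)
    # hide GPU devices that are not assigned to the slot from the job
    STARTER_HIDE_GPU_DEVICES = True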
HTCondor seems to fail to set up the cgroup for the sshd process inside the job's slice, and thus these restrictions don't apply:
> 05/21/25 14:30:21 About to exec /usr/sbin/sshd -i -e -f /raid/condor/lib/condor/execute/dir_203901/.condor_ssh_to_job_1/sshd_config
> 05/21/25 14:30:21 ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy
> 05/21/25 14:30:21 Creating cgroup system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd for pid 204072
> 05/21/25 14:30:21 Successfully moved procid 204072 to cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs
> 05/21/25 14:30:21 Error setting cgroup memory limit of 107374182400 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
> 05/21/25 14:30:21 Error setting cgroup swap limit of 107374182400 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
> 05/21/25 14:30:21 Error setting cgroup cpu weight of 1200 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
> 05/21/25 14:30:21 Error enabling per-cgroup oom killing: 2 (No such file or directory)
> 05/21/25 14:30:21 cgroup v2 could not attach gpu device limiter to cgroup: Operation not permitted
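In case it helps with debugging, this is roughly how I'd check what actually happened to that sshd process; the pid is taken from the log above, the redacted part of the cgroup path is shortened to "...", and $JOBCG is just a placeholder I'm using here:

    # placeholder for the job's cgroup path from the log above (redacted part shortened)
    JOBCG=/sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@...
    # which cgroup the forked sshd (pid 204072) actually lives in
    cat /proc/204072/cgroup
    # controllers enabled for children of the job cgroup; if memory/cpu are missing,
    # that would match the "No such file or directory" errors when the limits are written
    cat "$JOBCG/cgroup.subtree_control"
    # processes sitting directly in the job cgroup; a non-empty list could explain the
    # "Device or resource busy" on the subtree_control write (no-internal-processes rule)
    cat "$JOBCG/cgroup.procs"
    # eBPF device programs attached to the sshd sub-cgroup (needs bpftool, run as root)
    bpftool cgroup show "$JOBCG/sshd"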
Any ideas what might be causing this?
Thanks!
- Joachim
On Tuesday, 27 May 2025, 09:47:04 Central European Summer Time, Thomas Hartmann wrote:
> Hi Joachim,
>
> > - condor_ssh_to_job leads to cgroup errors - which allows anything done
> > here to escape the restrictions (e.g. I can see all GPUs with nvidia-smi
> > here..) - I haven't found a difference here whether I used apptainer-
> > suid or not.
>
> in principle, cgroups are not necessarily handled by
> apptainer/singularity, which deal primarily with namespaces.
>
> where do you restrict cgroups with respect to GPU(?) resources, i.e., what
> controller do you use?
> If you use drop-ins to the condor systemd unit, these do not
> necessarily seem to be propagated to the job cgroup if you keep them separated.
> I.e., drop-ins affecting cgroup resources work on the condor.service
> slice, but depending on your `BASE_CGROUP` setting in the Condor config, this
> is a separate slice that does not inherit from the systemd service
> unit's slice.
>
> Cheers,
> Thomas
>
--
Joachim Meyer
HPC Coordination & Support
Universität des Saarlandes
FR Informatik | HPC
Postal address: Postfach 15 11 50 | 66041 Saarbrücken
Visitor address: Campus E1 3 | Room 4.03
66123 Saarbrücken
T: +49 681 302-57522
jmeyer@xxxxxxxxxxxxxxxxxx
www.uni-saarland.de