Hi Joachim,

ah, that is what you mean. That should not be a problem, I guess.

The cgroup slice for the ssh_to_job is a child of the job slice within the cgroup hierarchy. In the cgroup logic, such a child slice is limited to the resources of its parent and can only get resources up to its parent's total or less.

E.g., the relative CPU share of such an ssh slice is only a fraction of the job slice's share. Let's say the job slice nominally has 6% of the node's overall CPU time. If we then give the ssh child slice a 50% CPU share, that is just 50% of its parent's 6%, i.e. ~3% of the node's total at most (assuming that there are no idle cycles etc.).
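
The arithmetic above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `effective_share` is a hypothetical helper, the 6% and 50% figures are the example values from the paragraph above, and real cgroup v2 CPU weights are relative to sibling cgroups under contention rather than fixed percentages.

```python
def effective_share(*fractions):
    """Multiply the per-level fractions along a cgroup path (root -> leaf)."""
    share = 1.0
    for fraction in fractions:
        share *= fraction
    return share


# Example values from the text, not measured numbers.
job_share = 0.06  # job slice: 6% of the node's overall CPU time
ssh_share = 0.50  # ssh child slice: 50% of its parent's share

ssh_of_node = effective_share(job_share, ssh_share)
print(f"ssh slice share of the node: {ssh_of_node:.1%}")
```

The child's share is always a product with its parent's fraction, so it can never exceed what the job slice itself was given.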
Cheers,
  Thomas

On 27/05/2025 10.32, Joachim Meyer wrote:
Hi Thomas,

thanks for reaching out!

I meant the cgroup restrictions that HTCondor itself imposes (CPU/memory limits), which usually also include restrictions on the devices cgroup (https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#STARTER_HIDE_GPU_DEVICES).

HTCondor seems to fail at moving the sshd process into the job's cgroup slice, and thus these restrictions don't apply:

> 05/21/25 14:30:21 About to exec /usr/sbin/sshd -i -e -f /raid/condor/lib/condor/execute/dir_203901/.condor_ssh_to_job_1/sshd_config
> 05/21/25 14:30:21 ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy
> 05/21/25 14:30:21 Creating cgroup system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd for pid 204072
> 05/21/25 14:30:21 Successfully moved procid 204072 to cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs
> 05/21/25 14:30:21 Error setting cgroup memory limit of 107374182400 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
> 05/21/25 14:30:21 Error setting cgroup swap limit of 107374182400 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
> 05/21/25 14:30:21 Error setting cgroup cpu weight of 1200 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
> 05/21/25 14:30:21 Error enabling per-cgroup oom killing: 2 (No such file or directory)
> 05/21/25 14:30:21 cgroup v2 could not attach gpu device limiter to cgroup: Operation not permitted

Any ideas what might be causing this?

Thanks!
- Joachim

On Tuesday, 27 May 2025, 09:47:04 Central European Summer Time, Thomas Hartmann wrote:
> Hi Joachim,
>
> > - condor_ssh_to_job leads to cgroup errors - which allows anything done
> > here to escape the restrictions (e.g. I can see all GPUs with nvidia-smi
> > here..) - I haven't found a difference here whether I used
> > apptainer-suid or not.
>
> in principle, cgroups are not necessarily handled by
> apptainer/singularity, which deal primarily with the namespaces.
>
> where do you restrict cgroups wrt GPU(?) resources, i.e., what
> controller do you use?
> If you use drop-ins to the condor systemd unit, these do not seem to
> necessarily be propagated to the job cgroup if you keep them separated.
> I.e., drop-ins affecting cgroup resources work on the condor.service
> slice, but depending on your `BASE_CGROUP` setting in the Condor config,
> this is a separate slice that does not inherit from the systemd service
> unit's slice.
>
> Cheers,
>   Thomas
>

--
*Joachim Meyer*
HPC Coordination & Support
Universität des Saarlandes
/FR Informatik | HPC/
Postal address: Postfach 15 11 50 | 66041 Saarbrücken
Visitor address: Campus E1 3 | Room 4.03, 66123 Saarbrücken
T: +49 681 302-57522
jmeyer@xxxxxxxxxxxxxxxxxx
www.uni-saarland.de
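
One way to check where a process such as the sshd from the log actually ended up is to read `/proc/<pid>/cgroup` and compare it against the job's cgroup path. The following is a minimal sketch, not HTCondor code: the helper names and the sample path `job_slot1_1` are hypothetical, and it assumes a pure cgroup v2 setup, where the file holds a single `0::<path>` line.

```python
def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from the contents of /proc/<pid>/cgroup, or None."""
    for line in proc_cgroup_text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        # The unified (v2) hierarchy has id 0 and an empty controller field.
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def is_inside(path, parent):
    """True if `path` equals `parent` or lies below it in the hierarchy."""
    parent = parent.rstrip("/")
    return path == parent or path.startswith(parent + "/")


# Hypothetical sample; on a real node use open(f"/proc/{pid}/cgroup").read()
sample = "0::/system.slice/htcondor/job_slot1_1/sshd\n"
job_cgroup = "/system.slice/htcondor/job_slot1_1"

print(is_inside(cgroup_v2_path(sample), job_cgroup))
```

This only shows whether the process sits below the job cgroup. It does not tell whether the memory, CPU or device limits were applied there.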