
Re: [HTCondor-users] HTCondor Cgroups V2 configuration issues



Hey Matthias,

Thank you for sharing! I thought of something similar to your script as a "quick fix" to resolve the problem temporarily.

Clarification:

The "cgroup.subtree_control" files under "/sys/fs/cgroup/" and "/sys/fs/cgroup/system.slice" are populated correctly.

Our BASE_CGROUP = system.slice/condor.service

Basically:

/sys/fs/cgroup/
    ├── cgroup.controllers
    ├── cgroup.subtree_control
    └── system.slice/
        ├── cgroup.controllers
        ├── cgroup.subtree_control
        └── condor.service/
            ├── cgroup.controllers
            ├── cgroup.subtree_control (empty)
            └── <HTCondor jobs/subgroups>/
                ├── cgroup.controllers (empty)
                └── cgroup.subtree_control (empty)

I hope this clarifies my question. I am not sure why HTCondor is not inheriting the parent "cgroup.subtree_control" correctly from "system.slice"; this is probably why the job/subgroup-specific directories are not configured properly. I will set up a test instance and see whether the "quick fix" works for me. I hope someone has a fix for our problem.
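
To compare the controller files at each level of the tree above, a small helper like the following could be used. This is only a sketch: the function name show_cgroup_chain is made up for illustration, and it just prints file contents without changing anything.

```shell
#!/bin/bash
# Print cgroup.controllers and cgroup.subtree_control for the cgroup root
# and for every level down to the given relative path.
# show_cgroup_chain is a hypothetical helper name, not an HTCondor tool.
show_cgroup_chain() {
    local root="$1" rel="$2"
    local dir="$root" part file content
    local -a parts
    IFS='/' read -r -a parts <<< "$rel"
    # The leading "" makes the first iteration report the root itself.
    for part in "" "${parts[@]}"; do
        if [ -n "$part" ]; then
            dir="$dir/$part"
        fi
        for file in cgroup.controllers cgroup.subtree_control; do
            if [ -r "$dir/$file" ]; then
                content=$(cat "$dir/$file")
                printf '%s: %s\n' "$dir/$file" "${content:-(empty)}"
            else
                printf '%s: (missing)\n' "$dir/$file"
            fi
        done
    done
}
```

On a worker it would be called as: show_cgroup_chain /sys/fs/cgroup system.slice/condor.service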


On 8/15/2024 5:18 PM, Matthias Schnepf wrote:
Hi,

I'm not sure why your "cgroup.subtree_control" file is empty at point 6, or what manages it (condor or systemd, I think).
We have a similar problem: the cgroup controllers do not get set correctly.
I hope someone else has an idea how to fix your/our problem with the empty "cgroup.subtree_control" file.

But here is the "quick fix" we currently use.
We use the development version of condor (23.7.2) on RHEL8.
Our condor settings for cgroup v2 are:

BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = custom
CGROUP_HARD_MEMORY_LIMIT_EXPR = 2 * Target.RequestMemory
CGROUP_LOW_MEMORY_LIMIT = 0.75 * Target.RequestMemory

The job cgroups are created in /sys/fs/cgroup/htcondor. We set the cgroup.subtree_control file via a cronjob at boot time.


#!/bin/bash

# Enable the controllers for children of the root cgroup.
echo "+cpu +cpuset +memory +pids" >> /sys/fs/cgroup/cgroup.subtree_control

# Create the HTCondor base cgroup if it does not exist yet.
cgroup_name="/sys/fs/cgroup/htcondor"
mkdir -p "${cgroup_name}"

# Enable the same controllers for the job cgroups below it.
echo "+cpu +cpuset +memory +pids" >> "${cgroup_name}/cgroup.subtree_control"

With that, the CPU, memory, and pids controllers are enabled for the htcondor cgroup and its jobs/subgroups, so condor sets the correct memory limits and CPU weights and monitors memory usage.
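
To confirm after boot that the controllers were really enabled, a check along these lines might help. It is a sketch only: missing_controllers is a hypothetical helper name, and it assumes the controller list is read from the "cgroup.subtree_control" file of the directory it is given.

```shell
#!/bin/bash
# Print every wanted controller that is NOT listed in the
# cgroup.subtree_control file of the given cgroup directory.
# missing_controllers is a hypothetical helper name.
missing_controllers() {
    local dir="$1"
    shift
    local enabled want
    # Pad with spaces so "cpu" does not match inside "cpuset".
    enabled=" $(cat "$dir/cgroup.subtree_control" 2>/dev/null) "
    for want in "$@"; do
        case "$enabled" in
            *" $want "*) ;;
            *) printf '%s\n' "$want" ;;
        esac
    done
}
```

For the setup above it would be called as: missing_controllers /sys/fs/cgroup/htcondor cpu cpuset memory pids
No output means all four controllers are enabled for the job cgroups.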

Best regards,

Matthias


On 8/15/24 4:47 PM, Sanjay Kumar Srikakulam wrote:
Hi,

We run an HTCondor cluster and recently noticed that the Cgroups accounting is missing. Our setup:

HTCondor:

$CondorVersion: 23.0.6 2024-03-14 BuildID: 720565 PackageID: 23.0.6-1 $
$CondorPlatform: x86_64_AlmaLinux9 $

1. We are using Rocky 9 on workers
2. CgroupV2 is mounted on the workers
3. The CgroupV2 controllers file has the list: "cpuset cpu io memory hugetlb pids rdma misc"
4. HTCondor is configured to use CGroups:

BASE_CGROUP = system.slice/condor.service
CGROUP_MEMORY_LIMIT_POLICY = hard
RESERVED_MEMORY = 2048

5. I can see the "condor.service" directory under "/sys/fs/cgroup/system.slice"
6. HTCondor is inheriting the parent controllers properly: I see the "cgroup.controllers" file, and it has the same list of controllers as the parent (above). However, the "cgroup.subtree_control" file is empty (the parent's file lists the controller names, so this one is not getting populated or inherited properly).
7. As per the HTCondor doc (https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#cgroup-based-process-tracking), once BASE_CGROUP is defined, every condor job gets a dedicated directory under the BASE_CGROUP path for cgroup accounting. When jobs are submitted, I see subdirectories such as "condor_var_lib_condor_execute_slot1_7@hostname". However, the "cgroup.controllers" file is empty in these subdirectories and is somehow not inheriting from the parent. Similarly, the "cgroup.subtree_control" file is also empty.

8. We also added "CREATE_CGROUP_WITHOUT_ROOT = True" to our HTCondor config and restarted the condor services, without luck.
9. Also, the starter log shows: "StarterLog.slot1_1:08/15/24 14:21:09 (pid:3758318) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/system.slice/condor.service/cgroup.subtree_control: Device or resource busy". HTCondor seems to be hitting the "no internal processes" rule (https://unix.stackexchange.com/questions/680167/ebusy-when-trying-to-add-process-to-cgroup-v2; https://manpath.be/f35/7/cgroups#L557).
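
Under the cgroup v2 "no internal processes" rule, a non-root cgroup cannot enable resource controllers in its "cgroup.subtree_control" while processes are attached to that cgroup directly, and the write fails with "Device or resource busy". Whether that applies here can be checked by counting the PIDs in the cgroup's "cgroup.procs" file. The sketch below does only that; the function names are hypothetical, and it does not move any processes.

```shell
#!/bin/bash
# Count the processes attached directly to a cgroup directory.
# The content is read instead of testing the file size, because
# cgroupfs files report a size of 0 even when they have content.
# Both function names are hypothetical.
count_member_processes() {
    local dir="$1" n
    n=$(grep -c '^[0-9]' "$dir/cgroup.procs" 2>/dev/null) || true
    printf '%s\n' "${n:-0}"
}

# Return non-zero if the cgroup holds processes of its own, which would
# make a write to its cgroup.subtree_control fail with EBUSY.
check_no_internal_processes() {
    local dir="$1" n
    n=$(count_member_processes "$dir")
    if [ "$n" -gt 0 ]; then
        echo "$dir holds $n process(es) directly"
        return 1
    fi
    echo "$dir has no member processes"
}
```

For the setup above it would be called as: check_no_internal_processes /sys/fs/cgroup/system.slice/condor.service
If the count is non-zero (for example because the condor daemons started by systemd live in that cgroup), the rule would explain the error in point 9.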

Any help on resolving this is much appreciated!





_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
-- 
Sanjay Kumar Srikakulam
Bioinformatics Group
Department of Computer Science
University of Freiburg
Georges-Köhler-Allee 079
D-79110 Freiburg

European Galaxy Team
https://galaxyproject.eu
https://usegalaxy.eu
