Hey Matthias,
Thank you for sharing! I thought of something similar to your script as a "quick fix" to resolve the problem temporarily.
Clarification:
The "cgroup.subtree_control" under "/sys/fs/cgroup/" and "/sys/fs/cgroup/system.slice" are created correctly.
Our BASE_CGROUP = system.slice/condor.service
Basically:
/sys/fs/cgroup/
├── cgroup.controllers
├── cgroup.subtree_control
└── system.slice/
    ├── cgroup.controllers
    ├── cgroup.subtree_control
    └── condor.service/
        ├── cgroup.controllers
        ├── cgroup.subtree_control (empty)
        └── <HTCondor jobs/subgroups>/
            ├── cgroup.controllers (empty)
            └── cgroup.subtree_control (empty)
I hope this adds more clarity to my question. I am not sure why HTCondor is not inheriting the parent "cgroup.subtree_control" correctly from "system.slice"; this is probably why the job/subgroup-specific directories are not getting configured properly. I will set up a test instance and see if the "quick fix" works for me. I hope someone has a fix for our problem.
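To compare the two files at every level of the tree above in one go, something like the following could help. It is only a sketch, assuming bash; the function name show_cgroup_chain is made up for illustration, and it only reads the files, it does not change anything.

```shell
#!/bin/bash
# Print cgroup.controllers and cgroup.subtree_control for every level from
# a cgroup root down to a relative path below it.
# Usage: show_cgroup_chain ROOT RELATIVE_PATH
show_cgroup_chain() {
    local dir="$1" rel="$2" part file content
    local -a parts
    # Split the relative path on "/" into its components.
    IFS=/ read -r -a parts <<< "$rel"
    # The leading empty component makes the loop visit ROOT itself first.
    for part in "" "${parts[@]}"; do
        [ -n "$part" ] && dir="$dir/$part"
        for file in cgroup.controllers cgroup.subtree_control; do
            if [ -r "$dir/$file" ]; then
                content=""
                # The kernel writes the controller list on a single line.
                read -r content < "$dir/$file" || true
                printf '%s/%s: %s\n' "$dir" "$file" "${content:-(empty)}"
            else
                printf '%s/%s: (not readable)\n' "$dir" "$file"
            fi
        done
    done
}
```

For the layout in this thread, the call would be show_cgroup_chain /sys/fs/cgroup system.slice/condor.service. The first level that prints "(empty)" for cgroup.subtree_control is where the controllers stop being passed on to the children.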
Hi,
I'm not sure why the "cgroup.subtree_control" file in your point 6 is empty, or what manages it (condor or systemd, I think).
We have a similar problem: the cgroup controllers do not get set correctly.
I hope someone else has an idea how to fix your/our problem with the empty "cgroup.subtree_control" file.
In the meantime, here is the "quick fix" we currently use.
We use the development version of condor (23.7.2) and RHEL8.
Our condor settings for cgroup v2 are:
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = custom
CGROUP_HARD_MEMORY_LIMIT_EXPR = 2 * Target.RequestMemory
CGROUP_LOW_MEMORY_LIMIT = 0.75 * Target.RequestMemory
The job cgroups are created in /sys/fs/cgroup/htcondor. We set the cgroup.subtree_control file via a cronjob at boot time.
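#!/bin/bash
# Enable the controllers for the children of the root cgroup.
echo "+cpu +cpuset +memory +pids" >> /sys/fs/cgroup/cgroup.subtree_control
# Create the htcondor cgroup (our BASE_CGROUP) if it does not exist yet.
cgroup_name="/sys/fs/cgroup/htcondor"
mkdir -p "${cgroup_name}"
# Enable the same controllers for the job cgroups below htcondor.
echo "+cpu +cpuset +memory +pids" >> "${cgroup_name}/cgroup.subtree_control"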
With that, the CPU, memory, and pids controllers are enabled for the htcondor cgroup and its jobs/subgroups, so condor sets the correct memory limits and CPU weights and monitors the memory.
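One way to confirm that the controllers actually reached a job cgroup is to compare its "cgroup.controllers" file against the list the script enables. A minimal sketch, assuming bash; the function name missing_controllers is hypothetical and not part of condor:

```shell
#!/bin/bash
# Print each requested controller that is NOT listed in DIR/cgroup.controllers.
# Usage: missing_controllers DIR CONTROLLER...
missing_controllers() {
    local dir="$1" available="" ctrl
    shift
    # An unreadable or empty file leaves "available" empty, so every
    # requested controller is reported as missing.
    read -r available < "$dir/cgroup.controllers" || true
    for ctrl in "$@"; do
        # Pad with spaces so that "cpu" does not match inside "cpuset".
        case " $available " in
            *" $ctrl "*) ;;
            *) printf '%s\n' "$ctrl" ;;
        esac
    done
}
```

Called as missing_controllers /sys/fs/cgroup/htcondor cpu cpuset memory pids, it prints nothing when all four controllers are available there. The same call can be pointed at one of the job subdirectories once a job is running.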
Best regards,
Matthias
On 8/15/24 4:47 PM, Sanjay Kumar Srikakulam wrote:
Hi,
We run an HTCondor cluster and recently noticed we are missing the Cgroups accounting. Our setup:
HTCondor:
$CondorVersion: 23.0.6 2024-03-14 BuildID: 720565 PackageID: 23.0.6-1 $
$CondorPlatform: x86_64_AlmaLinux9 $
1. We are using Rocky 9 on workers
2. CgroupV2 is mounted on the workers
3. The CgroupV2 "cgroup.controllers" file has the list: "cpuset cpu io memory hugetlb pids rdma misc"
4. HTCondor is configured to use CGroups:
BASE_CGROUP = system.slice/condor.service
CGROUP_MEMORY_LIMIT_POLICY = hard
RESERVED_MEMORY = 2048
5. I can see the "condor.service" directory under "/sys/fs/cgroup/system.slice"
6. HTCondor is inheriting the parent controllers properly: the "cgroup.controllers" file exists and has the same list of controllers as the parent (above). However, the "cgroup.subtree_control" file is empty (the parent's file has the list of controller names, so this is not getting populated or inherited properly).
7. According to the HTCondor docs (https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#cgroup-based-process-tracking), once BASE_CGROUP is defined, every condor job gets a dedicated directory under the BASE_CGROUP path for cgroup accounting. When jobs are submitted, I see subdirectories such as "condor_var_lib_condor_execute_slot1_7@hostname". However, the "cgroup.controllers" file in these subdirectories is empty, so they are somehow not inheriting from the parent. Similarly, the "cgroup.subtree_control" file is also empty.
8. We also added the "CREATE_CGROUP_WITHOUT_ROOT = True" to our HTCondor config and restarted the condor services without luck.
9. Also, from the starter log: "StarterLog.slot1_1:08/15/24 14:21:09 (pid:3758318) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/system.slice/condor.service/cgroup.subtree_control: Device or resource busy". HTCondor seems to be hitting the "no internal processes" rule (https://unix.stackexchange.com/questions/680167/ebusy-when-trying-to-add-process-to-cgroup-v2; https://manpath.be/f35/7/cgroups#L557).
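Under the "no internal processes" rule, a non-root cgroup generally cannot enable domain controllers for its children while it still has processes of its own. A quick way to see whether that applies is to look at "cgroup.procs" in the cgroup whose "cgroup.subtree_control" write failed. The following is only a sketch, assuming bash; both function names are made up for illustration:

```shell
#!/bin/bash
# Print the PIDs that are direct members of a cgroup, one per line.
# Usage: direct_member_pids CGROUP_DIR
direct_member_pids() {
    cat "$1/cgroup.procs" 2>/dev/null
}

# Report whether a cgroup has direct member processes. Returns 1 if it
# does, since enabling controllers for its children may then be refused.
# Usage: check_internal_procs CGROUP_DIR
check_internal_procs() {
    local dir="$1" count
    # grep -c prints 0 (and exits non-zero) when there are no PIDs.
    count="$(direct_member_pids "$dir" | grep -c '[0-9]')"
    if [ "$count" -gt 0 ]; then
        echo "$dir has $count direct member process(es); writing to cgroup.subtree_control may fail with EBUSY"
        return 1
    fi
    echo "$dir has no direct member processes"
}
```

For the setup in this thread, the call would be check_internal_procs /sys/fs/cgroup/system.slice/condor.service. If it lists PIDs, those processes are sitting directly in the cgroup that is also supposed to hold the per-job child cgroups.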
Any help on resolving this is much appreciated!
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
--
Sanjay Kumar Srikakulam
Bioinformatics Group
Department of Computer Science
University of Freiburg
Georges-Köhler-Allee 079
D-79110 Freiburg

European Galaxy Team
https://galaxyproject.eu
https://usegalaxy.eu