Re: [HTCondor-users] HTCondor Cgroups V2 configuration issues
- Date: Thu, 15 Aug 2024 17:18:03 +0200
- From: Matthias Schnepf <matthias.schnepf@xxxxxxx>
- Subject: Re: [HTCondor-users] HTCondor Cgroups V2 configuration issues
Hi,
I'm not sure why the "cgroup.subtree_control" file at your point 6 is
empty or what manages it (condor or systemd, I think).
We have a similar problem: the cgroup controllers do not get
set correctly.
I hope someone else has an idea how to fix your/our problem with the empty
"cgroup.subtree_control" file.
In the meantime, here is the "quick fix" we currently use.
We use the development version of condor (23.7.2) and RHEL8.
Our condor settings for cgroup v2 are:
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = custom
CGROUP_HARD_MEMORY_LIMIT_EXPR = 2 * Target.RequestMemory
CGROUP_LOW_MEMORY_LIMIT = 0.75 * Target.RequestMemory
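To make the two custom expressions above concrete, here is a small worked example for one job. It is only a sketch: the request of 4096 is a hypothetical value, and it assumes the resulting limits are expressed in the same unit as RequestMemory.

```shell
#!/bin/bash
# Hypothetical job request (same unit as RequestMemory); not from a real job.
request_memory=4096

# CGROUP_HARD_MEMORY_LIMIT_EXPR = 2 * Target.RequestMemory
hard_limit=$((2 * request_memory))

# CGROUP_LOW_MEMORY_LIMIT = 0.75 * Target.RequestMemory
# (integer arithmetic: multiply by 3, then divide by 4)
low_limit=$((request_memory * 3 / 4))

echo "hard=${hard_limit} low=${low_limit}"
```

For this request the hard limit is twice the request (8192) and the low limit is three quarters of it (3072).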
The job cgroups are created in /sys/fs/cgroup/htcondor. We set the
cgroup.subtree_control file via a cronjob at boot time.
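#!/bin/bash
# Enable the controllers for the children of the root cgroup.
echo +cpu +cpuset +memory +pids >> /sys/fs/cgroup/cgroup.subtree_control
export cgroup_name="/sys/fs/cgroup/htcondor"
# Create the htcondor cgroup if it does not exist yet.
if [ ! -d "${cgroup_name}" ]; then
    mkdir "${cgroup_name}"
fi
# Enable the same controllers for the job cgroups below htcondor.
echo +cpu +cpuset +memory +pids >> "${cgroup_name}/cgroup.subtree_control"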
With that, the CPU, memory, and pids controllers are enabled for the
htcondor cgroup and its jobs/subgroups. Condor then sets the correct
memory limits and CPU weights, and monitors the memory.
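To check whether the controllers really ended up in a "cgroup.subtree_control" file, something like the following could be used. This is only a sketch, not part of our setup: the function name missing_controllers is made up for illustration, and it just compares the file content against the controller names passed as arguments.

```shell
#!/bin/bash
# Print every controller from the argument list that is NOT listed in the
# given cgroup.subtree_control file; print nothing if all are enabled.
# (missing_controllers is a hypothetical helper name.)
missing_controllers() {
    local control_file="$1"; shift
    local enabled ctrl
    # Pad with spaces so that "cpu" does not match inside "cpuset".
    enabled=" $(tr '\n' ' ' < "${control_file}") "
    for ctrl in "$@"; do
        case "${enabled}" in
            *" ${ctrl} "*) ;;          # controller is enabled
            *) echo "${ctrl}" ;;       # controller is missing
        esac
    done
}

# Usage on a worker node (path as in the script above):
# missing_controllers /sys/fs/cgroup/htcondor/cgroup.subtree_control cpu cpuset memory pids
```

An empty output means all requested controllers are enabled for the subgroups.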
Best regards,
Matthias
On 8/15/24 4:47 PM, Sanjay Kumar Srikakulam wrote:
Hi,
We run an HTCondor cluster and recently noticed that the Cgroups
accounting is missing. Our setup:
HTCondor:
$CondorVersion: 23.0.6 2024-03-14 BuildID: 720565 PackageID: 23.0.6-1 $
$CondorPlatform: x86_64_AlmaLinux9 $
1. We are using Rocky 9 on workers
2. CgroupV2 is mounted on the workers
3. The CgroupV2 controllers file has the list: "cpuset cpu io memory
hugetlb pids rdma misc"
4. HTCondor is configured to use CGroups:
BASE_CGROUP = system.slice/condor.service
CGROUP_MEMORY_LIMIT_POLICY = hard
RESERVED_MEMORY = 2048
5. I can see the "condor.service" directory under
"/sys/fs/cgroup/system.slice"
6. HTCondor is inheriting the parent controllers properly: I see the
"cgroup.controllers" file, and it has the same list of controllers as
the parent (above). However, the "cgroup.subtree_control" file is empty
(the parent's file has the list of controller names, so this is not
getting created or inherited properly)
7. As per the HTCondor doc
(https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#cgroup-based-process-tracking),
once the BASE_CGROUP is defined, there will be a dedicated dir in the
BASE_CGROUP path for every condor job for cgroup accounting. When
jobs are submitted, I see the subdirectories
"condor_var_lib_condor_execute_slot1_7@hostname". However, the
"cgroup.controllers" file is empty in these sub-directories and is
somehow not inheriting from the parent. Similarly, the
"cgroup.subtree_control" file is also empty.
8. We also added the "CREATE_CGROUP_WITHOUT_ROOT = True" to our
HTCondor config and restarted the condor services without luck.
9. Also, from the starter log: "StarterLog.slot1_1:08/15/24 14:21:09
(pid:3758318) ProcFamilyDirectCgroupV2::track_family_via_cgroup error
writing to
/sys/fs/cgroup/system.slice/condor.service/cgroup.subtree_control:
Device or resource busy". HTCondor seems to be hitting the "no internal
processes" rule
(https://unix.stackexchange.com/questions/680167/ebusy-when-trying-to-add-process-to-cgroup-v2;
https://manpath.be/f35/7/cgroups#L557).
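The "no internal processes" rule from point 9 says that a non-root cgroup can only enable controllers in its "cgroup.subtree_control" if it has no member processes of its own. A quick way to see whether a cgroup is in that state is to look at its "cgroup.procs" file. The following is only a sketch; the function name has_internal_processes is made up for illustration, and it only detects the condition, it does not fix it.

```shell
#!/bin/bash
# Return 0 if the given cgroup directory has member processes of its own,
# 1 otherwise. (has_internal_processes is a hypothetical helper name.)
has_internal_processes() {
    local cgroup_dir="$1"
    # cgroupfs files report size 0, so read the content instead of using -s.
    [ -n "$(cat "${cgroup_dir}/cgroup.procs" 2>/dev/null)" ]
}

# Usage on a worker node (path from the starter log above):
# if has_internal_processes /sys/fs/cgroup/system.slice/condor.service; then
#     echo "condor.service has its own processes"
# fi
```

If the check succeeds for the BASE_CGROUP, that would be consistent with the "Device or resource busy" error in the starter log.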
Any help on resolving this is much appreciated!