Hi,

We run an HTCondor cluster and recently noticed that cgroup accounting is missing. Our setup:
HTCondor: $CondorVersion: 23.0.6 2024-03-14 BuildID: 720565 PackageID: 23.0.6-1 $ $CondorPlatform: x86_64_AlmaLinux9 $

1. We are using Rocky 9 on the workers.
2. CgroupV2 is mounted on the workers.
3. The CgroupV2 controllers file has the list: "cpuset cpu io memory hugetlb pids rdma misc"
4. HTCondor is configured to use cgroups:
   BASE_CGROUP = system.slice/condor.service
   CGROUP_MEMORY_LIMIT_POLICY = hard
   RESERVED_MEMORY = 2048
5. I can see the "condor.service" directory under "/sys/fs/cgroup/system.slice".
6. HTCondor is inheriting the parent controllers properly: I see the "cgroup.controllers" file, and it has the same list of controllers as the parent (above). However, the "cgroup.subtree_control" file is empty (the parent's file has the list of controller names, so this is not getting set or inherited properly).
7. According to the HTCondor documentation (https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#cgroup-based-process-tracking), once BASE_CGROUP is defined, every condor job gets a dedicated directory under the BASE_CGROUP path for cgroup accounting. When jobs are submitted, I see subdirectories such as "condor_var_lib_condor_execute_slot1_7@hostname". However, the "cgroup.controllers" file is empty in these subdirectories, so they are somehow not inheriting from the parent. Similarly, the "cgroup.subtree_control" file is also empty.
8. We also added "CREATE_CGROUP_WITHOUT_ROOT = True" to our HTCondor config and restarted the condor services, without luck.
9. Also, from the starter log:
   "StarterLog.slot1_1:08/15/24 14:21:09 (pid:3758318) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/system.slice/condor.service/cgroup.subtree_control: Device or resource busy"
   HTCondor seems to be hitting the "no internal processes" rule (https://unix.stackexchange.com/questions/680167/ebusy-when-trying-to-add-process-to-cgroup-v2; https://manpath.be/f35/7/cgroups#L557).
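In case it helps anyone reproduce this, the checks in points 5 to 9 can be sketched roughly as below. This is only an illustrative script, not something HTCondor ships: the function names (`inspect_cgroup`, `blocks_subtree_control`, `report`) are made up here, and it assumes nothing beyond the standard cgroup v2 interface files "cgroup.controllers", "cgroup.subtree_control" and "cgroup.procs". The "blocked" flag is a simplification of the "no internal processes" rule: it only reports whether the cgroup itself has member processes, which is the usual reason a write to "cgroup.subtree_control" fails with "Device or resource busy".

```python
from pathlib import Path


def read_tokens(path):
    """Return the whitespace-separated tokens of a cgroup interface file.

    Returns an empty list if the file is missing or unreadable.
    """
    try:
        return Path(path).read_text().split()
    except OSError:
        return []


def inspect_cgroup(cgroup_dir):
    """Collect the cgroup v2 interface files relevant to controller delegation."""
    d = Path(cgroup_dir)
    return {
        "path": str(d),
        "controllers": read_tokens(d / "cgroup.controllers"),
        "subtree_control": read_tokens(d / "cgroup.subtree_control"),
        "procs": read_tokens(d / "cgroup.procs"),
    }


def blocks_subtree_control(info):
    """Simplified 'no internal processes' check (an assumption of this sketch).

    A non-root cgroup that still has processes of its own cannot have
    controllers enabled in its cgroup.subtree_control.
    """
    return len(info["procs"]) > 0


def report(base_cgroup):
    """Inspect the base cgroup and each direct child (for example per-job directories)."""
    base = Path(base_cgroup)
    rows = [inspect_cgroup(base)]
    if base.is_dir():
        rows += [inspect_cgroup(c) for c in sorted(base.iterdir()) if c.is_dir()]
    for info in rows:
        info["blocks_subtree_control"] = blocks_subtree_control(info)
    return rows
```

On a worker, one would call report("/sys/fs/cgroup/system.slice/condor.service") and print the returned rows. If the first row shows a non-empty "procs" list together with an empty "subtree_control", that would be consistent with the EBUSY error in point 9.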
Any help on resolving this is much appreciated!

--
Sanjay Kumar Srikakulam
Bioinformatics Group
Department of Computer Science
University of Freiburg
Georges-Köhler-Allee 079
D-79110 Freiburg

European Galaxy Team
https://galaxyproject.eu
https://usegalaxy.eu