
[HTCondor-users] ENOENT writing to cgroup.subtree_control, but file exists



Hi all,

We just discovered that HTCondor consistently fails to create cgroups (v2) for jobs in our cluster. However, I'm at a loss as to what is causing this.

The StarterLog [0] first reports the htcondor cgroup as writeable, but the starter then fails to write to it with ENOENT, which seems to cause all subsequent cgroup setup to fail as well. When I check the cgroup.subtree_control file in the htcondor root cgroup, it exists and has for ages [1].

Is the error report masking some other error that makes this fail? Are there any obvious steps we might have missed when preparing cgroups?

We're on RHEL8 (yes, we missed the RHEL9 train) and are running HTCondor 23.7.2. It looks like the relevant code [2] hasn't been changed in 23.8.1, so we haven't considered updating as a mitigation.

Cheers,
Max

[0] /var/log/condor/StarterLog.slot1_14
07/08/24 04:04:38 (pid:738500) (D_ALWAYS) Checking to see if htcondor is a writeable cgroup
07/08/24 04:04:38 (pid:738500) (D_ALWAYS)     Cgroup /htcondor is useable
...
07/08/24 04:04:38 (pid:738504) (D_ALWAYS) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/htcondor/cgroup.subtree_control: No such file or directory
07/08/24 04:04:38 (pid:738504) (D_ALWAYS) Error setting cgroup cpu weight of 800 in cgroup /sys/fs/cgroup/htcondor/condor_tmp_condor_execute_slot1_14@xxxxxxxxxxxxxxxxxxxxx: No such file or directory
07/08/24 04:04:38 (pid:738504) (D_ALWAYS) Error enabling per-cgroup oom killing: 2 (No such file or directory)

[1] ls -l /sys/fs/cgroup/htcondor/cgroup.subtree_control
-rw-r--r-- 1 root root 0 Jun 21 18:16 /sys/fs/cgroup/htcondor/cgroup.subtree_control

[2] https://github.com/htcondor/htcondor/blob/8cf018d14d7e198ffb1f3535326a3d8a22b52186/src/condor_utils/proc_family_direct_cgroup_v2.cpp#L181-L198
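
A possible follow-up check, sketched under assumptions: as far as I understand the cgroup v2 interface, a write to cgroup.subtree_control can return ENOENT when the controller being enabled is not listed in that cgroup's cgroup.controllers, even though the file itself exists. Whether that is what happens here is not confirmed by the log. The helper names below and the list of wanted controllers are hypothetical and not taken from the HTCondor source.

```python
from pathlib import Path


def read_controllers(path):
    """Return the set of controller names listed in a cgroup v2 control file."""
    # Files such as cgroup.controllers hold a space-separated list on one line.
    return set(Path(path).read_text().split())


def missing_controllers(cgroup_dir, wanted):
    """Return the controllers in `wanted` that cgroup_dir does not offer.

    `wanted` is an assumption supplied by the caller; it is not read from
    HTCondor's configuration or source.
    """
    available = read_controllers(Path(cgroup_dir) / "cgroup.controllers")
    return sorted(set(wanted) - available)
```

Calling `missing_controllers("/sys/fs/cgroup/htcondor", ["cpu", "memory", "pids"])` on an affected node would list any of those (assumed) controllers that the parent cgroup has not delegated to /htcondor. An empty list would suggest looking elsewhere for the cause.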
