[HTCondor-users] Condor23/cgroups v2 occasionally busy and/or kernel OOM acting



Hi all,

We have been debugging a user's jobs, which tend to get OOM-killed somewhat randomly. It seems to be cgroups v2 related, i.e., it happens on our EL9/Condor23/cgroups v2 workers [1], where the cgroup mount path is
   /sys/fs/cgroup/htcondor/
with
  BASE_CGROUP = htcondor

Confusingly, on a node evacuated except for this user, an interactive job running the payload (which had been killed in a non-interactive job on the same node before) ran successfully to its end. While the job cgroup and the ssh-to-job sub-cgroup were nominally created, the PIDs seem not to have been added to the sub-cgroup's proc list [2]. More confusingly, the parent cgroup seems to have been the root group (which lacks the memory accounting virtual files) rather than the htcondor group?
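To check which cgroup a given PID actually ended up in, something like the following could be run on the worker. This is only a rough sketch: the helper names (cgroup_v2_path, is_under, pid_cgroup) are made up for illustration, and the check assumes the job cgroups are expected directly under the BASE_CGROUP value "htcondor" from our configuration. It relies on the cgroup v2 entry in /proc/<pid>/cgroup having the form "0::<path>".

```python
from pathlib import Path


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path found in the text of /proc/<pid>/cgroup, or None.

    The unified (v2) hierarchy is the entry with hierarchy ID 0 and an
    empty controller list, i.e. a line of the form "0::<path>".
    """
    for line in proc_cgroup_text.splitlines():
        fields = line.split(":", 2)
        if len(fields) == 3 and fields[0] == "0" and fields[1] == "":
            return fields[2]
    return None


def is_under(path, base="htcondor"):
    """True if the cgroup path lives below /<base> (assumed BASE_CGROUP)."""
    parts = [p for p in path.split("/") if p]
    return bool(parts) and parts[0] == base


def pid_cgroup(pid):
    """Read the cgroup v2 path of a live process (hypothetical helper)."""
    return cgroup_v2_path(Path(f"/proc/{pid}/cgroup").read_text())
```

For example, is_under(pid_cgroup(513051)) would tell whether the sshd process from [2] sits below the htcondor group; a result of "/" from pid_cgroup would mean the process is in the root group.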

Another non-interactive job got OOM'd pretty readily. However, there the job cgroup's proc virtual files were also not writable, or were busy [3]? Maybe some kind of race condition, where the job cgroups are not yet ready during the job's initialization or so?
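One possible reading of the "Device or resource busy" errors is the cgroup v2 "no internal processes" rule from the kernel documentation: a non-root cgroup cannot have member processes and distribute domain controllers (such as memory) to children at the same time, and a cgroup that still has processes or children cannot be removed. Whether that is what happens here is an assumption. A minimal sketch for taking a snapshot of a job cgroup's state follows; the function names are hypothetical, and the path to pass in would be the job cgroup directory below /sys/fs/cgroup/htcondor/.

```python
from pathlib import Path


def parse_flat_keyed(text):
    """Parse a flat-keyed cgroup v2 file (e.g. memory.events) into {key: int}."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def likely_ebusy_causes(pids, subtree_control, children):
    """List states in which cgroup v2 operations are expected to return EBUSY.

    This is a heuristic based on the documented rules, not a kernel query.
    """
    causes = []
    if pids:
        causes.append("has member processes: enabling domain controllers "
                      "in cgroup.subtree_control, or rmdir, may fail")
    if subtree_control:
        causes.append("has controllers enabled for children: writing a PID "
                      "to cgroup.procs may fail")
    if children:
        causes.append("has child cgroups: rmdir may fail")
    return causes


def inspect_cgroup(path):
    """Collect processes, enabled controllers, children and memory events."""
    cg = Path(path)

    def read(name):
        try:
            return (cg / name).read_text()
        except OSError:
            return None  # file missing or unreadable

    procs = read("cgroup.procs")
    subtree = read("cgroup.subtree_control")
    events = read("memory.events")
    exists = cg.is_dir()
    pids = [int(p) for p in procs.split()] if procs else []
    controllers = subtree.split() if subtree else []
    children = sorted(p.name for p in cg.iterdir() if p.is_dir()) if exists else []
    return {
        "exists": exists,
        "pids": pids,
        "subtree_control": controllers,
        "children": children,
        "memory_events": parse_flat_keyed(events) if events is not None else None,
        "likely_ebusy_causes": likely_ebusy_causes(pids, controllers, children),
    }
```

Calling inspect_cgroup() on the job cgroup right after a failed start would show whether leftover processes or sub-cgroups from an earlier job are still present, and whether oom_kill in memory.events has been incremented.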

Cheers,
  Thomas

[1]
Linux batch1552.desy.de 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 15 07:18:13 EST 2024 x86_64 x86_64 x86_64 GNU/Linux

condor-23.0.8-1.el9.x86_64
condor-stash-plugin-6.12.1-1.x86_64
python3-condor-23.0.8-1.el9.x86_64
systemd-252-18.el9.x86_64
systemd-libs-252-18.el9.x86_64
systemd-pam-252-18.el9.x86_64
systemd-rpm-macros-252-18.el9.noarch
systemd-udev-252-18.el9.x86_64

[2]
05/31/24 13:09:28 (pid:512777) Using wrapper /var/lib/condor/util/job_wrapper.sh to exec /usr/sbin/sshd -i -e -f /var/lib/condor/execute/dir_512777/.condor_ssh_to_job_1/sshd_config
05/31/24 13:09:28 (pid:512777) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot2_1@xxxxxxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy
05/31/24 13:09:28 (pid:512777) Error setting cgroup cpu weight of 100 in cgroup /sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot2_1@xxxxxxxxxxxxxxxxx/sshd: No such file or directory
05/31/24 13:09:28 (pid:512777) Error enabling per-cgroup oom killing: 2 (No such file or directory)
05/31/24 13:09:28 (pid:512777) Create_Process succeeded, pid=513051
05/31/24 13:09:28 (pid:512777) ProcFamilyDirectCgroupV2::has_been_oom_killed cannot open /sys/fs/cgroup/memory.events: 2 No such file or directory

[3]
05/31/24 15:25:26 (pid:573820) Using wrapper /var/lib/condor/util/job_wrapper.sh to exec /bin/sh -c sleep' '180' '&&' 'while' 'test' '-d' '${_CONDOR_SCRATCH_DIR}/.condor_ssh_to_job_1;' 'do' '/bin/sleep' '3;' 'done
05/31/24 15:25:26 (pid:573820) ProcFamilyDirectCgroupV2::track_family_via_cgroup error removing cgroup htcondor/condor_var_lib_condor_execute_slot2_1@xxxxxxxxxxxxxxxxx: Device or resource busy
05/31/24 15:25:26 (pid:573820) Error writing procid 573822 to /sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot2_1@xxxxxxxxxxxxxxxxx/cgroup.procs: Device or resource busy
05/31/24 15:25:26 (pid:573820) Create_Process: error tracking family with root 573822 via cgroup htcondor/condor_var_lib_condor_execute_slot2_1@xxxxxxxxxxxxxxxxx
05/31/24 15:25:26 (pid:573820) ProcFamilyDirectCgroupV2::unregister_family error removing cgroup htcondor/condor_var_lib_condor_execute_slot2_1@xxxxxxxxxxxxxxxxx: Device or resource busy
...
condor_var_lib_condor_execute_slot2_1@xxxxxxxxxxxxxxxxx: Device or resource busy
