[HTCondor-users] cgroups and OOM killers
- Date: Thu, 6 Jul 2023 17:01:10 +0200
- From: Mary Hester <maryh@xxxxxxxxx>
- Subject: [HTCondor-users] cgroups and OOM killers
Hello HTCondor experts,
We're seeing some interesting behaviour with user jobs on our local
HTCondor cluster, running version 9.8.
Basically, if a job in a cgroup goes far enough over its memory limit
that the container can no longer allocate the accounted memory needed
for basic functioning of the system as a whole (e.g. to hold its
cmdline), the container affects the whole system and brings it down.
This is a worse outcome than condor being unable to fully get the
status or failure reason for any single container.
And since oom_kill_disable is set to 1, the kernel does not intervene,
and hence the entire system grinds to a halt. It is preferable to lose
state for a single job, have the kernel do its thing, and have the
system survive. For now, the only workaround is to run

    for i in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do echo 0 > $i ; done

in a loop to ensure the sysadmin-intended settings are applied to the
condor-managed cgroups.
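For reference, a slightly more defensive form of that workaround might
look like the sketch below. It assumes cgroup v1 with the memory
controller mounted at the path given above, where writing 0 to
memory.oom_control re-enables the kernel OOM killer for that cgroup. The
function name enable_oom_killer and its optional root-directory argument
are made up for illustration and are not part of HTCondor.

```shell
#!/bin/sh
# Sketch only: re-enable the kernel OOM killer on condor-managed cgroups.
# enable_oom_killer is a hypothetical helper, not an HTCondor tool.
# Assumes cgroup v1; the default root is the path quoted in this message.
enable_oom_killer() {
    root="${1:-/sys/fs/cgroup/memory/htcondor}"
    for f in "$root"/condor*/memory.oom_control; do
        # If the glob matched nothing, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        # Writing 0 sets oom_kill_disable to 0 for that cgroup.
        if ! echo 0 > "$f"; then
            echo "could not write $f" >&2
        fi
    done
}
```

It would still have to be called repeatedly (from cron or a loop with a
sleep, at whatever interval suits the site), since newly created job
cgroups start out with the setting condor applies.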
Is there a configuration setting that makes condor set oom_kill_disable
to 0? Shouldn't this be an option, or was there another reason for
oom_kill_disable being set to 1?
Thanks,
Mary