Mailing List Archives
Re: [HTCondor-users] cgroups and OOM killers
- Date: Thu, 6 Jul 2023 10:42:05 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] cgroups and OOM killers
On 7/6/23 10:01, Mary Hester wrote:
Hello HTCondor experts,
We're seeing some interesting behaviour with user jobs on our local
HTCondor cluster, running version 9.8.
Basically, if a job in the cgroup manages to go sufficiently over
memory so that the container cannot allocate accountable memory that
is needed for basic functioning of the system as a whole (e.g. to
hold its cmdline), then the container has an impact on the whole system
and will bring it down. This is a worse condition than condor not
being able to fully get the status/failure reason for any single
specific container. And since oom_kill_disable is set to 1, the kernel
will now not intervene and hence the entire system grinds to a halt.
It is preferable to lose state for a single job, have the kernel do
its thing, and have the system survive. Now, the only workaround is to
run
    for i in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do echo 0 > $i ; done
in a loop to ensure the sysadmin-intended settings are applied to the
condor-managed cgroups.
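For readers who want to try the workaround described above, here is a minimal sketch of the same loop wrapped in a function. The function name enable_oom_killer and the optional base-directory argument are illustrative assumptions, not part of HTCondor. The default path is the one quoted in the message and applies to cgroup v1 only.

```shell
#!/bin/sh
# Re-enable the kernel OOM killer for every cgroup v1 memory cgroup matching
# <base>/condor*. The base directory is an argument (hypothetical convenience)
# so the function can be tried against a scratch directory first.
enable_oom_killer() {
    base=${1:-/sys/fs/cgroup/memory/htcondor}
    for f in "$base"/condor*/memory.oom_control; do
        # An unmatched glob stays literal; skip it rather than create a file.
        [ -e "$f" ] || continue
        # Writing 0 clears oom_kill_disable for that cgroup.
        echo 0 > "$f" || echo "could not update $f" >&2
    done
}
```

Because new job cgroups appear over time, the function would have to be called repeatedly, for example from a `while true; do enable_oom_killer; sleep 10; done` loop run as root. The 10-second interval is an arbitrary choice.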
Hi Mary:
I'm sorry your system is having problems. Perhaps what is happening is
that there is swap enabled on the system, and cgroups are limiting the
amount of physical memory used by the job, and the system is paging
itself to death before the starter can read the OOM message. Can you
try setting
DISABLE_SWAP_FOR_JOB = true
and see if the problem persists?
The reason condor sets oom_kill_disable to true is that the starter
registers to get notified of the OOM event, so that it can know that the
job exited because of an OOM kill. It sounds like perhaps the
system is so overloaded that this event isn't getting delivered or
processed.
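One way to check whether a job cgroup is sitting in this state is to read its memory.oom_control file, which on cgroup v1 reports oom_kill_disable and under_oom as "name value" lines. The sketch below lists cgroups where the OOM killer is disabled and the cgroup is currently under OOM. The function names are hypothetical, and the default base path is assumed from the original message.

```shell
#!/bin/sh
# Print the value of one field from a cgroup v1 memory.oom_control file.
# Usage: oom_control_field <file> <field>
oom_control_field() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# List cgroups under <base>/condor* whose OOM killer is disabled while the
# cgroup is under OOM, i.e. whose tasks may be paused waiting for memory.
report_stuck_cgroups() {
    base=${1:-/sys/fs/cgroup/memory/htcondor}
    for f in "$base"/condor*/memory.oom_control; do
        [ -e "$f" ] || continue
        disabled=$(oom_control_field "$f" oom_kill_disable)
        under=$(oom_control_field "$f" under_oom)
        if [ "$disabled" = "1" ] && [ "$under" = "1" ]; then
            # Strip the file name so only the cgroup directory is printed.
            echo "${f%/memory.oom_control}"
        fi
    done
}
```

Calling `report_stuck_cgroups` with no argument inspects the default path. Any directory it prints is a candidate for a job whose OOM event has not yet been handled.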
Newer versions of HTCondor, and those running with cgroup v2, don't set
oom_kill_disable. Instead, they wait for the cgroup to die, and there is
first-class support in the cgroup for querying whether the OOM killer fired.
We hope this will be a more reliable method in the future.
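On cgroup v2, the query mentioned above can be made by reading the oom_kill counter in the cgroup's memory.events file. The following is a rough sketch of such a check, not HTCondor's own implementation. The function name oom_kill_count is hypothetical, and the cgroup directory is passed in by the caller because its location depends on the local setup.

```shell
#!/bin/sh
# Print how many times the OOM killer has fired in a cgroup v2 cgroup.
# Usage: oom_kill_count <cgroup-directory>
# memory.events holds "name value" lines; print 0 if oom_kill is absent.
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2; found = 1 }
         END { if (!found) print 0 }' "$1/memory.events"
}
```

A nonzero result means at least one process in that cgroup was killed by the OOM killer.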
Let us know how this goes,
-greg