Re: [HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none
- Date: Tue, 26 Dec 2023 15:22:35 -0600
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none
On 12/26/23 14:26, JM wrote:
Based on limited information from worker logs, the chance is very high
that the server indeed ran out of physical and swap memory. Multiple
jobs of the same (high) memory usage pattern hit the server at the
same time. One of them was terminated by the OOM killer. I was confused
by the startd log message about the job's memory usage threshold. The
message gave the impression that the job was killed by a policy. If I
remember correctly, more typical feedback from HTCondor is that the job
was terminated with return value 137.
Hi JM:
This is something that has changed in HTCondor. In the past, if cgroups
were not enabled, and the OOM killer killed a job (because the system as
a whole was out of memory), the job might exit the queue by default, as
it just looked to HTCondor like the job was killed with signal 9,
perhaps by something within the job proper.
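[As an aside for anyone decoding the numbers: "return value 137" is just the
shell-style encoding 128 + 9, where 9 is SIGKILL, the signal the OOM killer
sends. A minimal Python sketch of that mapping -- not HTCondor code, and the
command below is only a stand-in for a job that gets killed:

    import signal
    import subprocess

    # Stand-in for a job that gets SIGKILLed (e.g. by the OOM killer).
    proc = subprocess.run(["sh", "-c", "kill -KILL $$"])

    if proc.returncode < 0:        # Python reports "killed by signal N" as -N
        sig = -proc.returncode     # here: 9 (SIGKILL)
        print(f"killed by signal {sig} ({signal.Signals(sig).name}); "
              f"a shell or batch system reports this as exit code {128 + sig}")
]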
Our philosophy is that the job should not leave the queue if something
happened to it outside of its control. For example, if it is running on a
worker node that gets rebooted, by default the job should start again
somewhere else; it is not the job's fault the node was rebooted. Now, if the
OOM killer kills the process not because the job is over the per-cgroup
limit, but because the system as a whole is out of memory, we want to
treat that the same way.
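[To make the distinction concrete: with cgroup v2, a per-job cgroup records
OOM activity in memory.events, so a kill caused by the job's own memory.max
can be told apart from a kill by the system-wide OOM killer. A rough Python
sketch of that kind of check -- not HTCondor's actual implementation; the
cgroup path is hypothetical, and with CGROUP_MEMORY_LIMIT_POLICY = none the
job's memory.max stays at "max":

    from pathlib import Path

    # Hypothetical per-job cgroup path; the real layout depends on the site.
    cgroup = Path("/sys/fs/cgroup/htcondor/job_123")

    # memory.events has one "counter value" pair per line, e.g. "oom_kill 1".
    events = dict(
        (k, int(v))
        for k, v in (line.split() for line in
                     (cgroup / "memory.events").read_text().splitlines())
    )
    limit = (cgroup / "memory.max").read_text().strip()

    if events.get("oom_kill", 0) > 0:
        if limit == "max" or events.get("oom", 0) == 0:
            print("killed by the system-wide OOM killer (no per-job limit hit)")
        else:
            print("killed because the job exceeded its own cgroup memory limit")
]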
I agree that the message is confusing, and I'll work on cleaning that up.
Thanks,
-greg