Hi Charles et al.,
Some quick thoughts re the below noisy neighbor / cgroup memory
issues:
1. Are you running on Debian or Ubuntu? If so, note that a bug re
enforcement of memory limits on those Linux distros was just fixed
last month in HTCondor v10.0.1+ in the LTS channel, and HTCondor
v10.2.0+ in the feature channel. See the release notes, or the nerdy
details here:
https://opensciencegrid.atlassian.net/browse/HTCONDOR-1466, but
the upshot is that nothing related to memory enforcement was working
correctly on those distros until recently. (Other distros like
CentOS, Fedora, and Red Hat were fine.)
2. Personally I am not a fan of
"CGROUP_MEMORY_LIMIT_POLICY=soft". Setting the policy to "soft"
will always result in badly configured jobs penalizing good
citizens. Even worse, IMHO, it leads to non-deterministic and
unpredictable behavior from the perspective of end users, e.g.
"hey admin, my job ran just fine to completion last week but it
got killed this week, why???". Better to leave
CGROUP_MEMORY_LIMIT_POLICY=hard (the default), and make things
clear and predictable for users: if your job uses more memory than
you requested at submit time, it will be killed.
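For reference, here is a minimal config fragment that pins the default explicitly (a sketch, assuming a typical execute-node local config file; the knob itself is the one discussed above):

```
# condor_config.local on the execute node
# Kill jobs that exceed their requested memory (this is the default policy).
CGROUP_MEMORY_LIMIT_POLICY = hard
```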
3. See the Manual for the CGROUP_MEMORY_LIMIT_POLICY knob to
understand how the soft and hard limits are set (you can also
customize them yourself if you are an expert). The default 'hard'
policy sets the hard limit at the size of the slot and the soft
limit at 90% of the size of the slot.
4. By default, HTCondor should attempt to direct the OOM away from
jobs that are using less than 90% of the cgroup soft limit on the
slot --- considering the default soft limit is 90% of the slot
memory (see above), this effectively means that if the job is at or
below roughly 80% (0.9 x 0.9 = 81%) of its requested memory, it
won't be killed by the OOM.
Why go with 90% of the soft limit -vs- 100% you ask?
Unfortunately, it appears that the OOM killer sees a different
(larger) value for usage than reported by cgroups.... (not sure
why, perhaps it includes memory used in kernel data structures
etc). For the nerds, here is where that happens:
https://github.com/htcondor/htcondor/blob/main/src/condor_starter.V6.1/vanilla_proc.cpp#L1275-L1311
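To make the arithmetic above concrete, here is a sketch with hypothetical numbers (assuming the default soft limit of 90% of the slot size, as described in point 3):

```
# Example slot with request_memory = 10000 MB (hypothetical numbers):
#   soft limit          = 0.90 * 10000 MB = 9000 MB
#   OOM-steering cutoff = 0.90 *  9000 MB = 8100 MB  (~80% of the request)
# A job staying under ~8100 MB should be steered away from the OOM killer;
# a job exceeding the 10000 MB hard limit will be killed outright.
```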
5. Cgroup memory limits are limits, not reservations. By
default, HTCondor considers all the physical memory of your
machine as available to be used by HTCondor jobs. If some other
services/processes outside of HTCondor are pulling the "memory rug"
out from underneath the startd, all bets are off and who knows
what the OOM will kill. To tell HTCondor about memory consumption
for services running on the server outside of HTCondor, you really
must use the config knob RESERVED_MEMORY. Common memory stealing
culprits are other daemons running on the machine (web proxy
services, puppet/chef, etc), and/or shared filesystem services
including FUSE mounts etc.
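As an illustration, a fragment like the following (hypothetical value; IIRC RESERVED_MEMORY is specified in megabytes, so double-check the Manual for your version) tells the startd to keep some memory off the table for those services:

```
# condor_config.local: set aside ~4 GB of physical memory for
# non-HTCondor services (web proxy, puppet/chef, FUSE mounts, etc.)
RESERVED_MEMORY = 4096
```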
Hope the above ramblings are helpful,
Todd
On 2/15/2023 12:49 PM, Charles Goyard wrote: