Re: [HTCondor-users] Cgroups v2 and memory limits for WLCG sites
- Date: Thu, 8 Aug 2024 13:10:47 +0200
- From: Petr Vokac <petr.vokac@xxxxxxx>
- Subject: Re: [HTCondor-users] Cgroups v2 and memory limits for WLCG sites
Do the HTCondor developers also have an
opinion about cgroups v2 and how to "correctly" configure killing
jobs that reach `request_memory` from the JDL without sacrificing disk
I/O performance by dropping page cache unnecessarily aggressively
(e.g. when there is still a lot of free memory on the worker node
because other jobs don't usually use their full
`request_memory`)? It seems to me the OS should decide how to use
disk cache to get maximum performance.
Petr
On 7/25/24 10:23, Petr Vokac wrote:
Hi,
could you please clarify for us how to use memory limits with
HTCondor and cgroups v2? Do we understand correctly that cgroups
v2 also accounts page cache (e.g. disk buffers) to the job's (process
tree) memory? Such behavior makes cgroups v2 unusable for
enforcing memory limits, because it is unpredictable how much page
cache our jobs use (less stressed machine => potentially
more memory accounted to the job's cgroup).
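To make the accounting question concrete, here is a minimal Python sketch that splits a cgroup's charged memory into anonymous (process) memory and file-backed page cache. It relies only on the `anon` and `file` counters that the kernel exposes in a cgroup v2 `memory.stat` file. The helper names, the sample numbers and the cgroup path in the usage note are assumptions made for illustration, not something HTCondor provides.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup v2 memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def split_usage(stats):
    """Separate anonymous memory from file-backed page cache (both in bytes)."""
    anon = stats.get("anon", 0)
    file_cache = stats.get("file", 0)
    total = anon + file_cache
    return {
        "anon_bytes": anon,
        "file_bytes": file_cache,
        # Share of (anon + file) that is page cache; 0.0 for an empty cgroup.
        "file_fraction": file_cache / total if total else 0.0,
    }


def read_cgroup_usage(cgroup_dir):
    """Read memory.stat from a cgroup directory (the path is site-specific)."""
    with open(f"{cgroup_dir}/memory.stat") as fh:
        return split_usage(parse_memory_stat(fh.read()))


# Made-up sample: 1 GiB of process memory and 3 GiB of page cache.
SAMPLE = "anon 1073741824\nfile 3221225472\nkernel_stack 65536\n"
usage = split_usage(parse_memory_stat(SAMPLE))
```

On a worker node this could be called as `read_cgroup_usage("/sys/fs/cgroup/<job cgroup>")`, where the actual cgroup path depends on the local setup. Comparing `anon_bytes` with `request_memory` shows how much of the charged memory is reclaimable cache rather than process memory.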
What are our options to enforce reasonable memory limits?
- Don't enforce memory limits with cgroups v2 at all, as described
  in
  https://opensciencegrid.atlassian.net/browse/HTCONDOR-2521
- Sacrifice a bit of performance by aggressively dropping page
  cache with CGROUP_LOW_MEMORY_LIMIT. Which values should be used?
  Do you have an idea what the impact on performance is?
- Other options or recommendations? Could cgroups v2 be configured
  to enforce just process memory limits and not include page
  cache?
We have sites that moved to cgroups v2, and we started to observe
random job failures that are very tricky to understand; debugging
them is very time consuming. We can easily measure how
much memory our jobs need (e.g. scouting jobs estimating memory
usage), but page cache size is totally unpredictable to us, and this
seems to make cgroups v2 memory limits pretty unusable. We would
like to have clear and simple instructions for HTCondor batch,
because otherwise enforcing memory limits becomes an operational
nightmare on a distributed infrastructure where each site invents
its own solution (or even keeps killing jobs on page cache size).
Petr