Re: [HTCondor-users] More thoughts on memory limits
- Date: Thu, 12 Dec 2024 13:19:02 +0100 (CET)
- From: "Beyer, Christoph" <christoph.beyer@xxxxxxx>
- Subject: Re: [HTCondor-users] More thoughts on memory limits
Hi Brian et al,
sorry for the slight delay; I ran some more tests on the memory issue - here is what I do:
- start a job with steady memory consumption (stress binary, consuming roughly 1 GB of memory)
- on the worker node I read the memory consumption from /proc (VmSize) and from the cgroup (memory.current) at a 10-second interval (see the sketch after this list)
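In shell terms the sampling loop amounts to something like the following (a minimal sketch; PID and CGROUP are placeholders for the actual stress process and the job's cgroup v2 directory on the worker node):

    PID=12345                               # placeholder: PID of the stress process
    CGROUP=/sys/fs/cgroup/<job-cgroup>      # placeholder: cgroup v2 directory of the job
    while true; do
        echo "$(date) PROC: $(grep VmSize /proc/$PID/status) CGRP: $(cat $CGROUP/memory.current)"
        sleep 10
    done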
As you can see, memory.current oscillates wildly and is relatively useless here:
It seems to me that a lot more dynamics go into it than we actually need (?)
Thu Dec 12 01:16:34 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 788660224
Thu Dec 12 01:16:44 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 412344320
Thu Dec 12 01:16:54 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 7049216
Thu Dec 12 01:17:04 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 452481024
Thu Dec 12 01:17:14 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 496685056
Thu Dec 12 01:17:24 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 752803840
Thu Dec 12 01:17:34 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 702517248
Thu Dec 12 01:17:44 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 702472192
Thu Dec 12 01:17:54 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 1042812928
Thu Dec 12 01:18:04 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 206536704
Thu Dec 12 01:18:14 PM CET 2024 PROC: VmSize: 1027524 kB CGRP: 91041792
Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
----- Original Message -----
From: "Brian Bockelman" <BBockelman@xxxxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, December 4, 2024 15:14:44
Subject: Re: [HTCondor-users] More thoughts on memory limits
Hi Christoph,
From my relatively hazy memory, here's what I think the history is --
- The original design, maybe a decade ago, was to use memory.peak.
- Using memory.peak was fairly quickly reverted because it was counting various things that individuals found surprising (such as page cache).
- Up until 2024, the memory usage was based on the largest recorded value of memory.current which was polled every few seconds.
- During the cgroupsv2 transition, another attempt to go to memory.peak was made (esp. as the measurements by the kernel were slightly different).
- The second attempt at memory.peak was also reverted -- the pinch point this time was handling of processes that couldn't be killed (which are likely from prior jobs but still affecting the peak memory measurement of the current jobs).
- So we now poll memory.current and record the peak value; this time using cgroupsv2 interfaces instead of v1.
So, what you see today *should* be fairly close in spirit to the "max memory usage" recorded in 2023 (that is, it's approximately the maximum recorded value of memory.current polled every 5 seconds across the job lifetime). If that's not the behavior being observed (esp. if you see MemoryUsage ever go *down*), then that's indeed a horribly surprising bug.
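In shell terms the recorded value behaves roughly like this (a sketch of the semantics only, not the actual condor_starter code; PID and CGROUP are placeholders for the job's main process and its cgroup v2 directory):

    peak=0
    while kill -0 "$PID" 2>/dev/null; do       # while the job is still alive
        cur=$(cat "$CGROUP/memory.current")    # instantaneous value, fluctuates
        (( cur > peak )) && peak=$cur          # running maximum, never decreases
        sleep 5                                # the ~5 second poll mentioned above
    done
    echo "MemoryUsage corresponds (in spirit) to $peak bytes"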
If you wanted to see the current memory usage of the job, we would have to add a new attribute to show that!
Hope the trip down memory lane is useful,
Brian
> On Dec 4, 2024, at 12:10 AM, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
>
> Hi,
>
> we definitely need the broken slot code ASAP, as we deal endlessly with unkillable job executables. I had just planned this morning to whine about it here ;)
>
> Even more urgently, we need the max memory usage back in the job ClassAds and the history - couldn't you just add a new ClassAd attribute like memory.current and leave the old one as is?
>
> Best
> christoph
>
> --
> Christoph Beyer
> DESY Hamburg
> IT-Department
>
> Notkestr. 85
> Building 02b, Room 009
> 22607 Hamburg
>
> phone:+49-(0)40-8998-2317
> mail: christoph.beyer@xxxxxxx
>
> ----- Original Message -----
> From: "Greg Thain via HTCondor-users" <htcondor-users@xxxxxxxxxxx>
> To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
> CC: "Greg Thain" <gthain@xxxxxxxxxxx>
> Sent: Monday, December 2, 2024 23:59:02
> Subject: Re: [HTCondor-users] More thoughts on memory limits
>
> On 12/2/24 10:10 AM, Beyer, Christoph wrote:
>> Hi,
>>
>> memory.current might be interesting for someone, but memory.peak could nonetheless go into another job ClassAd attribute - not having access to it makes memory management pretty much impossible on many levels?
>
>
> Note that what happens today is that HTCondor polls memory.current,
> keeps the peak value internally, and reports that peak in the job
> ad. The polling frequency is controlled by the knob
> STARTER_UPDATE_INTERVAL.
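>
> (For illustration: the effective value on an execute node can be checked
> with the standard configuration query tool, e.g.
>
>     condor_config_val STARTER_UPDATE_INTERVAL
>
> which prints the configured interval in seconds, and it can be changed
> like any other configuration knob.)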
>
> We are adding support for the notion of a "broken" slot, so that if
> there is an unkillable process, the slot will go into the "broken"
> state. When this goes in, I think we can go back to using the
> cgroup.peak memory usage and reporting that.
>
>
> -greg
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/