
Re: [HTCondor-users] More thoughts on memory limits



Hi Christoph,

It's not just you.  Here, the JobRunCount is being used to find jobs that are not asking for enough memory.  In the previous major version of HTCondor, those same jobs went to a "held" state with a nice message saying that they exceeded their memory request.

JT


> On 5 Dec 2024, at 14:08, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
> 
> Ah,
> 
> thanks for the trip ;) 
> 
> OK so this means technically nothing has changed at the end - or should have changed. 
> 
> My test job is relatively primitive and uses the stress tool that comes with Linux. When I run it locally and look at the memory usage in the cgroups, the numbers are pretty accurate:
> 
> 
> [chbeyer@batch1074]~% stress --vm 1 --vm-bytes 512M
> stress: info: [3621355] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
> 
> [chbeyer@batch1074]~% ps -ef | grep stress
> chbeyer  3621355 3616992  0 13:59 pts/2    00:00:00 stress --vm 1 --vm-bytes 512M
> chbeyer  3621356 3621355 97 13:59 pts/2    00:04:10 stress --vm 1 --vm-bytes 512M
> 
> [chbeyer@batch1074]~% grep -i vm /proc/3621356/status                                         
> VmPeak:  527812 kB
> VmSize:  527812 kB
> VmLck:       0 kB
> VmPin:       0 kB
> VmHWM:  524404 kB
> VmRSS:     112 kB
> VmData:  524520 kB
> VmStk:     136 kB
> VmExe:      12 kB
> VmLib:    2104 kB
> VmPTE:      40 kB
> VmSwap:       0 kB
> 
> [chbeyer@batch1074]~% ps -aux | grep stress        
> chbeyer  3621355  0.0  0.0   3520  1536 pts/2    S    13:59   0:00 stress --vm 1 --vm-bytes 512M
> chbeyer  3621356 97.8  0.1 527812 363640 pts/2   R    13:59   7:25 stress --vm 1 --vm-bytes 512M
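
For comparing such readings side by side, here is a minimal Python sketch that pulls the Vm* fields out of /proc/<pid>/status text. The function names are made up for illustration. In the output above, VmHWM (the peak resident set size) is 524404 kB while VmRSS (the current resident set size) is only 112 kB, presumably because stress keeps allocating and freeing its buffer, so a single instantaneous reading can land far below the peak.

```python
def parse_vm_fields(status_text):
    """Return a dict of Vm* fields (in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not key.startswith("Vm"):
            continue  # skip non-memory lines such as Name, State, Pid
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB":
            fields[key] = int(parts[0])
    return fields


def read_vm_fields(pid):
    """Read and parse the status file of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_fields(handle.read())


# Values copied from the output quoted above.
sample = "Name:\tstress\nVmHWM:  524404 kB\nVmRSS:     112 kB\n"
vm = parse_vm_fields(sample)
print(vm["VmHWM"] - vm["VmRSS"])  # gap between peak and current RSS, in kB
```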
> 
> 
> The same thing run as a condor job gives more or less random numbers ...
> 
> I might well be wrong - some people claim I often am ;) 
> 
> Best
> christoph
> 
> -- 
> Christoph Beyer
> DESY Hamburg
> IT-Department
> 
> Notkestr. 85
> Building 02b, Room 009
> 22607 Hamburg
> 
> phone:+49-(0)40-8998-2317
> mail: christoph.beyer@xxxxxxx
> 
> ----- Original Message -----
> From: "Brian Bockelman" <BBockelman@xxxxxxxxxxxxx>
> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> Sent: Wednesday, 4 December 2024 15:14:44
> Subject: Re: [HTCondor-users] More thoughts on memory limits
> 
> Hi Christoph,
> 
> From my relatively hazy memory, here's what I think the history is --
> 
> - The original design, maybe a decade ago, was to use memory.peak.
> - Using memory.peak was fairly quickly reverted because it was counting various things that individuals found surprising (such as page cache).
> - Up until 2024, the memory usage was based on the largest recorded value of memory.current which was polled every few seconds.
> - During the cgroupsv2 transition, another attempt to go to memory.peak was made (esp. as the measurements by the kernel were slightly different).
> - The second attempt at memory.peak was also reverted -- the pinch point this time was handling of processes that couldn't be killed (which are likely from prior jobs but still affecting the peak memory measurement of the current jobs).
> - So we now poll memory.current and record the peak value; this time using cgroupsv2 interfaces instead of v1.
> 
> So, what you see today *should* be fairly close in spirit to the "max memory usage" recorded in 2023 (that is, it's approximately the maximum recorded value of memory.current polled every 5 seconds across the job lifetime).  If that's not the behavior being observed (esp. if you see MemoryUsage ever go *down*), then that's indeed a horribly surprising bug.
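
As a rough illustration of the scheme described above (this is not HTCondor's actual code), the Python sketch below keeps a running maximum over sampled memory.current values. The helper names and the cgroup directory argument are assumptions for illustration; the 5-second interval is the one mentioned in the message. A running maximum can never decrease, which is why a MemoryUsage value that goes down would be surprising.

```python
import time
from pathlib import Path


def read_memory_current(cgroup_dir):
    """Read the cgroup v2 memory.current file (bytes) of one cgroup."""
    return int(Path(cgroup_dir, "memory.current").read_text().strip())


def running_peak(samples):
    """Return the running maximum after each sample."""
    peak = 0
    peaks = []
    for value in samples:
        peak = max(peak, value)
        peaks.append(peak)
    return peaks


def poll_peak(read, polls, interval=5.0):
    """Call read() `polls` times, `interval` seconds apart, and keep the max."""
    peak = 0
    for i in range(polls):
        peak = max(peak, read())
        if i < polls - 1:
            time.sleep(interval)
    return peak
```

On a real execute node one would pass something like `lambda: read_memory_current(job_cgroup_dir)` as `read`, where `job_cgroup_dir` is a placeholder for the job's cgroup directory.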
> 
> If you wanted to see the current memory usage of the job, we would have to add a new attribute to show that!
> 
> Hope the trip down memory lane is useful,
> 
> Brian
> 
>> On Dec 4, 2024, at 12:10 AM, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
>> 
>> Hi,
>> 
>> we definitely need the broken slot code ASAP, as we deal endlessly with unkillable job executables. I had just planned this morning to whine about it here ;) 
>> 
>> We need the max memory usage back in the job ClassAds and history even more badly - couldn't you just add a new ClassAd attribute for memory.current and leave the old one as is? 
>> 
>> Best
>> christoph 
>> 
>> -- 
>> Christoph Beyer
>> DESY Hamburg
>> IT-Department
>> 
>> Notkestr. 85
>> Building 02b, Room 009
>> 22607 Hamburg
>> 
>> phone:+49-(0)40-8998-2317
>> mail: christoph.beyer@xxxxxxx
>> 
>> ----- Original Message -----
>> From: "Greg Thain via HTCondor-users" <htcondor-users@xxxxxxxxxxx>
>> To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
>> CC: "Greg Thain" <gthain@xxxxxxxxxxx>
>> Sent: Monday, 2 December 2024 23:59:02
>> Subject: Re: [HTCondor-users] More thoughts on memory limits
>> 
>> On 12/2/24 10:10 AM, Beyer, Christoph wrote:
>>> Hi,
>>> 
>>> memory.current might be interesting for someone, but memory.peak could nonetheless go into another job ClassAd attribute - not having access to it makes memory management pretty much impossible on many levels?
>> 
>> 
>> Note that what happens is that HTCondor today polls memory.current, 
>> keeps the peak value internally, and reports that peak in the job 
>> ad.  The polling frequency is controlled by the knob 
>> STARTER_UPDATE_INTERVAL.
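
One consequence of polling is worth spelling out: a spike that falls between two polls is never seen. The Python sketch below uses a synthetic one-value-per-second timeline (the numbers are invented, not measurements) to show how the reported peak depends on the polling interval.

```python
def sampled_peak(usage_per_second, interval):
    """Peak observed when sampling every `interval` seconds, starting at t=0."""
    return max(usage_per_second[::interval])


# Synthetic usage in MB, one value per second, with a one-second spike at t=3.
usage = [100, 100, 100, 900, 100, 100, 100, 100, 100, 100]

true_peak = max(usage)              # what a kernel-side peak counter would see
fast_poll = sampled_peak(usage, 1)  # polling every second catches the spike
slow_poll = sampled_peak(usage, 5)  # polling at t=0 and t=5 misses it
```

A shorter interval narrows the gap at the cost of more frequent reads; only a kernel-maintained peak counter closes it completely.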
>> 
>> We are adding support for the notion of a "broken" slot, so that if 
>> there is an unkillable process, the slot will go into the "broken" 
>> state.  When this goes in, I think we can go back to using the 
>> cgroup's memory.peak value and reporting that.
>> 
>> 
>> -greg
>> 
>> 
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> 
>> The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/