Before upgrade to latest version:
11/21/24 04:36:21 (pid:3493459) Process pid 3493687 was OOM killed
11/21/24 04:36:21 (pid:3493459) Process exited, pid=3493687, signal=9
11/21/24 04:36:21 (pid:3493459) Job was held due to OOM event: Job has gone over cgroup memory limit of 16384 megabytes. Peak usage: 16384 megabytes. Consider resubmitting with a higher request_memory.
After:
11/27/24 18:55:58 (pid:89032) Process pid 90492 was OOM killed
11/27/24 18:55:58 (pid:89032) Process exited, pid=90492, signal=9
11/27/24 18:55:58 (pid:89032) Evicting job because system is out of memory, even though the job is below requested memory: Usage is 59 Mb limit is 17179869184
The eviction does not result in a held job, so the job is simply restarted. Earlier today I killed a job with a restart count above 40.
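(Until the hold behaviour comes back, a submit-file guard along these lines should at least stop the endless restart loop - a sketch only, NumJobStarts is the standard job attribute counting starts and the threshold of 10 is arbitrary:

    # put the job on hold after too many restarts instead of cycling forever
    periodic_hold        = NumJobStarts > 10
    periodic_hold_reason = "Restarted too many times, probably OOM-evicted"

)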
JT
On 4 Dec 2024, at 11:18, Jeff Templon <templon@xxxxxxxxx> wrote:
Hi,
I'm wondering if this change has effectively disabled the memory limits. I no longer see held jobs with messages about memory being exceeded, but I do see job restarts with messages about "exhausted memory on worker node".
JT
On 4 Dec 2024, at 07:10, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi,
we definitely need the broken slot code asap as we deal endlessly with unkillable job executables. I had just planned this morning to whine about it here ;)
Even more urgently, we need the max memory usage back in the job ClassAds and history - couldn't you just add a new ClassAd attribute like memory.current and leave the old one as is?
Best christoph
-- Christoph Beyer DESY Hamburg IT-Department
Notkestr. 85 Building 02b, Room 009 22607 Hamburg
phone:+49-(0)40-8998-2317 mail: christoph.beyer@xxxxxxx
----- Original Message -----
From: "Greg Thain via HTCondor-users" <htcondor-users@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
CC: "Greg Thain" <gthain@xxxxxxxxxxx>
Sent: Monday, 2 December 2024, 23:59:02
Subject: Re: [HTCondor-users] More thoughts on memory limits
On 12/2/24 10:10 AM, Beyer, Christoph wrote:
Hi,
memory.current might be interesting for someone, but memory.peak could nonetheless go into another job ClassAd attribute - not having access to it makes memory management pretty much impossible on many levels.
Note that what happens today is that HTCondor polls memory.current, keeps the peak value internally, and reports that peak in the job ad. The polling frequency is controlled by the knob STARTER_UPDATE_INTERVAL.
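(As a sketch of what that looks like in practice - the knob goes in the condor_config, and the peak the starter has seen so far is what comes back as the MemoryUsage job attribute; the 300-second value below is only illustrative:

    # how often (seconds) the starter samples the slot cgroup's memory.current
    STARTER_UPDATE_INTERVAL = 300

    # the reported peak can then be inspected with, e.g.
    #   condor_q -af ClusterId ProcId MemoryUsage RequestMemory
    #   condor_history -limit 10 -af ClusterId MemoryUsage RequestMemory

)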
We are adding support for the notion of a "broken" slot, so that if there is an unkillable process, the slot will go into the "broken" state. When this goes in, I think we can go back to using the cgroup.peak memory usage and reporting that.
-greg
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/