Before upgrade to latest version:
11/21/24 04:36:21 (pid:3493459) Process pid 3493687 was OOM killed
11/21/24 04:36:21 (pid:3493459) Process exited, pid=3493687, signal=9
11/21/24 04:36:21 (pid:3493459) Job was held due to OOM event: Job has gone over cgroup memory limit of 16384 megabytes. Peak usage: 16384 megabytes. Consider resubmitting with a higher request_memory.
After:
11/27/24 18:55:58 (pid:89032) Process pid 90492 was OOM killed
11/27/24 18:55:58 (pid:89032) Process exited, pid=90492, signal=9
11/27/24 18:55:58 (pid:89032) Evicting job because system is out of memory, even though the job is below requested memory: Usage is 59 Mb limit is 17179869184
The eviction does not result in a held job, so the job is simply restarted. Earlier today I killed a job with a restart count above 40.
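(Until the hold behaviour comes back, a submit-file guard along these lines should at least stop the endless restart loop - a sketch only, NumJobStarts is the standard job attribute counting starts and the threshold of 10 is arbitrary:

    # put the job on hold after too many restarts instead of cycling forever
    periodic_hold        = NumJobStarts > 10
    periodic_hold_reason = "Restarted too many times, probably OOM-evicted"

)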
JT
On 4 Dec 2024, at 11:18, Jeff Templon <templon@xxxxxxxxx> wrote:
Hi,
I'm wondering if this change has effectively disabled the memory limits. I no longer see held jobs with messages about memory being exceeded, but I do see job restarts with messages about "exhausted memory on worker node".
JT
On 4 Dec 2024, at 07:10, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi,
we definitely need the broken slot code asap as we deal endlessly with unkillable job executables. I had just planned this morning to whine about it here ;)
Even more urgently, we need the max memory usage back in the job ClassAds and history - couldn't you just add a new ClassAd attribute like memory.current and leave the old one as is?
Best christoph
-- Christoph Beyer DESY Hamburg IT-Department
Notkestr. 85 Building 02b, Room 009 22607 Hamburg
phone:+49-(0)40-8998-2317 mail: christoph.beyer@xxxxxxx
----- Original Message -----
From: "Greg Thain via HTCondor-users" <htcondor-users@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
CC: "Greg Thain" <gthain@xxxxxxxxxxx>
Sent: Monday, 2 December 2024, 23:59:02
Subject: Re: [HTCondor-users] More thoughts on memory limits
On 12/2/24 10:10 AM, Beyer, Christoph wrote:
Hi,
memory.current might be interesting for someone, but memory.peak could nonetheless go into another job ClassAd attribute - not having access to it makes memory management pretty much impossible on many levels.
Note that what happens today is that HTCondor polls memory.current, keeps the peak value internally, and reports that peak in the job ad. The polling frequency is controlled by the knob STARTER_UPDATE_INTERVAL.
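(As a sketch of what that looks like in practice - the knob goes in the condor_config, and the peak the starter has seen so far is what comes back as the MemoryUsage job attribute; the 300-second value below is only illustrative:

    # how often (seconds) the starter samples the slot cgroup's memory.current
    STARTER_UPDATE_INTERVAL = 300

    # the reported peak can then be inspected with, e.g.
    #   condor_q -af ClusterId ProcId MemoryUsage RequestMemory
    #   condor_history -limit 10 -af ClusterId MemoryUsage RequestMemory

)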
We are adding support for the notion of a "broken" slot, so that if there is an unkillable process, the slot will go into the "broken" state. When this goes in, I think we can go back to using the cgroup.peak memory usage and reporting that.
-greg
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/