
[HTCondor-users] Job breaching memory in condor version 24.0.1 not getting held



Hello Experts,

A simple Python script allocates 400 GB of virtual memory and then touches the pages in a loop to grow its RSS. I expected the job to be held once it breached the 20 GB slot memory; instead it gets killed as soon as it breaches the memory, switching state from running to idle and then running again.
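For reference, a minimal sketch of the kind of test script I mean (sizes and pacing are illustrative, not the original code):

#!/usr/bin/env python3
# Reserve a large virtual mapping, then touch pages so RSS climbs
# past the slot limit. Sizes here are illustrative.
import mmap
import time

SIZE = 400 * 1024**3       # ~400 GiB of virtual address space
PAGE = 4096

buf = mmap.mmap(-1, SIZE)  # anonymous mapping; RSS stays small until touched

offset = 0
while offset < SIZE:
    buf[offset] = 1        # faulting a page in grows RSS
    offset += PAGE
    if offset % 1024**3 == 0:
        print(f"touched {offset // 1024**3} GiB")
        time.sleep(1)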

The following messages are reported in the slot log; the "85 Mb" usage value is misleading:

11/12/24 16:32:47 (pid:915626) Process pid 916134 was OOM killed
11/12/24 16:32:47 (pid:915626) Process exited, pid=916134, signal=9
11/12/24 16:32:47 (pid:915626) Evicting job because system is out of memory, even though the job is below requested memory: Usage is 85 Mb limit is 21026045952
11/12/24 16:32:47 (pid:915626) All jobs have exited... starter exiting


I know a periodic_remove expression can be set in the job definition (a sketch follows the settings below). What I want to know is the expected behavior when a job breaches its allocated memory with the following settings:

BASE_CGROUP = htcondor
CGROUP_IGNORE_CACHE_MEMORY = true
CGROUP_MEMORY_LIMIT_POLICY = hard
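
For reference, the submit-file workaround I am referring to looks roughly like this (MemoryUsage and RequestMemory are the standard job ClassAd attributes, both in MB):

# Hold the job when its measured usage passes its request:
periodic_hold        = MemoryUsage > RequestMemory
periodic_hold_reason = "memory usage exceeded request_memory"
# or drop it from the queue instead:
periodic_remove      = MemoryUsage > RequestMemory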

Earlier, on CentOS 7 and Rocky 8 machines, we used the following setting to put a job breaching memory into the held state. On Rocky 9 with HTCondor 24.0.1 it no longer makes a difference.

IGNORE_LEAF_OOM = False

If a job is not getting held, it should at least be removed from the queue.

I see the following in the cgroup output (memory.events) as soon as the job breaches memory:

low 0
high 0
max 18
oom 1
oom_kill 5   <<<<<
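
For context, those counters are from the cgroup v2 memory.events file of the job's cgroup. A quick way to watch them (the per-job path varies, so it is passed as an argument):

# Poll a cgroup v2 memory.events file, e.g. the job's slot cgroup
# somewhere under /sys/fs/cgroup/htcondor/ (exact path varies).
import sys
import time

events_path = sys.argv[1]  # path to the job cgroup's memory.events

while True:
    with open(events_path) as f:
        print(f.read().strip())
    print("---")
    time.sleep(2)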



Thanks & Regards,
Vikrant Aggarwal