[HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Tue, 26 Dec 2023 11:36:51 -0500

Subject: [HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none

HTCondor community,

In an HTCondor 23 LTS executor setup, all worker servers are identical and CGROUP_MEMORY_LIMIT_POLICY is set to none. Each worker server has a sizable swap partition. Jobs were randomly put on hold for exceeding memory limits.

Job 4882.11349 going into Hold state (code 34,0): Error from slot1_1@workerX: Job has gone over memory limit of 128 megabytes. Peak usage: 4346 megabytes.
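
For anyone comparing many such hold messages, here is a minimal Python sketch that pulls the reported limit and peak usage out of the text. The helper name `parse_memory_hold` is hypothetical, and the pattern assumes the exact wording of the message quoted above, which may differ between HTCondor versions.

```python
import re

# Pattern for the hold message quoted above; the wording is taken from this
# one example and may differ between HTCondor versions (assumption).
HOLD_RE = re.compile(
    r"memory limit of (?P<limit>\d+) megabytes\. "
    r"Peak usage: (?P<peak>\d+) megabytes"
)


def parse_memory_hold(reason):
    """Return (limit_mb, peak_mb) from a hold reason, or None if no match."""
    match = HOLD_RE.search(reason)
    if match is None:
        return None
    return int(match.group("limit")), int(match.group("peak"))


reason = (
    "Error from slot1_1@workerX: Job has gone over memory limit of "
    "128 megabytes. Peak usage: 4346 megabytes."
)
limit_mb, peak_mb = parse_memory_hold(reason)
print(f"limit={limit_mb} MB, peak={peak_mb} MB, over by {peak_mb - limit_mb} MB")
```

Run over the hold reasons of all held jobs, this would show whether the reported limit always equals the requested memory or varies from job to job.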

The worker server log showed that the Linux OOM killer terminated the job. The worker nodes are preemptible instances in a public cloud, so collecting actual memory usage for each worker is doable but not easy.

Can someone please advise whether the OOM kill happened because the server memory (physical + swap) was used up, or because another HTCondor knob killed the job for using more memory than it requested?

Thank you and happy holidays!

JM.
