[HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none



HTCondor community,

In an HTCondor 23 LTS executor setup, all worker servers are identical and CGROUP_MEMORY_LIMIT_POLICY is set to none. Each worker server has a sizable swap partition. Jobs were randomly put on hold for exceeding their memory limit:

Job 4882.11349 going into Hold state (code 34,0): Error from slot1_1@workerX: Job has gone over memory limit of 128 megabytes. Peak usage: 4346 megabytes.

The worker server log showed that the Linux OOM killer terminated the job. The worker nodes are preemptible instances in a public cloud, so collecting actual memory usage for each worker is not easy, though it is doable.

Can someone please advise whether the OOM kill happened because the server's memory (physical + swap) was used up, or because another HTCondor knob killed the job for using more memory than it requested?

Thank you and happy holidays!

JM.