HTCondor community,
In an HTCondor 23 LTS executor setup, all worker servers are identical and CGROUP_MEMORY_LIMIT_POLICY is set to none. Each worker server has a sizable swap partition. Jobs were randomly put on hold for exceeding their memory limit:
Job 4882.11349 going into Hold state (code 34,0): Error from slot1_1@workerX: Job has gone over memory limit of 128 megabytes. Peak usage: 4346 megabytes.
The worker server log showed that the Linux OOM killer terminated the job. The worker nodes are preemptible instances in a public cloud, so collecting actual memory usage for each worker is doable but not easy.
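To illustrate what collecting memory usage on a worker could look like, here is a minimal sketch that samples physical memory and swap usage from /proc/meminfo. It assumes a Linux kernel that reports the MemAvailable field; the function names parse_meminfo and memory_summary are hypothetical and not part of HTCondor.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for sized fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Return used physical memory and used swap, both in MiB."""
    used_ram_kb = info["MemTotal"] - info["MemAvailable"]
    used_swap_kb = info["SwapTotal"] - info["SwapFree"]
    return {
        "ram_used_mib": used_ram_kb // 1024,
        "swap_used_mib": used_swap_kb // 1024,
    }


if __name__ == "__main__":
    # Assumption: run directly on a worker node; output could be logged periodically.
    with open("/proc/meminfo") as f:
        print(memory_summary(parse_meminfo(f.read())))
```

Logging this at a regular interval on each worker would show whether physical memory plus swap was close to exhausted around the time a job was held.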
Can someone please advise whether the OOM kill happened because the server's memory (physical + swap) was used up, or whether another HTCondor knob killed the job because it used more memory than it requested?
Thank you and happy holidays!
JM.