Hi Thomas,

can you check if your jobs' cgroups have an OOM limit set in their cgroup limits rather than Condor's memory watchdog?
i.e., is there a limit set in a process's memory.limit_in_bytes? E.g., for us it looks like

  /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_25@xxxxxxxxxxxxxxxxx/memory.limit_in_bytes

but your Docker setup is probably on a different path. The path should be under the cgroup mount

  > mount | grep cgroup | grep memory

plus a job process's sub-path from

  > grep memory /proc/{PID}/cgroup
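Roughly, putting the two together could look like this (an untested sketch, assuming cgroup v1 with the memory controller mounted; PID is a placeholder for one of the job's processes, e.g. taken from the slot's process tree):

  # where the memory controller is mounted (usually /sys/fs/cgroup/memory)
  CGROOT=$(mount | grep cgroup | grep memory | awk '{print $3}')
  # the job process's memory cgroup sub-path, e.g. /system.slice/condor.service/condor_var_lib_condor_execute_slot...
  CGPATH=$(grep memory /proc/${PID}/cgroup | cut -d: -f3)
  # the effective hard limit in bytes; a very large value (~9.2e18) means no limit is set
  cat "${CGROOT}${CGPATH}/memory.limit_in_bytes"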
Cheers,
  Thomas

On 21/06/2023 12.41, Thomas Birkett - STFC UKRI via HTCondor-users wrote:

Hi Condor Community,

I have an odd issue with a small percentage of jobs we run. We have a small subset of jobs that go on hold due to resources being exceeded, for example:

  LastHoldReason = "Error from slot1_38@xxxxxxxxxxxxxxxxxxxxxxx: Docker job has gone over memory limit of 4100 Mb"

However, we haven't configured any resource limits to hold jobs. I also notice the only ClassAd that appears to match the memory limit is:

  MemoryProvisioned = 4100

These jobs are then removed by a SYSTEM_PERIODIC_REMOVE statement to clear down held jobs. My question to the community is: why is the job going on hold in the first place? The only configured removal limit / PeriodicRemove statement we configure is on a per-job level, shown below:

  PeriodicRemove = (JobStatus == 1 && NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)

I cannot replicate this behaviour in my testing, and I cannot find any reason why the job went on hold. Researching the relevant ClassAds, I see:

  MemoryProvisioned: The amount of memory in MiB allocated to the job. With statically-allocated slots, it is the amount of memory space allocated to the slot. With dynamically-allocated slots, it is based upon the job attribute RequestMemory, but may be larger due to the minimum given to a dynamic slot.

At our site we dynamically assign our slots, and the RequestMemory for this job is "RequestMemory = 4096". I find this even more perplexing as this is a very rare issue, with over 90% of the jobs working well and completing: same job type, same VO, same config. Any assistance debugging this issue will be gratefully received.

Many thanks,

Thomas Birkett
Senior Systems Administrator
Scientific Computing Department
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot OX11 0QX
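P.S.: To narrow it down from the Condor side, it might also be worth pulling the hold details and memory attributes for the affected jobs straight from the schedd, e.g. something along these lines (just a sketch; adjust the constraint and attribute list as needed, and use condor_history for jobs that have already left the queue):

  > condor_q -constraint 'JobStatus == 5' -af:j HoldReasonCode HoldReasonSubCode RequestMemory MemoryProvisioned HoldReason

If I remember correctly, a HoldReasonCode of 34 means the starter itself put the job on hold for exceeding its memory limit, i.e. independent of any PeriodicRemove / SYSTEM_PERIODIC_* policy.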