Hi Jean-Michel,

just an idea, but can you check whether the out-of-memory control is handled by the kernel or by Condor?
As far as I understand [1], something like

> cat /sys/fs/cgroup/memory/system.slice/condor.service/SLOT/memory.oom_control
oom_kill_disable 1
under_oom 0

should indicate that the kernel itself is not killing or stopping processes (though this might also depend on the parent cgroup's OOM settings, I am not sure).
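
In case it helps to check this on many slots at once, here is a minimal sketch in Python that parses the contents of a cgroup-v1 memory.oom_control file and reports whether the kernel OOM killer is disabled. The function names are made up for illustration, and the cgroup directory has to be filled in for your node. According to [1], with the OOM killer disabled, tasks that hit the limit are put to sleep in the cgroup's OOM wait queue instead of being killed.

```python
from pathlib import Path


def parse_oom_control(text):
    """Parse 'key value' lines of a cgroup-v1 memory.oom_control file."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            values[parts[0]] = int(parts[1])
    return values


def describe(values):
    """Summarise who is expected to act on an out-of-memory condition."""
    if values.get("oom_kill_disable") == 1:
        return "kernel OOM killer disabled for this cgroup"
    return "kernel OOM killer enabled for this cgroup"


def read_oom_control(cgroup_dir):
    """Read memory.oom_control from a cgroup directory (path is site-specific)."""
    return parse_oom_control((Path(cgroup_dir) / "memory.oom_control").read_text())


# Sample content matching the output shown above.
sample = "oom_kill_disable 1\nunder_oom 0\n"
print(describe(parse_oom_control(sample)))
```

On a worker node one would call read_oom_control() with the slot's cgroup directory instead of using the sample string.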

Cheers,
  Thomas

[1] https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

On 24/04/2020 08.43, Jean-Michel Barbet wrote:
Hello,

Having had worker nodes hang many times because of memory exhaustion, I am trying to figure out how we can prevent this. I believe the memory exhaustion is due to some kind of pathological job using far more memory than it should.

The first question would be: does it make sense to use SYSTEM_PERIODIC_REMOVE in the config of a worker node (startd), or does it work only on the scheduler (thus reacting with a certain delay)?

Then, I tried different settings of CGROUP_MEMORY_LIMIT_POLICY. I understand that the default setting is "none". In this case, in /sys/fs/cgroup/memory/htcondor/condor_dlocal_htcondor_slot1\@worker, "memory.limit_in_bytes" is set to the node's detected memory divided by the number of cores, and "memory.soft_limit_in_bytes" is 0.

I tried setting CGROUP_MEMORY_LIMIT_POLICY to "soft". It seems to do its job, with jobs being removed with "Job has gone over memory limit of 6000 megabytes. Peak usage: 5926 megabytes."

BUT: the result on the worker nodes is a number of processes in "Deferred" status, which gives a high Unix load even though no CPU is consumed. No new jobs are scheduled. It looks like the jobs are not killed cleanly.

I am now trying with "hard". Let's see...

I have read this presentation:
https://research.cs.wisc.edu/htcondor/HTCondorWeek2017/presentations/WedDownes_cgroups.pdf
... but I do not understand everything. Sorry.

This is HTCondor version 8.6.13.

Also, please note that I have made it so that the threshold is higher than the detected memory:

  MEMORY = 1.5 * quantize( $(DETECTED_MEMORY), 1000 )
  MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory,100)

Thank you in advance.

JM
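
To make the arithmetic of the two settings above concrete, here is a small sketch in Python. It assumes that quantize(a, b) with two numeric arguments rounds a up to the next multiple of b, as the ClassAd function does; the input values of 64215 MB detected memory and 2048 MB requested memory are hypothetical and not taken from the actual nodes.

```python
def quantize(value, step):
    """Round a positive integer up to the next multiple of step."""
    return ((value + step - 1) // step) * step


# Hypothetical inputs, for illustration only.
detected_memory_mb = 64215
request_memory_mb = 2048

# MEMORY = 1.5 * quantize( $(DETECTED_MEMORY), 1000 )
memory_mb = 1.5 * quantize(detected_memory_mb, 1000)

# MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory,100)
quantized_request_mb = quantize(request_memory_mb, 100)

print(memory_mb)             # 97500.0
print(quantized_request_mb)  # 2100
```

With these example numbers the advertised memory is about 1.5 times what the node actually has, so the sum of the per-slot limits can exceed the physical memory.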