Hello,
Our worker nodes have repeatedly hung because of memory exhaustion, and
I am trying to figure out how we can prevent this. I believe the memory
exhaustion is caused by some kind of pathological job using far more
memory than it should.
My first question: does it make sense to use
SYSTEM_PERIODIC_REMOVE in the config of a worker node (startd), or does it
work only on the scheduler (and thus react with a certain delay)?
Next, I tried different settings of CGROUP_MEMORY_LIMIT_POLICY.
I understand that the default setting is "none". In this case, in
/sys/fs/cgroup/memory/htcondor/condor_dlocal_htcondor_slot1\@worker,
"memory.limit_in_bytes" is set to the node's detected memory divided by
the number of cores and "memory.soft_limit_in_bytes" is 0.
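For reference, this is roughly how the values above can be read out, as a minimal Python sketch. It assumes a cgroup v1 memory controller; the function name read_cgroup_memory is only illustrative, and the directory to inspect is passed in by the caller.

```python
from pathlib import Path

# Standard cgroup v1 memory controller file names.
FILES = (
    "memory.limit_in_bytes",
    "memory.soft_limit_in_bytes",
    "memory.max_usage_in_bytes",
)


def read_cgroup_memory(cgroup_dir):
    """Return {file name: value in bytes} for the files present in cgroup_dir."""
    values = {}
    for name in FILES:
        path = Path(cgroup_dir) / name
        if path.is_file():  # skip files this cgroup does not expose
            values[name] = int(path.read_text().strip())
    return values
```

Calling read_cgroup_memory() on the slot directory quoted above should return the hard limit and the soft limit side by side, which makes it easy to compare the "none", "soft" and "hard" policies.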
I tried setting CGROUP_MEMORY_LIMIT_POLICY to "soft". It seems to do its
job, with jobs being removed with "Job has gone over memory limit of 6000
megabytes. Peak usage: 5926 megabytes." BUT the result on the worker
nodes is a number of processes in "Deferred" status, which gives a high
Unix load even though no CPU is consumed. No new jobs are scheduled.
It looks like the jobs are not killed cleanly.
I am now trying with "hard". Let's see...
I have read this presentation:
https://research.cs.wisc.edu/htcondor/HTCondorWeek2017/presentations/WedDownes_cgroups.pdf
... but I do not understand everything. Sorry.
This is HTCondor version 8.6.13. Also, please note that I have made
it so that the threshold is higher than the detected memory:
MEMORY = 1.5 * quantize( $(DETECTED_MEMORY), 1000 )
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory,100)
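To make the arithmetic explicit, here is a minimal Python sketch of what I understand these two lines to compute. It assumes that quantize() rounds its first argument up to the next multiple of the second; the input values are hypothetical and not taken from a real node.

```python
def quantize(value, step):
    """Round a positive integer up to the next multiple of step
    (assumed behaviour of the ClassAd quantize() function)."""
    return ((value + step - 1) // step) * step


# Hypothetical inputs, in megabytes.
detected_memory = 31900
request_memory = 2048

# MEMORY = 1.5 * quantize( $(DETECTED_MEMORY), 1000 )
memory = 1.5 * quantize(detected_memory, 1000)

# MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory,100)
rounded_request = quantize(request_memory, 100)

print(memory, rounded_request)
```

With these made-up numbers, the node would advertise 48000 MB although only 31900 MB were detected, and a request of 2048 MB would be rounded up to 2100 MB.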
Thank you in advance.
JM
--
------------------------------------------------------------------------
Jean-michel BARBET                    | Tel: +33 (0)2 51 85 84 86
Laboratoire SUBATECH Nantes France  | Fax: +33 (0)2 51 85 84 79
CNRS-IN2P3/Ecole des Mines/Universite | E-Mail: barbet@xxxxxxxxxxxxxxxxx
------------------------------------------------------------------------