Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7
- Date: Fri, 20 Oct 2017 11:26:48 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7
On 10/20/2017 9:44 AM, Alessandra Forti wrote:
Hi,
is more information needed?
Hi Alessandra,
The version of HTCondor you are using would be helpful :).
But I have some answers/suggestions below that I hope will help...
* On the head node
RemoveMemoryUsage = ( ResidentSetSize_RAW > 2000*RequestMemory )
SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage) || <OtherParameters>
So the questions are two:
1) Why didn't SYSTEM_PERIODIC_REMOVE work?
Because the (system_)periodic_remove expressions are evaluated by the 
condor_shadow while the job is running, and the *_RAW attributes are 
only updated in the condor_schedd.
A simple solution is to use the attribute MemoryUsage instead of ResidentSetSize_RAW.  So I think things will work as you want if you instead did:
  RemoveMemoryUsage = ( MemoryUsage > 2*RequestMemory )
  SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage)  || <OtherParameters>
Note that MemoryUsage is in the same units as RequestMemory, so you only need to multiply by 2 instead of 2000.
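To make the unit change concrete, here is a worked example with made-up numbers, assuming the usual HTCondor units (MemoryUsage and RequestMemory in megabytes, ResidentSetSize_RAW in kilobytes):
  # Hypothetical job with request_memory = 2048 (MB):
  #   new test fires when  MemoryUsage         >    2 * 2048 = 4096    (MB)
  #   old test fired when  ResidentSetSize_RAW > 2000 * 2048 = 4096000 (KB)
  # i.e. roughly the same threshold, just expressed in different units.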
You are not the first person to be tripped up by this. :(  I realize it is not at all intuitive.  I think I will add a quick patch to the code to allow _RAW attributes to be referenced inside job policy expressions, to spare the next person the same frustration.
Also, you may want to place your memory limit policy on the execute nodes via a startd policy expression, instead of having it enforced on the submit machine (what I think you are calling the head node).  The reason is that the execute node policy is evaluated every five seconds, while the submit machine policy is evaluated only every several minutes.  A runaway job could consume a lot of memory in a few minutes :).
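For reference, a minimal startd policy sketch along those lines (the names, comments, and hold reason are illustrative, not the one true recipe, so adapt to your pool and test before deploying).  MemoryUsage is the job's measured usage and Memory is what was provisioned in the slot:
  MEMORY_EXCEEDED = ( MemoryUsage =!= UNDEFINED && MemoryUsage > Memory )
  # evict the offending job rather than suspend it...
  PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
  WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
  # ...and put it on hold with a reason, so it does not simply rematch elsewhere
  WANT_HOLD = ($(MEMORY_EXCEEDED))
  WANT_HOLD_REASON = ifThenElse( $(MEMORY_EXCEEDED), "Job exceeded its request_memory", undefined )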
2) Shouldn't htcondor set the job soft limit with this configuration?  Or is the site expected to set the soft limit separately?
Personally, I think "soft" limits in cgroups are completely bogus.  The way the Linux kernel treats soft limits does not do in practice what anyone (including htcondor itself) expects.  I recommend setting CGROUP_MEMORY_LIMIT to either none or hard; soft makes no sense imho.
"CGROUP_MEMORY_LIMIT=hard" is clear to understand: if the job uses more 
memory than it requested, it is __immediately__ kicked off and put on 
hold.  This way users get a consistent experience.
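In config terms that is just the following on the execute nodes (using the knob name as written above):
  # enforce request_memory as a hard cgroup limit; over-use puts the job on hold
  CGROUP_MEMORY_LIMIT = hard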
If you want jobs to be able to go over their requested memory so long as 
the machine isn't swapping, consider disabling swap on your execute 
nodes (not a bad idea for compute servers in general) and simply leaving 
"CGROUP_MEMORY_LIMIT=none".  What will happen is if the system is 
stressed, eventually the Linux OOM (out of memory killer) will kick in 
and pick a process to kill.  HTCondor sets the OOM priority of job 
process such that the OOM killer should always pick job processes ahead 
of other processes on the system.  Furthermore, HTCondor "captures" the 
OOM request to kill a job and only allows it to continue if the job is 
indeed using more memory than requested (i.e. provisioned in the slot). 
This is probably what you wanted by setting the limit to soft in the 
first place.
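A sketch of that variant, with the swap change noted only as a comment since it is an OS-level step rather than an HTCondor knob:
  # no hard per-job cap; rely on the OOM killer plus HTCondor's OOM handling
  CGROUP_MEMORY_LIMIT = none
  # (and disable swap on the execute node itself, e.g. swapoff plus removing
  #  the swap entry from /etc/fstab; that part is not HTCondor configuration)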
I am thinking we should remove the "soft" option to CGROUP_MEMORY_LIMIT in future releases; it just causes confusion imho.  Curious if others on the list disagree...
Hope the above helps,
regards,
Todd