Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7
- Date: Mon, 23 Oct 2017 17:43:49 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7
On 10/23/2017 10:07 AM, Alessandra Forti wrote:
1) Why didn't SYSTEM_PERIODIC_REMOVE work?
Because the (system_)periodic_remove expressions are evaluated by the 
condor_shadow while the job is running, and the *_RAW attributes are 
only updated in the condor_schedd.
A simple solution is to use attribute MemoryUsage instead of 
ResidentSetSize_RAW. So I think things will work as you want if you 
instead did:
 RemoveMemoryUsage = ( MemoryUsage > 2*RequestMemory )
 SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage) || <OtherParameters>
Let me get this straight: if I replace ResidentSetSize_RAW with 
MemoryUsage, it should work?
Yes, that is correct, with MemoryUsage it should work.
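A purely defensive variant of the same two lines (untested; it just makes 
the "no data reported yet" case explicit):

 RemoveMemoryUsage = (MemoryUsage =!= UNDEFINED) && (MemoryUsage > 2*RequestMemory)
 SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage) || <OtherParameters>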
Also, you may want to place your memory limit policy on the execute 
nodes via a startd policy expression, instead of having it enforced on 
the submit machine (what I think you are calling the head node). The 
reason is that the execute-node policy is evaluated every five seconds, 
while the submit-machine policy is evaluated only every several minutes.
I read that the submit machine evaluates the expression every 60 seconds 
since version 7.4 (though admittedly the blog I read is quite old, so 
things might have changed again: 
https://spinningmatt.wordpress.com/2009/12/05/cap-job-runtime-debugging-periodic-job-policy-in-a-condor-pool/)
But realize that there is a lot of polling going on here.  The 
condor_starter on the execute machine (worker node) will poll the 
operating system for the resource utilization of the job, and send 
updated job attributes like MemoryUsage to both the condor_startd and 
the condor_shadow every STARTER_UPDATE_INTERVAL seconds (300 by 
default).  Then, for a running job, the condor_shadow will evaluate your 
SYSTEM_PERIODIC_REMOVE expression every PERIODIC_EXPR_INTERVAL seconds 
(60 by default).  The condor_shadow will also push updated job 
attributes up to the condor_schedd every SHADOW_QUEUE_UPDATE_INTERVAL 
seconds (900 by default).
The above polling/update intervals default to these values to limit the 
update rates, so that one schedd can manage many thousands of live jobs.
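For reference, these are the config knobs involved, shown here at the 
defaults just mentioned (they can be lowered for faster reaction, at the 
cost of more update traffic per running job):

 STARTER_UPDATE_INTERVAL = 300
 PERIODIC_EXPR_INTERVAL = 60
 SHADOW_QUEUE_UPDATE_INTERVAL = 900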
So... given the above, note the default config means your 
SYSTEM_PERIODIC_REMOVE expression could take up to 5 or 6 minutes (up to 
300 seconds for the starter to report, plus up to 60 for the shadow to 
evaluate) before it removes a large-memory job.  And if you are 
monitoring the MemoryUsage and/or ResidentSetSize job attributes via 
condor_q, it can take 15 minutes (up to 20, once the 900-second 
shadow-to-schedd update is added) for condor_q to show a MemoryUsage spike.
A runaway job could consume a lot of memory in a few minutes :).
Do you mean I should move SYSTEM_PERIODIC_REMOVE to the WN, or is there 
another recipe?
Yes, if you have control over the config of the worker node, it may be 
better to configure the worker node to simply kill a job that exceeds 
your memory policy, instead of waiting for the memory usage information 
to propagate back to the submit node.  The worker node would (by 
default) kill the job either immediately, if using cgroups with the hard 
memory policy, or within 5 seconds, if you use a custom PREEMPT 
expression that can say things like "only kill the job if it is using 2x 
the provisioned memory" (I still don't understand why you want to allow 
the job to use twice the memory it requested...).
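As a rough, untested sketch of what that worker-node config could look 
like (assuming the slot's provisioned memory is the machine ClassAd 
attribute Memory, and that MemoryUsage is reported in the same units, MB):

 # Option 1: hard per-job cgroup memory limit, enforced by the kernel
 CGROUP_MEMORY_LIMIT_POLICY = hard

 # Option 2: startd policy that evicts only above 2x the provisioned memory
 MEMORY_EXCEEDED = (MemoryUsage =!= UNDEFINED) && (MemoryUsage > 2 * Memory)
 # OR this into whatever PREEMPT expression you already have
 PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
 # go straight to eviction rather than suspension
 WANT_SUSPEND = False
 # optionally put the evicted job on hold rather than letting it run again
 WANT_HOLD = ($(MEMORY_EXCEEDED))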
Hope the above helps,
Todd
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685