Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7
- Date: Mon, 23 Oct 2017 17:43:49 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7
On 10/23/2017 10:07 AM, Alessandra Forti wrote:
1) Why didn't SYSTEM_PERIODIC_REMOVE work?
Because the (system_)periodic_remove expressions are evaluated by the
condor_shadow while the job is running, and the *_RAW attributes are
only updated in the condor_schedd.
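(As a quick way to compare the two views, you can ask the schedd what it currently has for both attributes; the job ID below is hypothetical:

  condor_q 123.0 -af ResidentSetSize_RAW MemoryUsage

If the job is still running, the *_RAW value shown there may lag well behind what the shadow is working with.)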
A simple solution is to use the attribute MemoryUsage instead of
ResidentSetSize_RAW. So I think things will work as you want if you
instead do:
 RemoveMemoryUsage = ( MemoryUsage > 2*RequestMemory )
 SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage) || <OtherParameters>
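One defensive refinement, a sketch only (the guard against UNDEFINED is my
assumption, to cover the window before the starter's first update has
populated MemoryUsage; <OtherParameters> remains your placeholder):

 RemoveMemoryUsage = ( MemoryUsage =!= UNDEFINED && MemoryUsage > 2*RequestMemory )
 SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage) || <OtherParameters>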
Let me get this straight: if I replace ResidentSetSize_RAW with
MemoryUsage, it should work?
Yes, that is correct, with MemoryUsage it should work.
Also, you may want to place your memory limit policy on the execute
nodes via a startd policy expression, instead of having it enforced on
the submit machine (what I think you are calling the head node). The
reason is that the execute node policy is evaluated every five seconds,
while the submit machine policy is evaluated only every several minutes.
I read that the submit machine evaluates the expression every 60 seconds
since version 7.4 (though admittedly the blog I read is quite old, so
things might have changed again:
https://spinningmatt.wordpress.com/2009/12/05/cap-job-runtime-debugging-periodic-job-policy-in-a-condor-pool/)
But realize that there is a lot of polling going on here. The
condor_starter on the execute machine (worker node) polls the
operating system for the resource utilization of the job, and sends
updated job attributes like MemoryUsage to both the condor_startd and
the condor_shadow every STARTER_UPDATE_INTERVAL seconds (300 seconds by
default). Then, for a running job, the condor_shadow evaluates your
SYSTEM_PERIODIC_REMOVE expression every PERIODIC_EXPR_INTERVAL seconds
(60 by default). The condor_shadow also pushes updated job
attributes up to the condor_schedd every SHADOW_QUEUE_UPDATE_INTERVAL
seconds (900 seconds by default).
The above polling/update parameters default to these values in order to
limit update rates, so that one schedd can manage many thousands
of live jobs.
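If you control a small pool and want faster reaction on the submit side,
these knobs can in principle be tightened; a sketch (the values are
illustrative assumptions, not recommendations):

 STARTER_UPDATE_INTERVAL = 120         # starter -> startd/shadow updates (default 300)
 PERIODIC_EXPR_INTERVAL = 60           # shadow evaluates periodic policy (default 60)
 SHADOW_QUEUE_UPDATE_INTERVAL = 300    # shadow -> schedd job ad updates (default 900)

Lowering them trades faster policy reaction for more update traffic on the
schedd.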
So... given the above, note that the default config means your
SYSTEM_PERIODIC_REMOVE expression could take up to 5 or 6 minutes before
it removes a large-memory job. And if you are monitoring the MemoryUsage
and/or ResidentSetSize job attributes via condor_q, it will take 15
minutes (up to 20 minutes) for condor_q to show a MemoryUsage spike.
A runaway job could consume a lot of memory in a few minutes :).
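To spell out the worst-case arithmetic behind those numbers, using the
defaults quoted above:

  removal:  STARTER_UPDATE_INTERVAL (300s) + PERIODIC_EXPR_INTERVAL (60s) = 360s, about 6 minutes
  condor_q: STARTER_UPDATE_INTERVAL (300s) + SHADOW_QUEUE_UPDATE_INTERVAL (900s) = 1200s, i.e. 20 minutes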
Do you mean I should move SYSTEM_PERIODIC_REMOVE to the WN? Or is there
another recipe?
Yes, if you have control over the config of the worker node, it may be
better to configure the worker node to simply kill a job that exceeds
your memory policy, instead of waiting for the memory usage information
to propagate back to the submit node. The worker node would (by
default) kill the job either immediately, if using cgroups with the hard
memory policy, or within 5 seconds, if you write a custom PREEMPT
expression that could state things like "only kill if the job is using 2x
the provisioned memory" (though I still don't understand why you want to
allow the job to use twice the memory it requested...).
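A sketch of what that worker-node policy could look like, under the
assumption that MemoryUsage (job ad) and Memory (slot ad) are both in MB;
the 2x factor mirrors your policy above:

 # Option 1: cgroup hard limit, job is killed as soon as it exceeds its request
 CGROUP_MEMORY_LIMIT_POLICY = hard

 # Option 2: custom startd policy, evaluated every ~5 seconds, allowing 2x
 MEMORY_EXCEEDED = ( TARGET.MemoryUsage =!= UNDEFINED && \
                     TARGET.MemoryUsage > 2 * MY.Memory )
 PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
 WANT_SUSPEND = FALSE   # vacate rather than suspend when the policy fires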
Hope the above helps,
Todd
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
HTCondor Technical Lead
Center for High Throughput Computing, Department of Computer Sciences
University of Wisconsin-Madison
1210 W. Dayton St. Rm #4257, Madison, WI 53706-1685
Phone: (608) 263-7132