Re: [HTCondor-users] out-of-memory event?
- Date: Thu, 12 Oct 2017 10:11:47 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] out-of-memory event?
On 10/12/2017 9:17 AM, Michael Di Domenico wrote:
> ever since i upgraded our condor pool from 8.4.x to 8.6.1 a lot (but
> not all) of my jobs are getting put on hold with "job has encountered
> an out-of-memory event".
> there were a lot of condor/system changes at the same time, so it's
> certainly possible that a config setting got changed.
> the problem is i can't seem to locate which knob or knobs produce this error.
> our config is fairly generic and we use most of the default condor settings.
Assuming you are on Linux... One thing that changed between v8.4.x and
v8.6.x is that cgroup support is now enabled by default, which allows
HTCondor to more accurately track the amount of memory your job uses
during its lifetime. On an execute node that put your job on hold, what
does

    condor_config_val -dump CGROUP PREEMPT

say? I am interested in the values of CGROUP_MEMORY_LIMIT_POLICY and
BASE_CGROUP (see the Manual for details on these knobs), and in whether
your machines are configured to PREEMPT jobs that use more memory than
provisioned in the slot. These settings could tell HTCondor to put jobs
on hold that use more memory than allocated in the slot.
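
If you want to script this check across several execute nodes, here is a minimal Python sketch that picks the relevant knobs out of the dump text. It assumes the dump prints one NAME = value pair per line; the function name memory_knobs and the sample text are made up for illustration and are not real output from any pool.

```python
import re

# Knob-name fragments worth inspecting, per the message above.
INTERESTING = ("CGROUP", "PREEMPT")


def memory_knobs(dump_text):
    """Return {name: value} for settings whose name mentions CGROUP or PREEMPT.

    Assumption: each setting appears as `NAME = value` on its own line.
    """
    knobs = {}
    for line in dump_text.splitlines():
        match = re.match(r"^\s*([A-Za-z_][\w.]*)\s*=\s*(.*)$", line)
        if match is None:
            continue  # skip comments, blank lines, anything not NAME = value
        name, value = match.group(1), match.group(2).strip()
        if any(fragment in name.upper() for fragment in INTERESTING):
            knobs[name] = value
    return knobs


# Illustrative input only; hypothetical values, not taken from a real pool.
sample = """\
# Configuration from machine: execute-node.example
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = hard
PREEMPT = false
START = true
"""

print(memory_knobs(sample))
```

On a real execute node the input text could come from something like subprocess.run(["condor_config_val", "-dump", "CGROUP", "PREEMPT"], capture_output=True, text=True).stdout, provided the output matches the layout assumed above.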
So what is likely happening is that your job is using more memory than
is allocated to the slot, and you will need to increase request_memory
in your job submit file. If you specify log=<file> in your submit file,
the log file will state the maximum memory your job used (for jobs that
completed). While it is tailored somewhat to specific policies for
UW-Madison users, you may find the guide below useful:
http://chtc.cs.wisc.edu/helloworld.shtml
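
As a rough illustration of reading the memory figure back out of the job log, the Python sketch below scans log text for the memory row of a job-terminated event and proposes a larger request_memory. It assumes the row looks like "Memory (MB) : <usage> <request> <allocated>"; the exact layout may differ between HTCondor versions. The function names, the 20% headroom, and the sample log are all assumptions made for illustration.

```python
import re

# Assumed layout of the memory row in a job-terminated event:
#     Memory (MB)          :     1843     1024      1024
# with the three columns being usage, request, and allocated.
MEMORY_ROW = re.compile(r"^\s*Memory \(MB\)\s*:\s*(\d+)\s+(\d+)\s+(\d+)\s*$")


def memory_rows(log_text):
    """Collect usage/request/allocated (MB) from every matching row."""
    rows = []
    for line in log_text.splitlines():
        match = MEMORY_ROW.match(line)
        if match:
            usage, request, allocated = (int(g) for g in match.groups())
            rows.append(
                {"usage": usage, "request": request, "allocated": allocated}
            )
    return rows


def suggest_request_memory(rows, headroom_pct=20):
    """Peak observed usage plus headroom, rounded up to whole MB.

    The 20% default is an arbitrary illustrative margin, not an HTCondor rule.
    Returns None when there are no rows to go on.
    """
    if not rows:
        return None
    peak = max(row["usage"] for row in rows)
    return (peak * (100 + headroom_pct) + 99) // 100  # integer ceiling


# Made-up log excerpt for illustration; not from a real job.
sample_log = """\
005 (123.000.000) 10/12 10:11:47 Job terminated.
    (1) Normal termination (return value 0)
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :       12       30   4051358
       Memory (MB)          :     1843     1024      1024
...
"""

rows = memory_rows(sample_log)
print(rows, suggest_request_memory(rows))
```

Since only completed jobs report this figure, a job that was held for an out-of-memory event would need to be rerun with a generous request_memory before its log shows a usable number.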
regards,
Todd
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685