On 10/12/2017 12:29 PM, Michael Di Domenico wrote:
> On Thu, Oct 12, 2017 at 11:11 AM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>> Assuming you are on Linux... One thing that changed between v8.4.x and
>> v8.6.x is that in v8.6.x cgroup support is enabled by default, which
>> allows HTCondor to more accurately track the amount of memory your job
>> uses during its lifetime. On an execute node that put your job on hold,
>> what does condor_config_val -dump CGROUP PREEMPT say? I am interested
>> in the values of CGROUP_MEMORY_LIMIT_POLICY and BASE_CGROUP (see the
>> Manual for details on these knobs), and in whether your machines are
>> configured to PREEMPT jobs that use more memory than provisioned in
>> the slot. These settings could tell HTCondor to put jobs on hold that
>> use more memory than allocated in the slot.
>
> yes this is linux, rhel 7.4 to be specific
>
> cgroup_preempt doesn't show up
> base_cgroup = htcondor
> cgroup_memory_limit_policy = none
>
> we actually set preempt to false everywhere. we don't want jobs to
> preempt for any reason unless a person specifically says to.
Hmmm... Well, given that your cgroup_memory_limit_policy is none and preempt is false, my guess is that HTCondor is not killing the jobs for using more memory than the slot provides; instead, the operating system itself is killing these jobs because the system as a whole is running out of memory.
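If you want to confirm that, the kernel logs every OOM kill, so on an execute node that held a job you could grep the kernel log for it. A generic Linux check, nothing HTCondor-specific:

    # look for OOM-killer activity in the kernel log
    # (on systemd hosts, "journalctl -k" shows the same messages)
    dmesg | grep -i -E 'out of memory|oom-killer'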
Are these execute nodes also running other services that consume memory besides HTCondor? For instance, here at UW we have execute nodes set up with no swap space that are also running squid proxies, gluster clients, and cvmfs... all services that consume memory. By default, HTCondor assumes it can allocate all of the system memory into slots for use by jobs (not such a great default, IMHO). So, for instance, if a server with 16 GB is also running the gluster client service, which is using 2 GB, then obviously you could run into trouble if HTCondor also tries to start a 16 GB job on the machine. Also be aware that HTCondor tells the system OOM killer to favor killing jobs over other processes if the system starts running out of memory (the idea here is that we don't want the OOM killer to decide to kill the condor_startd instead of a job!).
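You can see that bias for yourself via the per-process knob the kernel uses for OOM victim selection; on an execute node, something like the following (where <job_pid> is a placeholder for a real job PID taken from ps or condor_q -run) should show the job process carrying a higher adjustment than the startd:

    # compare the OOM score adjustment of a job process vs. the startd;
    # range is -1000..1000, and higher means "kill this one first"
    cat /proc/<job_pid>/oom_score_adj
    cat /proc/$(pidof condor_startd)/oom_score_adj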
So you may want to set the RESERVED_MEMORY knob in your condor config file on your execute machines (cut-n-pasted description from the manual below). Here at UW we have RESERVED_MEMORY=2048 or so.
RESERVED_MEMORY
    How much memory (in MB) would you like reserved from HTCondor? By default, HTCondor considers all the physical memory of your machine as available to be used by HTCondor jobs. If RESERVED_MEMORY is defined, HTCondor subtracts it from the amount of memory it advertises as available.
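In practice that's a one-line addition to the local config on each execute node. The 2048 below is just what we use here; size it to whatever your squid/gluster/cvmfs footprint actually is:

    # reserve 2 GB of physical memory for non-HTCondor services
    RESERVED_MEMORY = 2048

Then run condor_reconfig (or restart the startd) on the node so the new value takes effect.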
Hope the above helps,
Todd