Having used partitionable slots since the first installation
of our pool back in February 2013, I can jump in here with some useful
information for you.
Our setup is a basic partitionable slot config in
which there's only one type-1 slot which advertises 100% of cpu, memory,
and disk. We used it on every member of the pool until recently, when I
ran into trouble getting parallel universe jobs to cooperate with partitionable
slots and so I set up a handful of machines with static slots to support
MATLAB Parallel Computing Toolbox MPI communicating jobs for distributed
arrays, parfor, and the like.
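For reference, the heart of that config is just a few lines, something like this (a sketch; exact knobs vary by version and taste):

    # One partitionable slot advertising the whole machine
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%, mem=100%, disk=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE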
All the pool members are RHEL6 systems, so they use
the cgroups system for resource tracking, which is quite spiffy.
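The cgroup side needs only a couple of knobs; a minimal sketch, assuming the htcondor cgroup exists and the memory controller is mounted:

    # Track job resource usage with cgroups, and enforce memory softly
    BASE_CGROUP = htcondor
    CGROUP_MEMORY_LIMIT_POLICY = soft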
We had to go with partitionable slots because the jobs we needed to run were so diverse. Some jobs needed 500MB of memory, and others might need 20GB, depending on the nature of the scenario. Some of the fancier jobs needed multiple CPUs, and some of the continuous integration build scripts ran "make -j 8" for 8 compile threads. We've even got some jobs using GPUs, again with a wide range of memory requirements. We also have stacks of MATLAB jobs, under a few different versions of MATLAB, again with a wide range of memory requirements. Partitionable slots were really the only choice for us.
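All of that variety lives in the submit descriptions rather than in the slot layout; hypothetical snippets for a few of those job types might look like:

    # Small single-core job
    request_cpus   = 1
    request_memory = 512 MB

    # Continuous-integration build running "make -j 8"
    request_cpus   = 8
    request_memory = 20 GB

    # GPU job (assumes GPUs are advertised on the execute nodes)
    request_gpus   = 1
    request_memory = 4 GB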
"HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx>
wrote on 09/04/2015 03:05:37 AM:
> From: Mathieu Bahin <mathieu.bahin@xxxxxxxxxxxxxxx> > To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx> > Date: 09/04/2015 03:06 AM > Subject: Re: [HTCondor-users] Priority calculation:
memory > Sent by: "HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx> >
> Thanks Greg for this quick and precise answer, maybe we won't take the
> risk to adjust that then.
>
> Actually, we wonder how things will be with the partitionable slots.
> From what we understand:
> - a default max memory is allocated to the job if nothing special is
> specified
> - if the job exceed this memory, the job is aborted
By default, jobs exceeding their memory request are not aborted; you need to write a periodic hold expression to do that. Page 211 of the 8.2.9 manual shows an example of how to do it. The CGROUP_MEMORY_LIMIT_POLICY setting governs how memory allocations are handled, but it doesn't affect eviction of jobs - see page 243 of the 8.2.9 manual. Docker universe jobs in 8.3, and the upcoming 8.4, do have a lethal electric fence for memory allocations, however.
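If you do want the abort-on-overage behavior, a minimal sketch of a schedd-side policy, assuming the stock MemoryUsage and RequestMemory job attributes, looks roughly like:

    # Hold running jobs once their measured memory exceeds what they requested
    SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (MemoryUsage =!= undefined) && (MemoryUsage > RequestMemory)
    SYSTEM_PERIODIC_HOLD_REASON = "Job exceeded its requested memory"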
In RHEL6 and up, the kernel's "out of memory killer" protects the system from crashing when it exhausts all physical memory, so that's the main defense we use in our pools. Thanks to the OOM killer's "principle of least surprise," it's almost always the overly bloated process which gets nailed, and the kill shows up in the syslog so it's easy to diagnose. Under RHEL5, without cgroup-aware OOM handling, the system would either panic or thrash in swap space, so it was virtually impossible to figure out what went wrong.
Since we were migrating from Grid Engine, it would
have been far too disruptive to kill jobs exceeding our default 1GB memory
allocation, because at the outset virtually none of the jobs or the users
submitting them had any idea of how much memory the jobs needed. It was
pretty routine for Grid Engine to fire up 24 jobs on 24 cores which needed
10GB each, on a system with 48GB of physical memory, and it took me two
weeks to figure out how to configure it to treat memory as a consumable
resource. Occasionally a job tried to allocate dozens or hundreds of TERAbytes
of memory by allocating an array based on dimensions in uninitialized variables,
and it would dutifully suck down gigabyte after gigabyte of physical memory
for several minutes until the machine hung or crashed, and nobody could
figure out the root cause until HTCondor, cgroups, and the OOM killer came along.
Now that more and more users are getting the hang
of everything, the memory requests are much more accurate and the partitionable
slots work like a charm.
One interesting thing you can do is set up your RequestMemory expression to vary based on NumJobStarts - if a job gets OOM-killed and restarts, its NumJobStarts will have increased, and you can use a ClassAd expression to raise the amount of memory the job requests on its next run.
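A minimal sketch of that in a submit file, just bumping a made-up 1GB starting request on any restart (scale it further if jobs may need several tries):

    # 1024 MB on the first run, 2048 MB on any subsequent run
    request_memory = ifThenElse(NumJobStarts =?= undefined || NumJobStarts == 0, 1024, 2048)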
Adjusting the slot weight isn't "risky" in the traditional sense - the issue is that you need to come up with a mathematical expression that arrives at a fair assessment of each user's resource use and works across a range of values. For instance, someone using a single CPU and 32GB of memory on a machine which has 512GB of memory is not having the same kind of impact on pending jobs as someone using 32 CPUs with 1GB of memory each on six different machines. Assessing the utilization of those two users fairly can be difficult to get just right. I've never changed SLOT_WEIGHT in any of our pools, and it's been rare to encounter situations where I even thought about doing it.
I could imagine doing something based on the amount of physical memory per available CPU core - detected memory divided by detected cores - and then charging someone two slot weights if they use one CPU core but two cores' worth of memory. You'd probably want to use NUM_CPUS rather than detected cores, since not all cores may be advertised, and with RESERVED_MEMORY not all memory might be either, i.e., memory_per_core = ( $(DETECTED_MEMORY) - $(RESERVED_MEMORY) ) / $(NUM_CPUS) -- but then if you're advertising more cores than are physically available, such as for Todd's suspendable-slot example, you'd have to finagle that further. Would you want to charge them two if they only used one and a half CPUs' worth of memory? Etc., etc...
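As a rough, untested sketch of that idea (DETECTED_MEMORY, RESERVED_MEMORY, and NUM_CPUS are real config macros; MEMORY_PER_CORE is a name made up here):

    # Charge for whichever is larger: the cores in the slot, or the "cores' worth"
    # of memory the slot is holding
    MEMORY_PER_CORE = ( ($(DETECTED_MEMORY) - $(RESERVED_MEMORY:0)) / $(NUM_CPUS) )
    SLOT_WEIGHT = ifThenElse( real(Memory) / $(MEMORY_PER_CORE) > Cpus, ceiling( real(Memory) / $(MEMORY_PER_CORE) ), Cpus )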
> The cluster is composed of machines with very different caracteristics
> (memory from [8G, 8 cores] to [192G, 16 cores]) so it's not easy to
> setup a default memory.
It's not so much the characteristics of the machines as the characteristics of the typical jobs that should guide the choice of a default memory size.
We went with 1GB, since the majority of the jobs use about 500-1000 megabytes of physical memory. Even though most of the machines had 4GB per CPU core, the 1GB default carved out enough of an allocation that it limited the impact of a job or two ballooning unexpectedly - on a 24-core/96GB machine with 23 jobs behaving well, the one ballooning job could grow to roughly 70GB of physical memory before waking the OOM killer and risking being killed. Remember, the default memory request is not a fatal barrier to jobs by default.
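The default itself is a single knob; for example, assuming the 8.x-era name:

    # Jobs which don't set request_memory get 1024 MB by default
    JOB_DEFAULT_REQUESTMEMORY = 1024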
As you get the hang of things, and look at the UserLog
files for the "requested / used" numbers reported at job completion,
you'll be able to help your users dial in the right number for their memory
requests.
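If combing UserLogs gets tedious, a hypothetical condor_history one-liner (recent versions have the -af autoformat option) can pull the same requested-versus-used numbers in bulk:

    condor_history -constraint 'MemoryUsage =!= undefined' -af Owner RequestMemory MemoryUsage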
For the 8GB/8-core machine, a 1GB default would never let an 8th job match, because due to overhead the pslot won't advertise a full 8192 MB, so you may want to go with 750MB there instead. If it's a desktop system, however, you'll want to leave that 8th slot unoccupied anyway, since I've found desktop machines can get a bit thrashed with every core running a CPU-intensive job, even at nice 19.
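A hedged sketch of holding resources back on such a desktop node (the numbers are just examples for an 8-core/8GB box):

    # Advertise one core fewer than the machine has, and keep some memory for the console user
    NUM_CPUS = 7
    RESERVED_MEMORY = 1024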
> What we are afraid of is that users, tired with having jobs aborted,
> always request a very large amount of memory for their jobs.
>
> Have we misunderstood something? Do you have some advice about that?
I hope the above is just what you need.
It's interesting to watch the graph of claimed/busy slots when jobs requesting larger amounts of memory are queued up - the systems with 24 cores and 96GB get only twelve 8GB jobs, so half the cores sit idle and the graph drops off, but the pool's memory is still fully utilized.