Re: [HTCondor-users] Dynamic slots
- Date: Mon, 03 Feb 2014 12:34:53 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Dynamic slots
On 1/31/2014 4:41 PM, Shrum, Donald C wrote:
> I have a cluster of machines that are dedicated to HTCondor.
> I've read some on dynamic slots; specifically this powerpoint:
> http://research.cs.wisc.edu/htcondor/CondorWeek2012/presentations/thain-dynamic-slots.pdf
> as well as this http://research.cs.wisc.edu/htcondor/manual/v8.0/3_5Policy_Configuration.html#sec:SMP-dynamicprovisioning
> I've enabled whole machine jobs on our cluster. I presume if I use dynamic slots I'll do away with the configuration for whole machine jobs.
> Is that the case and is using dynamic slots a better practice? Any input would be appreciated.
As with most things in life, the answer is "it depends". :)
A "whole machine" static slot configuration, as described at
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=WholeMachineSlots
has some real shortcomings. A big one is the all-or-nothing approach,
where the setup typically can give a job either one core or all the
cores. This can be a bummer depending on your typical job mix. If you
have, for instance, 32-core servers and a job mix that wants a
combination of 1-, 8-, and 16-core jobs, then using dynamic slots will
likely allow much better utilization because you will be able to pack in
a lot more jobs on your server. Dynamic slots are also nice because
cores are not the only "axis" they are concerned about - maybe, for
instance, your mix of 1/8/16-core jobs is further split into large
memory vs small memory. Because dynamic slots are created "best fit"
with respect to both cores and memory (and any other server resources as
well), that could be a nice win as well. Finally, dynamic slots result
in a simpler configuration - the "whole machine" config at the above URL
is pretty complicated, both for a human to debug/tweak and also for
"condor_q -analyze" to give helpful results.
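
To make the packing argument concrete, here is a toy sketch in Python.
It is not HTCondor code: the Job class, both functions, and the memory
figures are hypothetical and exist only for illustration. It compares a
simplified "one core or all cores" static policy against a dynamic
policy that carves out exactly what each job requests, in queue order,
along both the cores and memory axes.

```python
from dataclasses import dataclass


@dataclass
class Job:
    cpus: int
    memory_gb: int


def pack_dynamic(jobs, total_cpus, total_memory_gb):
    """Carve a slot per job, in queue order, while cores AND memory remain."""
    free_cpus, free_mem = total_cpus, total_memory_gb
    placed = []
    for job in jobs:
        if job.cpus <= free_cpus and job.memory_gb <= free_mem:
            free_cpus -= job.cpus
            free_mem -= job.memory_gb
            placed.append(job)
    return placed


def busy_cores_one_or_all(jobs, total_cpus):
    """Toy static policy: a multi-core job claims the whole machine,
    otherwise single-core jobs fill the single-core slots."""
    multi = [j for j in jobs if j.cpus > 1]
    if multi:
        # The whole machine is claimed, but only the requested cores do work.
        return min(multi[0].cpus, total_cpus)
    return min(len(jobs), total_cpus)


# Assumed example: a 32-core, 128 GB server and a 1/8/16-core job mix.
jobs = [Job(16, 64), Job(8, 16)] + [Job(1, 2) for _ in range(8)]

dynamic_busy = sum(j.cpus for j in pack_dynamic(jobs, 32, 128))
static_busy = busy_cores_one_or_all(jobs, 32)
print("dynamic busy cores:", dynamic_busy)
print("static busy cores:", static_busy)
```

In this made-up mix the dynamic policy keeps every core busy, while the
toy static policy leaves half the machine idle behind the 16-core job. A
job that asks for more memory than the server has is never placed, which
is the second "axis" at work.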
But dynamic slots have their downsides as well. First off, the user has
to tell HTCondor what their job needs in terms of cores, memory, etc.,
and what they tell HTCondor has real implications. Unfortunately, many
users simply have no idea what their jobs require or can effectively
utilize, so sometimes static slots created by a system admin who is
familiar with the workloads of the organization (esp if the cluster is
used for a repetitive/predictable workload) could be better. Also,
dynamic slots currently do not work with startd RANK policies (i.e. if
you have machines that need to prefer certain types of jobs), but we are
currently working to fix that shortcoming.
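
As a rough illustration of what "telling HTCondor what the job needs"
looks like, the Python sketch below renders a minimal submit description
using the request_cpus, request_memory, and request_disk submit
commands. The render_submit helper, the executable name, and the numbers
are hypothetical; the values assume the default units of MiB for memory
and KiB for disk.

```python
def render_submit(executable, cpus, memory_mb, disk_kb):
    """Return the text of a minimal submit description with resource requests."""
    lines = [
        f"executable = {executable}",
        f"request_cpus = {cpus}",
        f"request_memory = {memory_mb}",
        f"request_disk = {disk_kb}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical 8-core job asking for 16 GiB of memory and 1 GiB of disk.
submit_text = render_submit("my_job.sh", cpus=8, memory_mb=16384, disk_kb=1048576)
print(submit_text)
```

If the numbers are too small the job may not get the resources it needs,
and if they are too large the job takes room that other jobs could have
used, which is why guessing carries real consequences.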
Another complication with dynamic slots is starvation. For instance, a
simple dynamic slot setup could result in multicore jobs starving
(waiting forever) if there is an infinite supply of incoming single core
jobs. The whole-machine-slots static recipe above gets around this
problem by always prioritizing whole-machine jobs; if a whole-machine
job matches, the server will then immediately "drain" out all the single
core jobs (i.e. not start any new single core jobs while waiting for
existing single core jobs to complete). This "always prioritize large
machine jobs" strategy is not ideal, and thus most whole-machine-slot
sites deal with this by only setting up some percentage of their servers
with a whole-machine-slot policy. Draining costs utilization; you will
have slots sitting around idle waiting for all the single-core jobs to
exit. But either job preemption (i.e. killing a job before it is done
and starting it over) or server draining is the price one must pay in
order to avoid starvation of larger core jobs. Dynamic slot configs are
usually set up to do draining with the help of the condor_defrag service
as described here http://goo.gl/Qh8UXu (or via some other external
application-aware service issuing the condor_drain command-line tool),
but note that the condor_defrag daemon is going to drain some percentage
of servers regardless of whether there are multi-core jobs submitted or not.
The condor_defrag service also produces better results if your cluster
of servers is more homogeneous, since in a very heterogeneous server pool
it is possible the condor_defrag service may drain machines that no
multicore jobs want.
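
The starvation-versus-draining tradeoff can be seen in a small
simulation. This is a toy model in Python, not how HTCondor or
condor_defrag is implemented: the function name, the fixed job runtime,
and the staggered finish times are all assumptions. It models one server
that is full of single-core jobs, an endless queue of further
single-core jobs, and one multi-core job waiting for enough free cores.

```python
def multicore_start(total_cores, need, finish_times, drain,
                    runtime=4, horizon=100):
    """Return (start_step, idle_core_steps) for the waiting multi-core job.

    finish_times holds the step at which each running single-core job exits.
    start_step is None if the job never starts within `horizon` steps.
    """
    running = list(finish_times)
    idle_core_steps = 0
    for step in range(horizon):
        # Jobs whose finish time has arrived exit and free their core.
        running = [t for t in running if t > step]
        free = total_cores - len(running)
        if free >= need:
            return step, idle_core_steps
        if not drain:
            # No draining: every freed core is refilled from the endless
            # single-core queue, so free cores never accumulate.
            running += [step + runtime] * free
            free = 0
        # While draining, freed cores sit idle until the big job fits.
        idle_core_steps += free
    return None, idle_core_steps


# Assumed scenario: 8-core server, 4-core job waiting, staggered exits.
finish = [1, 2, 3, 4, 5, 6, 7, 8]
print("no draining:", multicore_start(8, 4, finish, drain=False))
print("draining:   ", multicore_start(8, 4, finish, drain=True))
```

Without draining the 4-core job never starts in this scenario; with
draining it starts, but only after some cores have sat idle, which is
the utilization cost described above.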
All in all, I think most sites that have switched to dynamic slots
feel it is a definite improvement (esp in increased utilization), but
there is not a clear and obvious winner in every case, and balancing out
your draining policy can be tricky. Down the road we hope to keep
enhancing HTCondor to make things easier/smarter, and to create better
tools to communicate these tradeoffs happening on a cluster more
directly to administrators.
I know the above is not a clear answer and probably more than you
wanted, but hopefully it will help give an idea of the tradeoffs involved.
regards,
Todd
> Thanks and have a good weekend.
> Donny Shrum
> FSU RCC
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685