
Re: [HTCondor-users] Dynamic slots



On 1/31/2014 4:41 PM, Shrum, Donald C wrote:
I have a cluster of machines that are dedicated to HTCondor.

I've read some on dynamic slots; specifically this powerpoint:
http://research.cs.wisc.edu/htcondor/CondorWeek2012/presentations/thain-dynamic-slots.pdf

as well as this http://research.cs.wisc.edu/htcondor/manual/v8.0/3_5Policy_Configuration.html#sec:SMP-dynamicprovisioning

I've enabled whole machine jobs on our cluster.  I presume if I use dynamic slots I'll do away with the configuration for whole machine jobs.

Is that the case and is using dynamic slots a better practice?   Any input would be appreciated.

As with most things in life, the answer is "it depends". :)

A "whole machine" static slot configuration, as described at
 https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=WholeMachineSlots
has some real shortcomings. A big one is the all-or-nothing approach: the setup typically can give a job either one core or all the cores. This can be a bummer depending on your typical job mix. If you have, for instance, 32-core servers and a job mix that wants a combination of 1-, 8-, and 16-core jobs, then using dynamic slots will likely allow much better utilization because you will be able to pack a lot more jobs onto each server. Dynamic slots are also nice because cores are not the only "axis" they are concerned with; maybe, for instance, your mix of 1/8/16-core jobs is further split into large-memory vs small-memory jobs. Because dynamic slots are created "best fit" with respect to both cores and memory (and any other server resources as well), that could be a nice win too. Finally, dynamic slots result in a simpler configuration: the "whole machine" config at the above URL is pretty complicated, which makes it hard both for a human to debug/tweak and for "condor_q -analyze" to give helpful results.
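
To make the packing argument concrete, here is a toy Python sketch. It is not HTCondor code: the two functions, the job queue, and the first-fit rule are all illustrative assumptions, and it only models cores, not memory. It compares how many cores of a 32-core server end up busy under an all-or-nothing policy versus carving off exactly what each job requests.

```python
def whole_machine_busy_cores(jobs, total_cores=32):
    """Toy static policy: a job gets either one core or the whole machine."""
    multicore = [cores for cores in jobs if cores > 1]
    if multicore:
        # One multicore job claims the entire server; cores it cannot use sit idle.
        return min(multicore[0], total_cores)
    # Otherwise the server runs as many single-core jobs as it has cores.
    return min(len(jobs), total_cores)


def dynamic_busy_cores(jobs, total_cores=32):
    """Toy dynamic policy: carve off the requested cores while the job still fits."""
    free = total_cores
    for cores in jobs:
        if cores <= free:
            free -= cores
    return total_cores - free


# Hypothetical queue: requested cores per job, in arrival order.
queue = [16, 8, 1, 1, 1, 1, 8]

print(whole_machine_busy_cores(queue))  # 16
print(dynamic_busy_cores(queue))        # 28
```

With this made-up queue the all-or-nothing policy keeps 16 of 32 cores busy, while the carve-what-you-need policy keeps 28 busy; the real numbers for a site depend entirely on its job mix.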

But dynamic slots have their downsides as well. First off, users have to tell HTCondor what their jobs need in terms of cores, memory, etc., and what they tell HTCondor has real implications. Unfortunately, many users simply have no idea what their jobs require or can effectively utilize, so sometimes static slots created by a system admin who is familiar with the workloads of the organization (esp. if the cluster is used for a repetitive/predictable workload) could be better. Also, dynamic slots currently do not work with startd RANK policies (i.e. if you have machines that need to prefer certain types of jobs), but we are currently working to fix that shortcoming.

Another complication with dynamic slots is starvation. For instance, a simple dynamic slot setup could result in multicore jobs starving (waiting forever) if there is an infinite supply of incoming single-core jobs. The whole-machine-slots static recipe above gets around this problem by always prioritizing whole-machine jobs; if a whole-machine job matches, the server will then immediately "drain" out all the single-core jobs (i.e. not start any new single-core jobs while waiting for existing single-core jobs to complete). This "always prioritize large machine jobs" strategy is not ideal, and thus most whole-machine-slot sites deal with this by only setting up some percentage of their servers with a whole-machine-slot policy. Draining costs utilization; you will have slots sitting around idle waiting for all the single-core jobs to exit. But either job preemption (i.e. killing a job before it is done and starting it over) or server draining is the price one must pay in order to avoid starvation of larger-core jobs. Dynamic slot configs are usually set up to do draining with the help of the condor_defrag service as described here http://goo.gl/Qh8UXu (or via some other external application-aware service issuing the condor_drain command-line tool), but note that the condor_defrag daemon is going to drain some percentage of servers regardless of whether any multicore jobs have been submitted. The condor_defrag service also produces better results if your cluster of servers is more homogeneous, since in a very heterogeneous server pool it is possible the condor_defrag service may drain machines that no multicore jobs want.
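
The starvation-versus-draining tradeoff can be seen in a small simulation. This is only a sketch under simplifying assumptions: one server, a multicore job that needs every core, fixed-length single-core jobs, and a hypothetical function name and parameters; it does not model how condor_defrag actually selects machines.

```python
def ticks_until_multicore_starts(remaining, drain, max_ticks=1000, job_len=4):
    """Simulate one server whose cores each run a single-core job.

    remaining: ticks left for the single-core job on each core.
    drain: if True, freed cores are held idle instead of being refilled.
    Returns (start_tick, idle_core_ticks); start_tick is None if the
    multicore job never got the whole server within max_ticks.
    """
    remaining = list(remaining)
    idle_core_ticks = 0
    for tick in range(1, max_ticks + 1):
        # Every running single-core job makes one tick of progress.
        for core, left in enumerate(remaining):
            if left > 0:
                remaining[core] = left - 1
        # The multicore job can start only once every core is free.
        if all(left == 0 for left in remaining):
            return tick, idle_core_ticks
        for core, left in enumerate(remaining):
            if left == 0:
                if drain:
                    idle_core_ticks += 1       # core sits idle while draining
                else:
                    remaining[core] = job_len  # endless supply of single-core jobs
    return None, idle_core_ticks


staggered = [1, 2, 3, 4]  # hypothetical remaining runtimes on a 4-core server

print(ticks_until_multicore_starts(staggered, drain=False))  # (None, 0)
print(ticks_until_multicore_starts(staggered, drain=True))   # (4, 6)
```

Without draining, the staggered single-core jobs keep refilling freed cores and the multicore job never starts; with draining it starts after 4 ticks, at the price of 6 idle core-ticks.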

All in all, I think most sites that have switched to dynamic slots feel it is a definite improvement (esp. in increased utilization), but there is not a clear and obvious winner in every case, and balancing out your draining policy can be tricky. Down the road we hope to keep enhancing HTCondor to make things easier/smarter, and to create better tools that communicate the tradeoffs happening on a cluster more directly to administrators.

I know the above is not a clear answer and is probably more than you wanted, but hopefully it will help give an idea of the tradeoffs involved.

regards,
Todd


Thanks and have a good weekend.

Donny Shrum
FSU RCC





--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685