
Re: [HTCondor-users] Arbitrarily limit the number of dynamic slots



Hi Todd, Todd, and Cole,

thanks for the sound advice, and happy new year!


Todd Tannenbaum wrote:
> To achieve the above desired policy, you could leverage the fact that all
> slots have an attribute "TotalSlots" which is the number of slots being
> advertised by that startd. So perhaps just a startd START expression to
> look at this, keeping in mind that the partitionable slot itself counts as
> one, so to have at most 6 jobs running you want TotalSlots to stay at 7
> or below.
> 
> So you could drop the following into your config (and do a condor_reconfig or restart of the startd):
> 
> # Set up the Execution Point (EP, i.e. the startd) to use dynamic slots, but
> # never run more than 6 jobs at most (no matter how the EP is carved up).
> use feature:PartitionableSlot
> START = $(START) && ( TotalSlots <= 7 )


Plain and simple, it just works. And it will let me define a different
slot count per machine profile.
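
As a quick sanity check, plain condor_status autoformat makes the limit
easy to watch:

condor_status -af Name TotalSlots

With the config above, the reported TotalSlots should never exceed 7.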

Thanks for the tip and the extensive explanation.


To answer Todd L Miller's question, here are a few more details about
our workloads.

We run compositing, 3D rendering, and physics simulation tasks for
visual effects in movies, all with in-house software.

When submitting a job to the HTCondor queue, the VFX artists can
specify the amount of CPU and memory they need for the task, with
sensible default values. HTCondor does the matching and everything is
nice. But our software does not take these settings into account to
tune the thread scheduler (TBB for the most part).
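
For reference, a submit file from an artist ends up looking roughly
like this (the executable and scene names are made-up placeholders, not
our real tools):

universe       = vanilla
executable     = /studio/bin/render
arguments      = shot_042.scene
request_cpus   = 4
request_memory = 64GB
queue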

For example, if you submit a task with cpu=4 and memory=64GB, and you
end up on a machine with 32 cores/128GB of RAM, our software will create
32 threads, not just 4.

What I found is that, thanks to various inefficiencies in the
multithreading code and the fact that CPUs sometimes wait on memory or
IO, running multiple jobs at once on a single machine is a good deal
for us. My tests showed that it can increase the renderfarm throughput
by a factor of around 3 *for free*. But when I cram in too many tasks,
the gain factor starts to nose down and the risk of running out of
memory increases. The sweet spot appears to be somewhere between 4 and
6 tasks for most machines (and just one for some antique servers).
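
Concretely, the per-profile configs will look something like this
(MAX_JOBS is just a macro name I made up for this sketch):

# Most machines: at most 6 concurrent jobs.
# The +1 accounts for the partitionable slot itself.
MAX_JOBS = 6
use feature:PartitionableSlot
START = $(START) && ( TotalSlots <= $(MAX_JOBS) + 1 )

The antique servers simply override MAX_JOBS = 1 in their local config.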


Of course, we could put some effort into more fine-grained resource
management (with cgroups and TBB tuning, for example), but this is not
on our roadmap for now.

So, in this case, good enough is still good!


If you want/need more information, feel free to ask.


Enjoy your day,


-- 
Charles