Re: [Condor-users] Deployment Recommendations
- Date: Sat, 25 Apr 2009 09:21:25 -0500
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-users] Deployment Recommendations
James Osborne wrote:
> Dear All
>
> My name is James Osborne and I am the Condor Project Manager at Cardiff
> University in the UK. Now that summer is approaching, and I have some
> nice new virtualization infrastructure coming on stream, I am in the
> process of virtualizing our Condor infrastructure. I already have a
> virtual submit machine which works very well with surprisingly low
> overhead (I couldn't push it harder than about 4% CPU usage with thousands of
> 15-minute jobs in the queue). The virtualization infrastructure will soon
> be a load-balanced pair of 3GHz dual-socket quad-core machines with 32GB
> of RAM each with multiple redundant connections into FC storage.
>
> I seem to remember hearing that a good 'rule of thumb' was to have no more
> than 2000 execute nodes reporting to a single central manager.
>
> 1) Is that still the case?
>
> 2) Has anybody pushed a single central manager to about 9000 execute nodes?
>
> 3) Does it make more sense to deploy 4-5 central managers instead and use
> flocking?
>
> 4) If so, would you, for example, use one central manager per core network
> router even if that increased the number of managers to 8 or more?
>
> 5) Has anybody tried to flock jobs to 8 or more central managers?
>
> I can already see how I can set execute nodes to report to different
> central managers in my Condor distribution scripts.
>
> I look forward to hearing from those of you with big pools...
>
> Thanks in advance. Best regards
>
> James
That may depend on whether you are using strong authentication, though
the 2000 number seems pretty out of date. You should have a look at the
talk by Igor and Dan from Condor Week 2009 about some of their scaling
work.
http://www.cs.wisc.edu/condor/CondorWeek2009/wednesday.html
They used a hierarchy of collectors to decrease the load on the central
manager, and new developments in 7.3.1 allow for a single collector to
handle all their load itself.
Best,
matt
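
For readers weighing the multi-pool option raised in the question, here is a minimal sketch of the arithmetic and of what a distribution script might emit. It is an illustration under stated assumptions, not a recommended deployment: the hostnames and function names are hypothetical, and the 2000-node cap is only the rule of thumb quoted above, which the reply calls out of date. CONDOR_HOST and FLOCK_TO are Condor configuration macros; each remote pool would also need a matching FLOCK_FROM entry and suitable authorization settings, which this sketch does not cover.

```python
import math


def pools_needed(num_nodes, max_per_manager):
    """Number of central managers needed if each one is capped at max_per_manager nodes."""
    return math.ceil(num_nodes / max_per_manager)


def assign_nodes(nodes, managers):
    """Spread execute nodes across central managers in round-robin order."""
    assignment = {manager: [] for manager in managers}
    for index, node in enumerate(nodes):
        assignment[managers[index % len(managers)]].append(node)
    return assignment


def execute_node_config(manager):
    """Config fragment for an execute node reporting to one central manager."""
    return f"CONDOR_HOST = {manager}\n"


def submit_node_config(home_manager, managers):
    """Config fragment for a submit node that flocks to every other pool."""
    others = [m for m in managers if m != home_manager]
    return (
        f"CONDOR_HOST = {home_manager}\n"
        f"FLOCK_TO = {', '.join(others)}\n"
    )


if __name__ == "__main__":
    # Assumption: the 2000-node rule of thumb from the question.
    count = pools_needed(9000, 2000)
    print(count)  # 5

    # Hypothetical central manager hostnames.
    managers = [f"cm{i}.example.org" for i in range(1, count + 1)]
    print(submit_node_config(managers[0], managers))
```

With a 2000-node cap, 9000 execute nodes come out at five pools, which matches the "4-5 central managers" estimate in the question. A higher per-manager limit, as the reply suggests is realistic, shrinks that number accordingly.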