
Re: [Condor-users] Deployment Recommendations

Date: Sat, 25 Apr 2009 09:21:25 -0500
From: Matthew Farrellee <matt@xxxxxxxxxx>
Subject: Re: [Condor-users] Deployment Recommendations

James Osborne wrote:
> Dear All
> 
> My name is James Osborne and I am the Condor Project Manager at Cardiff 
> University in the UK.  Now that summer is approaching, and I have some 
> nice new virtualization infrastructure coming on stream, I am in the 
> process of virtualizing our Condor infrastructure.  I already have a 
> virtual submit machine which works very well with surprisingly low 
> overhead (I couldn't push it harder than about 4% cpu usage with 000s of 
> 15 minute jobs in the queue).  The virtualization infrastructure will soon 
> be a load-balanced pair of 3GHz dual-socket quad-core machines with 32GB 
> of RAM each with multiple redundant connections into FC storage.
> 
> I seem to remember hearing that a good 'rule of thumb' was to have no more 
> than 2000 execute nodes reporting to a single central manager.
> 
> 1) Is that still the case ? 
> 
> 2) Has anybody pushed a single central manager to about 9000 execute nodes 
> ?
> 
> 3) Does it make more sense to deploy 4-5 central managers instead and use 
> flocking ?
> 
> 4) If so, would you for example use one central manager per core network 
> router even if that increased the number of managers to 8 or more ?
> 
> 5) Has anybody tried to flock jobs to 8 or more central managers ?
> 
> I can already see how I can set execute nodes to report to different 
> central managers in my Condor distribution scripts. 
> 
> I look forward to hearing from those of you with big pools...
> 
> Thanks in advance.  Best regards
> 
> James

That probably depends on whether you are using strong authentication,
though the 2000 figure seems pretty out of date. You should have a look
at the talk by Igor and Dan from Condor Week 2009 about some of their
scaling work.

	http://www.cs.wisc.edu/condor/CondorWeek2009/wednesday.html

They used a hierarchy of collectors to decrease the load on the central
manager, and new developments in 7.3.1 allow for a single collector to
handle all their load itself.
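
If you do end up splitting execute nodes across several central managers
from your distribution scripts, the assignment could look roughly like the
sketch below. This is only an illustration: the host names and both helper
functions are made up, the 2000-node cap is the old rule of thumb from the
question rather than a measured limit, and CONDOR_HOST is the only real
Condor setting it touches.

```python
import hashlib
import math
from typing import Sequence


def managers_needed(total_nodes: int, nodes_per_manager: int) -> int:
    """Smallest number of central managers that keeps each one under the cap."""
    return math.ceil(total_nodes / nodes_per_manager)


def manager_for(hostname: str, managers: Sequence[str]) -> str:
    """Pick a central manager for a node; the choice is stable across runs."""
    # Hash the hostname so the same node always reports to the same manager.
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(managers)
    return managers[index]


def config_line(hostname: str, managers: Sequence[str]) -> str:
    """Local config line a distribution script could write for this node."""
    return f"CONDOR_HOST = {manager_for(hostname, managers)}"


if __name__ == "__main__":
    # Hypothetical numbers taken from the question: 9000 nodes, cap of 2000.
    count = managers_needed(9000, 2000)
    # Hypothetical manager names; substitute your own hosts.
    managers = [f"cm{i}.example.org" for i in range(1, count + 1)]
    print(count)
    print(config_line("node0042.example.org", managers))
```

With the old 2000-node figure this comes out at five managers for 9000
nodes, which is where the 4-5 in the question comes from; a higher cap
shrinks that number. The submit machine would then need the other pools
listed in FLOCK_TO for flocking to reach them.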

Best,


matt
