My name is James Osborne and I am the
Condor Project Manager at Cardiff University in the UK. Now that summer
is approaching, and I have some nice new virtualization infrastructure
coming on stream, I am in the process of virtualizing our Condor infrastructure.
I already have a virtual submit machine which works very well with
surprisingly low overhead (I couldn't push it harder than about 4% CPU
usage with thousands of 15-minute jobs in the queue). The virtualization
infrastructure will soon be a load-balanced pair of 3GHz dual-socket quad-core
machines, each with 32GB of RAM and multiple redundant connections into
FC storage.
I seem to remember hearing that a good
'rule of thumb' was to have no more than 2000 execute nodes reporting
to a single central manager.
1) Is that still the case?
2) Has anybody pushed a single central
manager to about 9000 execute nodes?
3) Does it make more sense to deploy
4-5 central managers instead and use flocking?
4) If so, would you, for example, use
one central manager per core network router even if that increased the
number of managers to 8 or more?
5) Has anybody tried to flock jobs to
8 or more central managers?
I can already see how I can set execute
nodes to report to different central managers in my Condor distribution
scripts.
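To make that concrete, here is a rough sketch of the kind of thing a distribution script could do, assuming one central manager per core-router subnet. The subnets, host names and helper functions are all made up for illustration; only the CONDOR_HOST and FLOCK_TO settings are real Condor configuration macros, and the central managers would presumably also need to accept the submit machine (e.g. via FLOCK_FROM).

```python
import ipaddress

# Hypothetical mapping from core-router subnet to central manager host.
# These subnets and host names are placeholders, not a real deployment.
SUBNET_TO_MANAGER = {
    "10.1.0.0/16": "cm1.example.org",
    "10.2.0.0/16": "cm2.example.org",
}

# Fallback manager for hosts that match no listed subnet (an assumption).
DEFAULT_MANAGER = "cm1.example.org"


def manager_for(ip):
    """Return the central manager responsible for an execute node's IP."""
    addr = ipaddress.ip_address(ip)
    for subnet, manager in SUBNET_TO_MANAGER.items():
        if addr in ipaddress.ip_network(subnet):
            return manager
    return DEFAULT_MANAGER


def execute_node_config(ip):
    """Config fragment pointing an execute node at its central manager."""
    return f"CONDOR_HOST = {manager_for(ip)}\n"


def submit_node_flock_config():
    """Config fragment listing every central manager the submit machine may flock to."""
    managers = sorted(set(SUBNET_TO_MANAGER.values()))
    return "FLOCK_TO = " + ", ".join(managers) + "\n"


if __name__ == "__main__":
    print(execute_node_config("10.2.3.4"), end="")
    print(submit_node_flock_config(), end="")
```

With 8 or more managers the FLOCK_TO list on the submit machine simply grows by one entry per manager, which is really what question 5 is about.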
I look forward to hearing from those
of you with big pools...