At Marquette, our Condor pools have been growing and we seem to be at
a tipping point in terms of performance. We have recently configured
the job router on our primary cluster to route jobs to our other
pools across campus using Condor-C (flocking isn't really an option),
giving us over 1600 available slots.
Our current Condor 7.4.4 setup has the collector, negotiator, job
router and schedd all running on the head node (an 8-core machine
with 24 GB of RAM, 2 x 1 Gb/s networks, 1 x 20 Gb/s InfiniBand). When we
launch a few thousand jobs capable of being routed, the system is
fine for a while, but eventually the schedd becomes unresponsive and
the overall head node load skyrockets due to the number of running
shadow daemons.
Should we consider partitioning our Condor daemons onto different
nodes? What partitioning works best? Would a second schedd, to handle
the routed jobs, be helpful? What have others done and what seems to
work well?
Thanks.
Craig
--
Craig A. Struble, Ph.D. | Marquette University
Associate Professor of Computer Science | 369 Cudahy Hall
(414)288-3783 | (414)288-5472 (fax)
http://www.mscs.mu.edu/~cstruble | craig.struble@xxxxxxxxxxxxx