On 9/28/06, Zeeuw, L.V. de <L.V.de.Zeeuw@xxxxxx> wrote:
LS,

We are facing more or less the same challenge. We have a large pool (>1500 XP execution nodes) and one central machine from which we submit jobs. When we submit, say, 1000 small jobs that should each run for about 30 seconds, it takes 45 minutes for the results to come back to the submitting host from the hundreds of available execution hosts. So any pointers on optimizing for small jobs would be appreciated by us as well.
Set the small jobs aside for now (though they don't help matters) - the setup you have is flat-out untenable for throughput with Condor*. You have built a vast farm behind a single bottleneck and central point of failure. This is a bad idea.

If you wish to have a central submit point, that's fine - just farm the actual submit out to one of several schedds (rough sketch at the bottom of this mail). I appreciate this rather blasé statement is harder to implement than it sounds, but trust me on this - you will never get decent performance out of a 1500-node farm with one schedd.

If you cannot make your jobs more 'chunky' (the second sketch below shows the idea), there is the alternative of using something like Technion's Condor enhancements to reduce the cost associated with the matchmaking and claim process. That effectively does the chunking for you, but it will never be as good at it as you can be (especially if you avoid re-transferring input data).

* If you meant you have a central submit machine with multiple schedd daemons running on it, then it's less of a problem, but still bad.

Matt
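P.S. To make the schedd-farming concrete, here is a rough, untested sketch (in Python, wrapping condor_submit) of what I mean. The schedd names are made up - substitute whatever condor_status -schedd reports in your pool.

    #!/usr/bin/env python
    # Round-robin submissions across several schedds so that no single
    # schedd becomes the bottleneck. condor_submit -name sends the job
    # to the named schedd.
    import itertools
    import subprocess

    SCHEDDS = ["schedd1@submit1.example.com",
               "schedd2@submit2.example.com",
               "schedd3@submit3.example.com"]

    def submit_round_robin(submit_files):
        rotation = itertools.cycle(SCHEDDS)
        for submit_file in submit_files:
            subprocess.check_call(
                ["condor_submit", "-name", next(rotation), submit_file])

    if __name__ == "__main__":
        submit_round_robin(["job_%03d.sub" % i for i in range(9)])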
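And a sketch of what I mean by chunking - have each Condor job chew through a batch of your 30-second tasks, so the match/claim overhead is paid once per batch rather than once per task. chunk.py and run_task are stand-ins for your real executable and work, and this assumes Python is available on the execute nodes.

    #!/usr/bin/env python
    # chunk.py - run a contiguous batch of short tasks in one claim.
    # The submit file would look something like:
    #   executable = chunk.py
    #   arguments  = $(Process) 50
    #   queue 20
    # i.e. 20 jobs covering 1000 tasks, 50 tasks per claim.
    import sys

    def run_task(task_id):
        print("task %d" % task_id)  # placeholder for the real work

    if __name__ == "__main__":
        chunk, size = int(sys.argv[1]), int(sys.argv[2])
        for task_id in range(chunk * size, (chunk + 1) * size):
            run_task(task_id)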