We have a user who is submitting a lot of jobs to our Condor system. He’s hitting some limits and I want to work out how we can help.

He would like to have 2000-3000 jobs running simultaneously. We have enough nodes to cope with this, but actually submitting the jobs is causing problems. Essentially, each job runs the same program with slightly different parameters, so he has a submit file with (e.g.) queue 500 at the end. He can submit about 500 jobs at once and everything works, but if he tries to submit more than that, his machine grinds to a halt. Presumably the overhead of communicating with all the nodes is too much (the machine has 16GB RAM and a reasonably decent CPU).

If I give him (say) another 6 machines set up as submit nodes, will this work, or will we hit other bottlenecks? (Or is this too vague a question?)

Thanks,
Steve