HTCondor Project List Archives




[Condor-devel] Feedback: Quill stress testing




Can someone feed this back to Ameet Kini for me? Thanks.

We've been putting some stress on our test system to see where it breaks and I thought you might want to hear about the results. Our current system design is:

        - one centralized schedd with a corresponding quill daemon on a dual-processor, 3 GHz Xeon machine with a gigabit fiber network connection, running Fedora Core 3.

        - one negotiator/collector machine of moderate stature

        - 11 startds on various Windows machines

We've been submitting large numbers of jobs to see how the system performs. Submission is generally pretty good: our clusters have 1000 jobs each, and submitting about 300 of them in series shows no serious lag in the system. The DB is usually no more than 60 seconds behind. It's when we remove large numbers of jobs that the quill daemon really doesn't do well.
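To make the submission load concrete, here is a minimal sketch of how one such cluster could be described. It is an illustration only, not our actual test harness: the executable name `stress_job.exe` and the log file name are hypothetical placeholders, and the 300 x 1000 figures are the ones quoted above.

```python
# Sketch only: builds the text of an HTCondor submit description file
# that queues one cluster of N jobs. The executable and log names are
# hypothetical placeholders, not taken from the real test setup.

CLUSTERS = 300            # clusters submitted in series (from the text above)
JOBS_PER_CLUSTER = 1000   # jobs per cluster (from the text above)


def make_submit_description(executable, jobs_per_cluster=JOBS_PER_CLUSTER):
    """Return submit-file text that queues `jobs_per_cluster` jobs."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "log = stress.log",
        f"queue {jobs_per_cluster}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_submit_description("stress_job.exe"))
    print(f"total jobs across {CLUSTERS} clusters: {CLUSTERS * JOBS_PER_CLUSTER}")
```

Passing the generated file to condor_submit 300 times in a row would put 300,000 jobs in the queue.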

If I submit 300 1000-job clusters and then condor_rm all 300 clusters with one command line, the database grinds to a halt. I can see the quill daemon popping to the top of the CPU usage list on the scheduler very frequently. I can still submit jobs and the schedd can still get CPU, but the quill daemon is so tied up with managing the DB for the massive job removal that, to end users running condor_q, the system looks frozen (new jobs don't show up, finishing jobs aren't registering as complete, and the removed jobs disappear from the queue list very slowly). I end up stopping the daemon and resetting the system after waiting about 20 minutes for it to process the massive removal.
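For anyone trying to reproduce this, the removal could be issued roughly as sketched below. condor_rm accepts a list of cluster ids as arguments; the starting cluster id of 1 is an assumption made for the example, and the helper name is hypothetical.

```python
# Sketch only: builds a single condor_rm command line that names every
# cluster to remove. The cluster id range used below is an assumption.

def build_condor_rm_command(cluster_ids):
    """Return an argv list of the form ['condor_rm', '1', '2', ...]."""
    ids = [str(int(c)) for c in cluster_ids]
    if not ids:
        raise ValueError("need at least one cluster id")
    return ["condor_rm"] + ids


if __name__ == "__main__":
    # 300 clusters, assumed here to be numbered 1..300.
    cmd = build_condor_rm_command(range(1, 301))
    print(" ".join(cmd))
    # To actually run it on a submit machine:
    #   import subprocess; subprocess.run(cmd, check=True)
```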

As for DB maintenance, we've found that frequently re-indexing the tables improves performance considerably. We have added a function to our quill DB called manage_db() that does a reindex on all the tables. We're not vacuuming since quill should be doing this automatically. How often does quill reindex? You might want to consider a more frequent reindex interval. With 20k+ jobs moving in and out of our system in a day during our stress tests, we found reindexing every couple of hours an absolute necessity; otherwise performance suffered considerably. At the very least, could you add a PGSQL function to quill to do re-indexing on demand? It would make adding additional schedds to the system easier -- right now we have to let quill set up the database and then add a bunch of additional stuff to the schema like this function.
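As a rough illustration of what an on-demand reindex amounts to, the sketch below generates one PostgreSQL REINDEX TABLE statement per table. It is not the body of manage_db(), which isn't reproduced here; the table names are hypothetical placeholders rather than the real quill schema, and the 2-hour interval is one assumed reading of "every couple of hours".

```python
# Sketch only: emits REINDEX TABLE statements for a list of tables.
# The table names below are placeholders, not the actual quill schema.

def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def reindex_statements(tables):
    """Return one REINDEX TABLE statement per table name."""
    return [f"REINDEX TABLE {quote_ident(t)};" for t in tables]


def jobs_between_reindexes(jobs_per_day, interval_hours):
    """Approximate number of jobs passing through between two reindexes."""
    return round(jobs_per_day * interval_hours / 24)


if __name__ == "__main__":
    for stmt in reindex_statements(["example_jobs", "example_history"]):
        print(stmt)
    # 20k jobs a day with an assumed 2-hour interval:
    print(jobs_between_reindexes(20000, 2))  # prints 1667
```

The statements could be fed to psql or run from a cron job. At 20k jobs a day, a 2-hour interval works out to roughly 1,700 jobs between reindexes.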

If you do increase the re-indexing frequency, please let us know and we'll drop our side script that forces re-indexing. It'd be one less thing for us to worry about.

Thanks!

- Ian

--

Ian R. Chesal <ichesal@xxxxxxxxxx>

Senior Software Engineer

Altera Corporation

Toronto Technology Center

Tel: (416) 926-8300
