HTCondor Project List Archives



Re: [Condor-devel] Feedback: Quill stress testing



Ian Chesal <ICHESAL@xxxxxxxxxx> writes:

> If I submit 300 1000-job clusters and then condor_rm all 300
> clusters with one command line the database grinds to a halt. I can
> see the quill daemon popping to the top of the cpu usage list on the
> scheduler very frequently...

Just to add another observation of the same phenomenon.

A user was submitting clusters of ~1000 jobs and then removing them,
repeating this process every few minutes (don't ask me why).

The first two clusters worked fine with the max difference between
"EnteredCurrentStatus" and "JobFinishedHookDone" being < 10 seconds.
My assumption is that this corresponds to the time between the remove
happening and the job actually leaving the queue.
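For anyone wanting to repeat the measurement, here is a rough sketch of
the per-cluster calculation. It assumes the two attributes have already
been pulled out of the job ads as integer epoch seconds; the function
name and the sample values are made up for illustration and are not
taken from the real queue.

```python
from collections import defaultdict


def removal_latency_by_cluster(job_ads):
    """Return {ClusterId: (min_seconds, max_seconds)} for the gap between
    EnteredCurrentStatus and JobFinishedHookDone in each cluster."""
    per_cluster = defaultdict(list)
    for ad in job_ads:
        # Assumed interpretation: time from the remove to the job leaving the queue.
        delta = ad["JobFinishedHookDone"] - ad["EnteredCurrentStatus"]
        per_cluster[ad["ClusterId"]].append(delta)
    return {cid: (min(deltas), max(deltas)) for cid, deltas in per_cluster.items()}


# Hypothetical job ads (illustrative values only, not real measurements).
sample_ads = [
    {"ClusterId": 1, "ProcId": 0, "EnteredCurrentStatus": 1000, "JobFinishedHookDone": 1004},
    {"ClusterId": 1, "ProcId": 1, "EnteredCurrentStatus": 1000, "JobFinishedHookDone": 1009},
    {"ClusterId": 6, "ProcId": 0, "EnteredCurrentStatus": 5000, "JobFinishedHookDone": 5003},
    {"ClusterId": 6, "ProcId": 1, "EnteredCurrentStatus": 5000, "JobFinishedHookDone": 7200},
]

latency = removal_latency_by_cluster(sample_ads)
for cluster_id, (lo, hi) in sorted(latency.items()):
    print(f"cluster {cluster_id}: min {lo}s, max {hi}s")
```

Keeping both the minimum and the maximum per cluster matters here,
since the gap between them is what changes character later on.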

The next three clusters were still OK; the max difference had
increased, but was still < 40 seconds.

On the sixth cluster, the max difference suddenly increased to 2200
seconds, with values ranging all the way from 3 to 2200 seconds and
with no clear pattern relative to the ProcId (although there did seem
to be patterns every 10 and every 100 processes).

From this point on, the schedd and postmaster processes were entirely
CPU bound, but jobs could still be submitted and removed.

There were 14 more clusters, but the max difference never increased
much beyond 2300 seconds and actually slowly decreased to 1500 seconds.
Also, the minimum difference was now usually within 100 seconds of the
maximum difference, meaning it was taking at least 35 minutes (slowly
lowering to 20 minutes) for any job to actually leave the queue.

After the final cluster it took half an hour for things to clear up.

This is with Condor 6.7.12.

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison