Re: [HTCondor-users] negotiator "poor" performance issue



On 3/14/2014 5:42 AM, Pek Daniel wrote:
Hi,

Hi Daniel, some thoughts inline...

I assigned to the jobs I submitted randomized priorities, because
otherwise the negotiator would go through the schedds sequentially
(first, it runs all the jobs from schedd1, then from schedd2, etc).
I've also set:
USE_GLOBAL_JOB_PRIOS = true

Just FYI -  the negotiator communicates with schedds in user priority 
order regardless of schedd.  So if your jobs were submitted from 
different users (or with different accounting_groups), the negotiator 
would not go through all the schedds sequentially.

I don't use job arrays or clusters and I can't consider using them,
this is a constraint.

^^^ This is a bummer...

In this way, I could achieve ~10 jobs / sec negotiation (dispatching)
rate (not using priorities doesn't change this).

My questions:
- did anybody measure before a higher dispatch rate?
- is this 10 jobs / sec considered a "normal" or "good enough" value
in case of HTCondor?

Of course we are always working to improve the rate at which the
negotiator makes matches, and we have several ideas/plans on the horizon.
However, negotiator match rate for most real-world scenarios is not as
important as it may seem.  The reason is that negotiator match rate
has little to do with job start rate in HTCondor.  When the negotiator
makes a match, it hands it out to a schedd.  This schedd then claims the 
slot, and starts a job.  A key point is that when the job completes, the 
schedd will find another job from that same user that matches the slot 
and start it **without any involvement from the condor_negotiator**. 
The schedd will keep using and reusing a slot it has claimed for job 
after job until the match is broken.  With a default CLAIM_WORKLIFE (see 
http://goo.gl/VOg9nm ) of an hour there are not typically that many 
Unclaimed machines on any given negotiation cycle (i.e. machines that 
are not already assigned to a schedd) that the negotiator has to worry 
about.  In other words, the negotiator is not typically involved at job 
boundaries, but only when claims need to move from one user/schedd to 
another due to priorities...
Hope the above makes sense...
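
To put rough numbers on the claim-reuse point, here is a back-of-the-envelope sketch in Python. It is only a toy model, not anything HTCondor computes: it assumes every job runs for the same fixed time, that the same user always has another matching job queued, and that a claim accepts a new job only while its age is below CLAIM_WORKLIFE. The function names and the example pool size are made up for illustration.

```python
def jobs_per_claim(claim_worklife_s: int, job_duration_s: int) -> int:
    """Jobs a single claim can start under the toy model.

    Assumption: jobs run back to back, and a new job may start only
    while the claim is younger than the worklife (integer ceiling).
    """
    return (claim_worklife_s + job_duration_s - 1) // job_duration_s


def job_start_rate_ceiling(match_rate_per_s, claim_worklife_s,
                           job_duration_s, slots):
    """Upper bound on sustained job starts per second (toy model)."""
    # Each negotiator match yields one claim, which is reused for several jobs.
    limited_by_matches = match_rate_per_s * jobs_per_claim(
        claim_worklife_s, job_duration_s)
    # A slot cannot start jobs faster than it finishes them.
    limited_by_slots = slots / job_duration_s
    return min(limited_by_matches, limited_by_slots)


# Hypothetical pool: 10 matches/s, one-hour worklife, 60 s jobs, 6000 slots.
print(jobs_per_claim(3600, 60))                    # 60
print(job_start_rate_ceiling(10, 3600, 60, 6000))  # 100.0
```

In this made-up example the pool size, not the negotiator, caps the job start rate: 10 matches/s could feed up to 600 job starts/s, but 6000 slots running 60-second jobs can absorb only 100 per second.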

- can I do anything without touching the source to increase the
negotiation performance?

Tuning knobs like NEGOTIATOR_INFORM_STARTD could help, but not sure how 
much.  I guess you also need to think about how important/relevant of a 
metric negotiator dispatch rate is for your scenario.  Maybe sustained 
job completion rate makes more sense.  See
http://research.cs.wisc.edu/htcondor/CondorWeek2011/presentations/tannenba-roadmap.pdf
for a bunch of performance graphs starting around slide 18. For example,
tests back with v7.6.0 showed a negotiator matchmaking rate of 8 per
second (close to what you found), but because the schedd reuses matches,
the sustained job completion rate for just one schedd was 80 jobs/second.
And of course, you can scale job completion rate horizontally by adding
more schedds.

You may find the following paper of interest, even though it is getting 
a bit old:
Dan Bradley, Timothy St Clair, Matthew Farrellee, Ziliang Guo, Miron 
Livny, Igor Sfiligoi, and Todd Tannenbaum, "An update on the scalability 
limits of the Condor batch system", Journal of Physics: Conference 
Series, Vol. 331, No. 6, 2011
http://research.cs.wisc.edu/htcondor/doc/chep10_condor_scalability.pdf

regards,
Todd

p.s. Also be aware that the negotiator ClassAd ("condor_status -negotiator -l")
publishes a number of statistics related to matchmaking performance; see
http://goo.gl/BbIp9R . These are useful for graphing with condor_gangliad.
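
If you just want to eyeball those statistics without setting up graphing, a minimal Python sketch along these lines could pull them out of the long-form output. It assumes a single negotiator ad printed as "Attr = value" lines. The "LastNegotiationCycle" prefix and the two attribute names in the sample are my assumption about what the ad contains, so check them against your own "condor_status -negotiator -l" output.

```python
import subprocess


def parse_classad_text(text):
    """Parse 'Attr = value' lines into a dict of raw strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:  # skip blank lines and anything that is not 'Attr = value'
            ad[name.strip()] = value.strip()
    return ad


def cycle_stats(ad, prefix="LastNegotiationCycle"):
    """Keep the numeric attributes whose names start with the prefix."""
    stats = {}
    for name, value in ad.items():
        if name.startswith(prefix):
            try:
                stats[name] = float(value)
            except ValueError:
                pass  # strings, booleans, expressions: not numeric, skip
    return stats


def fetch_negotiator_ad():
    """Run the command from the p.s. above and parse its output."""
    out = subprocess.run(
        ["condor_status", "-negotiator", "-l"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_classad_text(out)


# Made-up sample text standing in for real output (attribute names assumed).
sample = (
    'Name = "negotiator@example"\n'
    "LastNegotiationCycleMatches0 = 42\n"
    "LastNegotiationCycleDuration0 = 5\n"
)
print(cycle_stats(parse_classad_text(sample)))
```

On a machine with HTCondor installed, cycle_stats(fetch_negotiator_ad()) would do the same thing against the live negotiator ad.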
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685