HTCondor Project List Archives




Re: [Condor-devel] Condor 7.6.1 timeline?




On May 19, 2011, at 3:03 PM, Erik Erlandson wrote:

On Thu, 2011-05-19 at 14:04 -0500, Brian Bockelman wrote:
What's the timeline for the 7.6.1 release (I believe Todd promised at Condor Week that development timelines would be posted to this list)?

I ask because we just got hit by a really nasty negotiator starvation issue that allowed one user group to starve everyone else locally:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2176

Hi Brian,

Declaring static quotas larger than the quota available to the parent (which is what the configuration in ticket #2176 does, as you correctly describe) is somewhat of a no-no, although Condor will do its best by rescaling them (with a warning).

Part of the issue is that quotas are by necessity assigned prior to the redistribution of surplus. The logic of mixed static and dynamic quotas requires that static quotas be serviced first, which is why you ended up with zero quota left for group cms.other. Quota surplus is then shared among groups with nonzero quotas in proportion to those quotas (with any remainder split equally among the zero-quota groups). Therefore cms.prod, with nonzero quota, gets all the surplus, and cms.other gets none.
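
The order of operations described above can be illustrated with a small sketch. This is a toy model only, not the negotiator's actual code: the function names, the 400-slot parent quota, and the 1600-slot surplus are made up for illustration, and every group is assumed to have unlimited demand.

```python
def assign_quotas(parent_quota, static, dynamic):
    """Toy model: static quotas are serviced first, dynamic groups split the rest.

    static  maps group name -> requested number of slots
    dynamic maps group name -> fraction of the leftover quota
    """
    total_static = sum(static.values())
    if total_static > parent_quota:
        # Static requests exceed what the parent has: rescale them to fit.
        quotas = {g: q * parent_quota / total_static for g, q in static.items()}
    else:
        quotas = {g: float(q) for g, q in static.items()}
    leftover = parent_quota - sum(quotas.values())
    for g, frac in dynamic.items():
        quotas[g] = leftover * frac
    return quotas


def share_surplus(quotas, surplus):
    """Split surplus in proportion to quota among groups with nonzero quota.

    With unlimited demand assumed, zero-quota groups only receive surplus
    (split equally) when no group has a nonzero quota.
    """
    if not quotas:
        return {}
    total = sum(quotas.values())
    if total > 0:
        return {g: surplus * q / total for g, q in quotas.items()}
    return {g: surplus / len(quotas) for g in quotas}


# Hypothetical numbers: the parent has 400 slots, cms.prod asks for 500.
quotas = assign_quotas(400, {"cms.prod": 500}, {"cms.other": 1.0})
extra = share_surplus(quotas, 1600)
print(quotas)
print(extra)
```

In this toy model the static request is rescaled to the parent's entire quota, the dynamic group is left with a quota of zero, and proportional sharing then hands the whole surplus to the static group.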

Yup - that's what I had diagnosed too - but isn't this incorrect or at least counter-intuitive?

Another way to state what happened is:
- If there's enough quota in the parent to satisfy the static group, the static group gets their quota.  Any leftover amount is given to dynamic groups.
- If the static group quotas can't be filled initially and there are surplus batch slots, then dynamic groups get nothing regardless of the amount of the surplus.

Another way to look at this is that surplus allocation should be a near-continuous function.  At some point, we can add a single slot to the pool and the allocation goes from "cms.prod gets all surplus" to "cms.prod obeys its quota and cms.other gets all the surplus".  Small changes to the input should not cause large swings in the resulting allocation.


Given the dynamic nature of slot availability on a Condor pool, I tend to recommend that static quotas not be used at all unless there are compelling reasons to use them. When they are used, they should ideally be declared small enough that a dip in the number of reporting slots does not cause a static quota to become larger than the quota available to its parent.


We have a compelling reason to mix dynamic and static quotas: of the CMS share, 500 slots in the cluster should go to cms.prod and the rest of the slots should be distributed to the rest of cms (if you want to see the full config, the pool is at red-condor.unl.edu:9168). Prior to hierarchical groups, we used to achieve this by putting two orders of magnitude between everyone's Condor priority.

In our case, the group with the static quota of 500 was able to get about 2000 cores, completely starving out the dynamic quotas.

Brian
