Hi Thomas,

we had a similar problem because FOO was claiming 1-core slots for very short-lived jobs. This may be especially noticeable if you did not have BAZ submitting jobs before.

Since FOO's requirements can be satisfied with both 1- and 8-core jobs, a 1-core job will inherit an 8-core slot every now and then and break it up. If the 1-core jobs of FOO are short-lived, they rapidly acquire and release those 1-core slots. At some point, a freed 1-core slot is given to BAZ, which then claims it for a regular duration. Since Condor cannot give the remaining resources to 8-core jobs, it hands them to FOO and BAZ again. Over time, the FOO jobs die so fast that they are gradually replaced with BAZ jobs.

As a result, 8-core slots rapidly degrade to long-lived 1-core slots for BAZ. In our case, FOO blew through the 1-core slots at 100 Hz, but once BAZ acquired one it kept it for hours. So even though FOO *got* considerably more resources, it did not *keep* them.

We ended up adding a resource limit for the different types of jobs. Basically we can limit the number of 1-core jobs (regardless of user) in the cluster with a pseudo-resource. That works a lot better than draining entire nodes. It does not fix the problem, but it protects BAR and does not penalise BAZ.

If FOO really is sending short-lived jobs, it is probably due to empty pilots. Just ask them to switch to ACT.

Cheers,
Max

> On 23.11.2017, at 10:54, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
>
> Hi all,
>
> we currently have an issue with our dynamic group quotas (for grid jobs
> via ARC CEs), which I do not understand.
> Of our three main users, user BAZ suddenly got/occupied about half of
> our slots while having a significantly smaller nominal share than users
> FOO and BAR.
> To curb BAZ's enthusiasm, we changed, for testing, the dynamic quotas on
> the negotiator to
> GROUP_QUOTA_DYNAMIC_group_FOO = 0.933
> GROUP_QUOTA_DYNAMIC_group_BAR = 0.938
> GROUP_QUOTA_DYNAMIC_group_BAZ = 0.01
> ...
> GROUP_QUOTA_DYNAMIC_group_OTHER = 0.01
>
> without much of an effect.
>
> One difference between the users is that BAZ runs only single-slot
> jobs, FOO roughly 1:1 single- and 8-slot jobs (so a slot ratio of 1:8)
> and BAR only 8-slot jobs - so one idea might be that the single- vs.
> multi-core use pattern somehow distorts the share distribution (but then
> it worked fine before for the ratio between FOO and BAR).
>
> As far as I can see, the ifThenElse nesting of the X509 DNs into the
> group names should be OK [1] (at least the brackets match). Also, BAZ's
> DN should match and, if not, should be covered by the OTHER group as
> well.
>
> So I wonder why BAZ still gets ~50% of the slots. Maybe somebody has
> an idea for me?
>
> Cheers and thanks,
> Thomas
>
>
> [1]
> DESYAcctGroup = ifThenElse(x509UserProxyVOName =?= "FOO","group_FOO", \
>     ifThenElse(x509UserProxyVOName =?= "BAR","group_BAR", \
>     ifThenElse(x509UserProxyVOName =?= "BAZ","group_BAZ", \
>     ifThenElse(x509UserProxyVOName =?= "MINOR1","group_MINOR1", \
>     ..., \
>     "group_OTHER" ))))))))))))
>
> [2]
> DESYAcctSubGroup = ifThenElse(regexp("desyplt",Owner), "desyplt", \
>     ifThenElse(regexp("desyprd",Owner), "desyprd", \
>     ifThenElse(regexp("desysgm",Owner), "desysgm", \
>     ifThenElse(regexp("desyusr",Owner), "desyusr", \
>     ifThenElse(regexp("FOO",Owner) && RequestCpus > 1, "FOO_multicore", \
>     ifThenElse(regexp("BAR",Owner) && RequestCpus > 1, "BAR_multicore", \
>     ifThenElse(regexp("BAZ",Owner) && RequestCpus > 1, "BAZ_multicore", \
>     "other" )))))))
>
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

Karlsruhe Institute of Technology (KIT)
SCC/SDM

Dr. Max Fischer
ALICE Collaboration Representative at GridKa

Hermann-von-Helmholtz-Platz 1
Gebäude 449
76344 Eggenstein-Leopoldshafen, Germany

Phone: +49 721 608-28328
E-mail: max.fischer@xxxxxxx
Web: www.scc.kit.edu

Registered office: Kaiserstraße 12, 76131 Karlsruhe, Germany

KIT – The Research University in the Helmholtz Association
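
Illustration of the degradation described in the reply: a minimal toy model, not a description of how the HTCondor negotiator actually matches jobs. It assumes one former 8-core slot already split into 1-core slots, that every freed core is rematched immediately, and that BAZ wins a freed core with a small fixed probability (here 0.01, borrowed from the BAZ quota in the quoted mail purely for illustration). The function name, the lifetimes, and the time unit (one step = one FOO job lifetime) are all hypothetical.

```python
import random


def simulate(steps=50_000, cores=8, foo_lifetime=1, baz_lifetime=10_000,
             p_baz=0.01, seed=0):
    """Toy model of one 8-core slot that has been split into 1-core slots.

    Assumptions (not taken from HTCondor itself): a freed core is rematched
    at once, going to BAZ with probability p_baz and to FOO otherwise.
    Returns (handed, held): how many times each group was handed a core,
    and how many core-steps each group actually occupied.
    """
    rng = random.Random(seed)
    # Each core is tracked as (owner, remaining lifetime in steps).
    slots = [("FOO", foo_lifetime) for _ in range(cores)]
    handed = {"FOO": cores, "BAZ": 0}  # FOO starts out holding every core
    held = {"FOO": 0, "BAZ": 0}

    for _ in range(steps):
        for i in range(cores):
            owner, remaining = slots[i]
            held[owner] += 1           # this core is occupied for one step
            remaining -= 1
            if remaining == 0:         # job finished: rematch the freed core
                owner = "BAZ" if rng.random() < p_baz else "FOO"
                handed[owner] += 1
                remaining = baz_lifetime if owner == "BAZ" else foo_lifetime
            slots[i] = (owner, remaining)
    return handed, held


if __name__ == "__main__":
    handed, held = simulate()
    total = sum(held.values())
    print("cores handed out:", handed)
    print("share of core-time held:",
          {group: round(count / total, 3) for group, count in held.items()})
```

Under these assumptions FOO is handed far more cores than BAZ, yet BAZ occupies most of the core-time, which is the "got but did not keep" effect described above.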