Hi Thomas,
we had a similar problem because FOO was deploying very short-lived 1-core slots. This may be especially noticeable if you did not have BAZ submitting jobs before.
Since FOO's demand can be satisfied with both 1-core and 8-core jobs, every now and then a 1-core job of FOO inherits part of a freed 8-core slot.
If the 1-core jobs of FOO are short-lived, they rapidly acquire and release those 1-core slots.
At some point, a freed 1-core slot is given to BAZ, which then claims it for a regular duration.
Since Condor cannot give the remaining cores of that slot to 8-core jobs, it hands them to 1-core jobs of FOO and BAZ again.
Because the FOO jobs finish so quickly, they are gradually replaced by BAZ jobs over time.
As a result, 8-core slots rapidly degrade into long-lived 1-core slots for BAZ.
In our case, FOO blew through the 1-core slots at 100 Hz, but once BAZ acquired one it kept it for hours. So even though FOO *got* considerably more resources, it did not *keep* them.
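To illustrate the effect, here is a rough toy model in Python. It is only a sketch and not how the negotiator actually works: the slot count, the hold times and the rule that every tenth freed slot goes to BAZ are all made-up assumptions, and `simulate` is a hypothetical helper.

```python
def simulate(slots=80, steps=100, baz_every=10, foo_hold=1, baz_hold=1000):
    """Hand out freed 1-core slots; every `baz_every`-th grant goes to BAZ.

    All numbers are illustrative assumptions, not measured values.
    """
    owner = ["FOO"] * slots          # assumed start: FOO holds every slot
    remaining = [foo_hold] * slots   # steps until each slot is released
    grants = {"FOO": 0, "BAZ": 0}    # how often each user *got* a slot
    baz_held = []                    # how many slots BAZ *keeps* per step
    handed = 0
    for _ in range(steps):
        for i in range(slots):
            remaining[i] -= 1
            if remaining[i] > 0:
                continue             # job still running, slot stays claimed
            # Job finished: the slot is released and handed out again.
            handed += 1
            user = "BAZ" if handed % baz_every == 0 else "FOO"
            owner[i] = user
            remaining[i] = baz_hold if user == "BAZ" else foo_hold
            grants[user] += 1
        baz_held.append(owner.count("BAZ"))
    return grants, baz_held


grants, baz_held = simulate()
print("grants:", grants)
print("slots held by BAZ (first step, last step):", baz_held[0], baz_held[-1])
```

In this toy setup FOO receives nine out of ten grants, yet BAZ ends up holding every slot, simply because it never gives one back within the simulated time.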
We ended up adding a resource limit for the different types of jobs. Basically we can limit the number of 1-core jobs (regardless of user) in the cluster with a pseudo-resource. That works a lot better than draining entire nodes. It does not fix the problem, but protects BAR and does not penalise BAZ.
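As a rough sketch of what such a pseudo-resource does, here is the idea as a plain counter in Python. This is not HTCondor configuration; the class `SingleCoreLimit` and its methods are hypothetical names used only for illustration, and the limit of 2 is arbitrary.

```python
class SingleCoreLimit:
    """Cluster-wide counter acting as a pseudo-resource for 1-core jobs."""

    def __init__(self, limit):
        self.limit = limit   # maximum number of concurrent 1-core jobs
        self.in_use = 0      # 1-core jobs currently running, any user

    def try_start(self, request_cpus):
        """Return True if a job with this core count may start."""
        if request_cpus > 1:
            return True      # multi-core jobs do not consume the pseudo-resource
        if self.in_use >= self.limit:
            return False     # limit reached: the 1-core job has to wait
        self.in_use += 1
        return True

    def finish(self, request_cpus):
        """Give the pseudo-resource back when a 1-core job ends."""
        if request_cpus == 1 and self.in_use > 0:
            self.in_use -= 1


limit = SingleCoreLimit(limit=2)
started = [limit.try_start(cpus) for cpus in (1, 1, 1, 8)]
print(started)
```

The third 1-core job is refused no matter which user it belongs to, while the 8-core job is unaffected. That is why this approach protects multi-core users without singling out one single-core user.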
If FOO really is sending short-lived jobs, it is probably due to empty pilots. Just ask them to switch to ACT.
Cheers,
Max
> Am 23.11.2017 um 10:54 schrieb Thomas Hartmann <thomas.hartmann@xxxxxxx>:
>
> Hi all,
>
> we currently have an issue with our dynamic group quotas (for grid jobs
> via ARC CEs) that I do not understand.
> Of our three main users, BAZ suddenly occupied about half of our slots
> while having a significantly smaller nominal share than users FOO and
> BAR.
> To curb BAZ's enthusiasm, we changed the dynamic quotas on the
> negotiator for testing to
> GROUP_QUOTA_DYNAMIC_group_FOO = 0.933
> GROUP_QUOTA_DYNAMIC_group_BAR = 0.938
> GROUP_QUOTA_DYNAMIC_group_BAZ = 0.01
> ...
> GROUP_QUOTA_DYNAMIC_group_OTHER = 0.01
>
> without much of an effect.
>
> One difference between the users is that BAZ runs only single-slot
> jobs, FOO roughly 1:1 single- and 8-slot jobs (so a slot ratio of 1:8),
> and BAR only 8-slot jobs. So one idea is that the single- vs. multi-core
> usage pattern somehow distorts the share distribution (but then it worked
> fine before for the ratio between FOO and BAR).
>
> As far as I can see, the ifThenElse nesting of the X509 DNs into the group
> names should be OK [1] (at least the brackets match). BAZ's DN should
> also match and, if not, should be covered by the OTHER group anyway.
>
> So, I wonder, why BAZ still gets ~50% of the slots? Maybe somebody has
> an idea for me?
>
> Cheers and thanks,
> Thomas
>
>
> [1]
> DESYAcctGroup = ifThenElse(x509UserProxyVOName =?= "FOO","group_FOO", \
> ifThenElse(x509UserProxyVOName =?= "BAR","group_BAR", \
> ifThenElse(x509UserProxyVOName =?= "BAZ","group_BAZ", \
> ifThenElse(x509UserProxyVOName =?= "MINOR1","group_MINOR1", \
> ..., \
> "group_OTHER" ))))))))))))
>
> [2]
> DESYAcctSubGroup = ifThenElse(regexp("desyplt",Owner), "desyplt", \
> ifThenElse(regexp("desyprd",Owner), "desyprd", \
> ifThenElse(regexp("desysgm",Owner), "desysgm", \
> ifThenElse(regexp("desyusr",Owner), "desyusr", \
> ifThenElse(regexp("FOO",Owner) && RequestCpus > 1, "FOO_multicore", \
> ifThenElse(regexp("BAR",Owner) && RequestCpus > 1, "BAR_multicore", \
> ifThenElse(regexp("BAZ",Owner) && RequestCpus > 1, "BAZ_multicore", \
> "other" )))))))
Karlsruhe Institute of Technology (KIT)
SCC/SDM
Dr. Max Fischer
ALICE Collaboration Representative at GridKa
Hermann-von-Helmholtz-Platz 1
Gebäude 449
76344 Eggenstein-Leopoldshafen, Germany
Phone: +49 721 608-28328
E-mail: max.fischer@xxxxxxx
Web: www.scc.kit.edu
Registered office:
Kaiserstraße 12, 76131 Karlsruhe, Germany
KIT – The Research University in the Helmholtz Association