Hi Dan,
Thanks, GROUP_DYNAMIC_MACH_CONSTRAINT fixes my problem. We use glideins
with the monitoring slot enabled, so our pool always thinks we have
twice as many slots as we actually have usable. Setting
GROUP_DYNAMIC_MACH_CONSTRAINT = ( IS_MONITOR_VM =!= True )
makes it ignore them and should get things back to normal.
It still seems like <none> should be considered a "special" group that
is always handed slots last in the list of groups, instead of being
sorted by its starvation rate like the rest of the groups. The cluster
administrator set the quotas on the real, defined groups because they
care about those groups getting slots. The <none> group, which doesn't
actually exist and is made up on the fly, should be forced to take a
back seat to the defined groups and not be put before them even if its
starvation rate ranks it higher. It seems like this would fix a lot of
issues related to this.
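
As a rough illustration of the ordering proposed above, here is a
minimal Python sketch. It is not the negotiator's actual code: the
Group record, the starvation_ratio definition (usage divided by quota,
lower meaning more starved) and the sample quotas and usages are all
hypothetical.

```python
from dataclasses import dataclass

ROOT_GROUP = "<none>"  # name of the implicit root group, as given in the thread


@dataclass
class Group:
    name: str
    quota: float  # slots the group is entitled to (made-up numbers)
    usage: float  # slots the group currently holds (made-up numbers)


def starvation_ratio(group: Group) -> float:
    """Fraction of quota in use; lower means more starved (assumed definition)."""
    if group.quota <= 0:
        return float("inf")
    return group.usage / group.quota


def negotiation_order(groups):
    """Sort most-starved first, but always place the root group last."""
    # False sorts before True, so every defined group precedes "<none>".
    return sorted(groups, key=lambda g: (g.name == ROOT_GROUP, starvation_ratio(g)))


groups = [
    Group("group_a", quota=2000, usage=1500),
    Group(ROOT_GROUP, quota=5682, usage=100),
    Group("group_b", quota=2534, usage=500),
]

by_starvation_only = [g.name for g in sorted(groups, key=starvation_ratio)]
order = [g.name for g in negotiation_order(groups)]
print(by_starvation_only)
print(order)
```

With these sample numbers, sorting by starvation alone puts <none>
first, while the modified key keeps the defined groups ahead of it.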
Thanks for the quick response!!
joe
On 10/18/2011 04:16 PM, Dan Bradley wrote:
On 10/18/11 4:02 PM, Erik Erlandson wrote:
If you search for the first "Matched" line what you'll see is that the
jobs that were submitted without a group are now apparently in the
group "<none>" and that group actually has a quota somehow (the group
doesn't actually exist so it certainly doesn't have a quota). Jobs for
that "group" get run in front of the groups that haven't filled their
quota.
Hi Joe,
Accounting groups were enhanced to support fully generalized
Hierarchical Accounting Groups (HGQ), as of 7.5.6:
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1393
There is always a root node in the accounting group hierarchy, whose
name is "<none>", and any job that does not map to some other
accounting group will be assigned to <none>. This group always accepts
any surplus quota not used by other groups.
You may also be interested in:
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1926
(dev release only, not on current stable series 7.6)
Joe,
It sounds like in your case, the <none> group is being considered
before most other groups because it is more "starved", meaning it is
using a smaller fraction of its share of the pool compared to most
other groups. If I understand the new group quota system correctly,
group <none>'s share of the pool is determined by computing the share
of the pool for all the other groups and counting what is left. In your
pool there are 10216 slots. 5682 of these are being assigned to group
<none>.
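
The arithmetic behind those two figures can be checked with a short
Python sketch. It assumes the description above is right, namely that
<none> simply receives whatever the other groups' shares leave over;
the variable names are illustrative only.

```python
total_slots = 10216  # pool size reported in the thread
none_share = 5682    # slots assigned to <none>, as reported in the thread

# If <none> gets the remainder, the defined groups together account for the rest.
other_groups_share = total_slots - none_share
none_fraction = none_share / total_slots

print(other_groups_share)      # 4534
print(f"{none_fraction:.1%}")  # 55.6%
```

In other words, under that assumption the defined groups' shares add up
to 4534 slots, and <none> is entitled to a little over half of the pool.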
One thing that can cause trouble is if you have special slots that are
not available to all jobs. In this case, the size of the pool may be
effectively overestimated. The result is that dynamic quotas are too
big, and groups which are considered first may get too many slots,
while groups that follow will starve. GROUP_DYNAMIC_MACH_CONSTRAINT can
be used to attempt to work around this problem. So can
GROUP_QUOTA_ROUND_ROBIN_RATE.
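
A small Python sketch of why an overestimated pool inflates dynamic
quotas. It uses a simplified model in which a dynamic quota is a fixed
fraction of the counted slots. The 0.25 fraction is hypothetical, and
the assumption that half of the advertised slots are unusable (for
example, monitoring slots) is for illustration only.

```python
def dynamic_quota(fraction: float, pool_slots: int) -> float:
    """Slots a group is entitled to under a fractional (dynamic) quota."""
    return fraction * pool_slots


advertised_slots = 10216              # pool size reported in the thread
usable_slots = advertised_slots // 2  # assumption: half the slots cannot run jobs
fraction = 0.25                       # hypothetical dynamic quota for one group

quota_counted = dynamic_quota(fraction, advertised_slots)
quota_usable = dynamic_quota(fraction, usable_slots)

print(quota_counted)  # 2554.0
print(quota_usable)   # 1277.0
```

Under these assumptions the group's quota is twice what the usable
slots can support, so groups served early can take more than intended
and leave too little for the groups that follow.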
I haven't had a chance to consider your case carefully enough to make a
specific recommendation. If you continue to have trouble, I recommend
opening a help ticket with condor-admin@xxxxxxxxxxxx
--Dan
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/