Re: [HTCondor-users] group quotas - Nagios tests with "Unspecified gridmanager error"
- Date: Wed, 5 Mar 2014 16:32:25 -0600
- From: Jaime Frey <jfrey@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] group quotas - Nagios tests with "Unspecified gridmanager error"
On Mar 3, 2014, at 9:44 AM, L Kreczko <L.Kreczko@xxxxxxxxxxxxx> wrote:
> After successfully deploying user quotas and priorities, the system
> stopped working for the Nagios tests. Nagios tests are mapped to one
> of the groups: cms.admin, ops.admin or dteam.
>
> In the configuration (see attachment fairshares.config) I specify the
> priorities with
> GROUP_PRIO_FACTOR_group_cms.admin = 100.0
> GROUP_PRIO_FACTOR_group_dteam = 100.0
> GROUP_PRIO_FACTOR_group_ops = 100.0
> # all other groups have a priority of 10000.0
> and the quotas with
> GROUP_QUOTA_DYNAMIC_group_cms = 0.80
> GROUP_QUOTA_DYNAMIC_group_cms.admin = 0.05
> GROUP_QUOTA_DYNAMIC_group_dteam = 0.02
> GROUP_QUOTA_DYNAMIC_group_ops = 0.05
>
> With 324 available slots this should give the 3 groups between 6.48
> (dteam) and 16.2 slots (ops). The NegotiatorLog (attached,
> negotiator_admin.log) confirms these numbers:
> 03/03/14 14:57:57 group quotas: fairshare (1): group= group_cms.admin
> quota= 12.3429 requested= 0
>
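As a quick sanity check of the arithmetic quoted above, here is a minimal sketch that multiplies the dynamic quota fractions by the 324 slots. It assumes that a subgroup's fraction (group_cms.admin) applies to its parent's share rather than to the whole pool; the helper name naive_slots is hypothetical. It does not try to reproduce whatever normalization the negotiator applies, which is presumably why the logged 12.3429 for group_cms.admin is a little lower than the naive figure.

```python
TOTAL_SLOTS = 324  # slot count quoted in the message

# Dynamic quota fractions copied from the quoted configuration.
dynamic_quota = {
    "group_cms": 0.80,
    "group_cms.admin": 0.05,
    "group_dteam": 0.02,
    "group_ops": 0.05,
}


def naive_slots(group: str) -> float:
    """Multiply the fractions down the group hierarchy.

    Assumption: a subgroup's fraction is taken from its parent's share.
    No normalization across sibling groups is attempted.
    """
    parts = group.split(".")
    share = float(TOTAL_SLOTS)
    for depth in range(1, len(parts) + 1):
        share *= dynamic_quota[".".join(parts[:depth])]
    return round(share, 2)


if __name__ == "__main__":
    for name in dynamic_quota:
        print(f"{name}: {naive_slots(name)} slots")
```

The dteam and ops figures match the 6.48 and 16.2 mentioned in the message; the cms.admin figure is only an approximation under the stated assumption.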
> However, looking at the log, I have only one slot for ops and none for the
> other two:
> 03/03/14 14:57:57 group quotas: Group group_dteam allocated= 0 usage= 0
> 03/03/14 14:57:57 group quotas: Group group_ops allocated= 0 usage= 0
> 03/03/14 14:57:57 group quotas: Group group_cms.admin allocated= 0 usage= 0
> and everything is used by the cms group:
> 03/03/14 14:57:57 group quotas: Group group_cms allocated= 315 usage= 315
> 03/03/14 14:57:57 group quotas: Group group_cms.production allocated=
> 7 usage= 7
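To compare allocations across negotiation cycles, the "allocated= ... usage= ..." lines can be pulled out of the NegotiatorLog with a small script. The following is only a sketch based on the line format visible in the excerpts above (the last entry is wrapped by the mail client but is assumed to be a single line in the real log); the function name parse_allocations is hypothetical.

```python
import re

# Pattern inferred from the quoted log excerpts, not from HTCondor documentation.
LINE_RE = re.compile(
    r"group quotas: Group (?P<group>\S+) "
    r"allocated=\s*(?P<allocated>[\d.]+) usage=\s*(?P<usage>[\d.]+)"
)


def parse_allocations(lines):
    """Return {group: (allocated, usage)} for every matching log line."""
    result = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            result[match.group("group")] = (
                float(match.group("allocated")),
                float(match.group("usage")),
            )
    return result


# Lines copied from the message, with the wrapped entry rejoined.
sample = [
    "03/03/14 14:57:57 group quotas: Group group_dteam allocated= 0 usage= 0",
    "03/03/14 14:57:57 group quotas: Group group_ops allocated= 0 usage= 0",
    "03/03/14 14:57:57 group quotas: Group group_cms.admin allocated= 0 usage= 0",
    "03/03/14 14:57:57 group quotas: Group group_cms allocated= 315 usage= 315",
    "03/03/14 14:57:57 group quotas: Group group_cms.production allocated= 7 usage= 7",
]

if __name__ == "__main__":
    for group, (allocated, usage) in parse_allocations(sample).items():
        print(f"{group}: allocated={allocated:g} usage={usage:g}")
```

Lines that do not match the pattern (such as the "fairshare" lines) are simply skipped.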
I assume you’re using all of these group quota settings for matching of vanilla universe or similar jobs, not grid universe jobs.
> And all Nagios jobs are aborted with
> ================================================
> - Got a job held event, reason: Unspecified gridmanager error
> - Got a job held event, reason: Unspecified gridmanager error
> - Job got an error while in the CondorG queue.
> Status Reason: hit job retry count (0)
> ================================================
>
> Am I doing something wrong?
To debug these failures, I’d have to see the underlying HTCondor submit files and gridmanager logs.
Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project