Re: [HTCondor-users] group quotas - Nagios tests with "Unspecified gridmanager error"



Hi,

Sorry for not getting back sooner; the issue is fixed. It was caused by
an old bug in the interaction between ARC CE and HTCondor:
"If this affects only jobs submitted through a glite-WMS (i.e.
ATLAS/CMS jobs are fine), then I would say it's this problem:

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4017

Since most/all ATLAS and CMS pilot factories have updated Condor to
the latest version they're not affected, but glite-WMS uses an ancient
version. This problem has been fixed also in ARC:

https://bugzilla.nordugrid.org/show_bug.cgi?id=3327"

The fix is simple and involves editing /usr/share/arc/submit-condor-job,
as described here:
https://github.com/HEP-Puppet/arc_ce/blob/master/fixes.md

I had simply forgotten to re-apply it after reinstalling the ARC CE.

Cheers,
Luke

On 5 March 2014 22:32, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
> On Mar 3, 2014, at 9:44 AM, L Kreczko <L.Kreczko@xxxxxxxxxxxxx> wrote:
>
>> After successfully deploying user quotas and priorities, the system
>> stopped working for the Nagios tests. Nagios tests are mapped to one
>> of the groups: cms.admin, ops.admin or dteam.
>>
>> In the configuration (see attachment fairshares.config) I specify the
>> priorities with
>> GROUP_PRIO_FACTOR_group_cms.admin =  100.0
>> GROUP_PRIO_FACTOR_group_dteam =  100.0
>> GROUP_PRIO_FACTOR_group_ops =  100.0
>> # all other groups have a priority of 10000.0
>> and the quotas with
>> GROUP_QUOTA_DYNAMIC_group_cms =  0.80
>> GROUP_QUOTA_DYNAMIC_group_cms.admin =  0.05
>> GROUP_QUOTA_DYNAMIC_group_dteam =  0.02
>> GROUP_QUOTA_DYNAMIC_group_ops =  0.05
>>
>> With 324 available slots this should give the 3 groups between 6.48
>> (dteam) and 16.2 slots (ops). The NegotiatorLog (attached,
>> negotiator_admin.log) confirms these numbers:
>> 03/03/14 14:57:57 group quotas: fairshare (1): group= group_cms.admin
>> quota= 12.3429  requested= 0
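
The slot estimates quoted above can be reproduced with a short sketch. It
only multiplies the slot count by each GROUP_QUOTA_DYNAMIC fraction, as the
message does; it is not the negotiator's actual algorithm. The names
TOTAL_SLOTS, DYNAMIC_QUOTAS and expected_slots are illustrative and do not
come from HTCondor.

```python
# Naive estimate: slots = total slots * dynamic quota fraction.
# Assumption: this mirrors the arithmetic in the message only; the
# negotiator's own calculation may differ.

TOTAL_SLOTS = 324  # slot count stated in the message

# Fractions copied from the GROUP_QUOTA_DYNAMIC_* settings quoted above.
DYNAMIC_QUOTAS = {
    "group_cms": 0.80,
    "group_cms.admin": 0.05,
    "group_dteam": 0.02,
    "group_ops": 0.05,
}


def expected_slots(total_slots, quotas):
    """Return the naive per-group slot estimate, rounded to 2 decimals."""
    return {group: round(total_slots * fraction, 2)
            for group, fraction in quotas.items()}


expected = expected_slots(TOTAL_SLOTS, DYNAMIC_QUOTAS)
for group, slots in sorted(expected.items()):
    print(f"{group}: {slots}")
```

This gives 6.48 for group_dteam and 16.2 for group_ops, matching the figures
in the message. The 12.3429 that the negotiator reports for group_cms.admin
is not reproduced by this simple product, so the negotiator evidently takes
more into account than the flat fraction used here.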
>>
>> However, I have only one slot for ops and none for the
>> other two:
>> 03/03/14 14:57:57 group quotas: Group group_dteam  allocated= 0  usage= 0
>> 03/03/14 14:57:57 group quotas: Group group_ops  allocated= 0  usage= 0
>> 03/03/14 14:57:57 group quotas: Group group_cms.admin  allocated= 0  usage= 0
>> and everything is used by the cms group:
>> 03/03/14 14:57:57 group quotas: Group group_cms  allocated= 315  usage= 315
>> 03/03/14 14:57:57 group quotas: Group group_cms.production  allocated=
>> 7  usage= 7
>
> I assume you're using all of these group quota settings for matching of vanilla universe or similar jobs, not grid universe jobs.
>
>> And all Nagios jobs are aborted with
>> ================================================
>> - Got a job held event, reason: Unspecified gridmanager error
>> - Got a job held event, reason: Unspecified gridmanager error
>> - Job got an error while in the CondorG queue.
>> Status Reason: hit job retry count (0)
>> ================================================
>>
>> Am I doing something wrong?
>
>
> To debug these failures, I'd have to see the underlying HTCondor submit files and gridmanager logs.
>
> Thanks and regards,
> Jaime Frey
> UW-Madison HTCondor Project



-- 
*********************************************************
  Dr Lukasz Kreczko            +44 (0)117 928 8724
  CMS Group
  School of Physics
  University of Bristol
*********************************************************