Re: [HTCondor-users] Monitoring condor nodes with hobbit
- Date: Wed, 8 Jan 2014 15:15:40 -0600
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Monitoring condor nodes with hobbit
Hi Cody,
It's worth noting that HTCondor 8.1 now forwards all the stats Lans references (and more) to Ganglia "out-of-the-box" (for some value of "out-of-the-box").
However, I think you're more referring to health monitoring, right? Other than periodic probing (I do "condor_q -const false", for example), I can't think of anything overly clever.
As Ben mentioned, even periodic probing can be tough as it's difficult to differentiate "very busy" from "not responding".
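A minimal sketch of that kind of probe, in Python, might look like the following. It runs the same "condor_q -const false" query with a timeout and only raises an alert after several consecutive failures, which is one hedge against the "very busy" case. The function names, the 30-second timeout, and the threshold of three are illustrative assumptions, not anything HTCondor prescribes, and the sketch also assumes condor_q exits non-zero when the query fails. A single timeout still cannot tell a busy schedd from a dead one.

```python
import subprocess


def probe(cmd=("condor_q", "-const", "false"), timeout=30):
    """Run a cheap query and classify the outcome.

    The constraint "false" matches no jobs, so the schedd only has to
    answer. The timeout value is an assumption; tune it per site.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Busy or hung: one probe cannot tell which.
        return "timeout"
    except FileNotFoundError:
        # The command is not installed or not on PATH.
        return "missing"
    return "ok" if result.returncode == 0 else "error"


def should_alert(history, threshold=3):
    """Alert only after `threshold` consecutive non-ok probe results."""
    recent = history[-threshold:]
    return len(recent) == threshold and all(r != "ok" for r in recent)
```

Called from cron or a monitoring agent, each run would append the result of probe() to a stored history and notify only when should_alert() returns True.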
Brian
On Jan 8, 2014, at 9:54 AM, Lans Carstensen <Lans.Carstensen@xxxxxxxxxxxxxx> wrote:
> Aside from setting up up/down monitoring to ensure that your
> collectors are healthy, submissions are working, and startd nodes
> haven't fallen out of a pool - the real monitoring value that's been
> added in the last few years is in operational statistics included in
> the negotiator and schedd daemon classads. It's worth your while to
> collect and graph some of those stats. A couple of years ago at
> HTCondor Week I covered a couple of the graphs we use.
>
> http://research.cs.wisc.edu/htcondor/CondorWeek2012/presentations/carstensen-dreamworks.pdf
>
> -- Lans Carstensen
>
> On Wed, Jan 8, 2014 at 6:40 AM, Cody Belcher <codytrey@xxxxxxxxxxxxxxxx> wrote:
>> I suppose my end goal is to easily see when a node has an issue, but you are
>> right, I do get emails when, say, the schedd crashes. Without any
>> extra configuration I can use hobbit to see which hosts are on, and that
>> will work for my needs.
>>
>> Thanks,
>>
>> Cody
>>
>>
>> On 01/08/2014 08:13 AM, Ben Cotton wrote:
>>>
>>> Cody,
>>>
>>> When I was at Purdue, I tried monitoring HTCondor servers (i.e. not
>>> execute nodes) with Nagios. I eventually removed the checks because
>>> they didn't add value. The condor_master does a good job of making
>>> sure the daemons are running. I did get alerts for the schedd checks,
>>> but they turned out to be false alarms when the schedd was just too
>>> busy to answer the condor_q from Nagios. (I suppose that's an issue in
>>> itself, but it wasn't what we were checking for).
>>>
>>> I guess the point of this story is to ask what exactly you want to
>>> check and why. Knowing that makes it easier to offer guidance.
>>>
>>>
>>> Thanks,
>>> BC
>>>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/