Aside from setting up up/down monitoring to ensure that your
collectors are healthy, submissions are working, and startd nodes
haven't fallen out of a pool, the real monitoring value that's been
added in the last few years is in the operational statistics included
in the negotiator and schedd daemon ClassAds. It's worth your while to
collect and graph some of those stats. I covered a few of the graphs
we use at HTCondor Week a couple of years ago:
http://research.cs.wisc.edu/htcondor/CondorWeek2012/presentations/carstensen-dreamworks.pdf
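
For anyone who wants a starting point for collecting those stats, here is a
minimal sketch, not the setup from the slides. It assumes you capture the
output of `condor_status -schedd -long` yourself and want Graphite-style
plaintext lines ("path value timestamp"). The function names, the
`htcondor.schedd` metric prefix, the choice of attributes, and the sample
values are all made up for illustration.

```python
# Sketch: turn `condor_status -schedd -long` style text into
# Graphite plaintext lines. All names and sample values are hypothetical.

ATTRS = ("TotalRunningJobs", "TotalIdleJobs", "TotalHeldJobs")

# Made-up sample in the "Attr = value" form, one blank line between ads.
SAMPLE_OUTPUT = '''Name = "submit1.example.com"
TotalRunningJobs = 120
TotalIdleJobs = 45
TotalHeldJobs = 3

Name = "submit2.example.com"
TotalRunningJobs = 0
TotalIdleJobs = 7
'''


def parse_long_ads(text):
    """Split long-format ClassAd text into a list of {attr: raw_value} dicts."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current ad.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition("=")
        if sep:
            current[name.strip()] = value.strip()
    if current:
        ads.append(current)
    return ads


def to_graphite(ads, attrs, timestamp, prefix="htcondor.schedd"):
    """Emit one 'path value timestamp' line per numeric attribute found."""
    lines = []
    for ad in ads:
        # Dots separate path components in Graphite, so replace them.
        host = ad.get("Name", "unknown").strip('"').replace(".", "_")
        for attr in attrs:
            raw = ad.get(attr)
            if raw is None:
                continue  # attribute not present in this ad
            try:
                float(raw)  # only forward values that are plain numbers
            except ValueError:
                continue
            lines.append(f"{prefix}.{host}.{attr} {raw} {timestamp}")
    return lines
```

Calling `to_graphite(parse_long_ads(SAMPLE_OUTPUT), ATTRS, 1389190800)` on the
sample yields five lines, since the second ad has no `TotalHeldJobs`. In
practice you would feed in the real command output and the current time, and
pick whichever statistics attributes your version publishes.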
-- Lans Carstensen
On Wed, Jan 8, 2014 at 6:40 AM, Cody Belcher <codytrey@xxxxxxxxxxxxxxxx> wrote:
I suppose my end goal is to easily see when a node has an issue, but you are
right, I do get emails when, say, the schedd crashes or something. Without any
extra configuration I can use Hobbit to see which hosts are on, and that
will work for my needs.
Thanks,
Cody
On 01/08/2014 08:13 AM, Ben Cotton wrote:
Cody,
When I was at Purdue, I tried monitoring HTCondor servers (i.e. not
execute nodes) with Nagios. I eventually removed the checks because
they didn't add value. The condor_master does a good job of making
sure the daemons are running. I did get alerts for the schedd checks,
but they turned out to be false alarms when the schedd was just too
busy to answer the condor_q from Nagios. (I suppose that's an issue in
itself, but it wasn't what we were checking for.)
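
One way to soften that kind of false alarm is to treat "no answer in time"
differently from "answered with an error". The following is only a sketch
under that assumption, not the check that was actually used: `check_command`
and its timeout default are hypothetical, and the return values follow the
usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_command(cmd, timeout_s=30.0):
    """Run cmd and map the outcome to a (status, message) pair.

    A timeout is reported as UNKNOWN rather than CRITICAL, on the
    assumption that a slow daemon is busy, not necessarily down.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        return UNKNOWN, f"no answer within {timeout_s:g}s (daemon may be busy)"
    except OSError as exc:
        # For example, the command is not installed on the monitoring host.
        return UNKNOWN, f"could not run {cmd[0]}: {exc}"
    if result.returncode == 0:
        return OK, "command succeeded"
    return CRITICAL, f"command exited with status {result.returncode}"
```

A wrapper script would call something like `check_command(["condor_q"])`,
print the message, and exit with the returned status. Whether a busy schedd
should page anyone is still a policy decision the sketch does not answer.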
I guess the point of this story is to ask what exactly you want to
check and why. Knowing that makes it easier to offer guidance.
Thanks,
BC
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/