Re: [Condor-users] host failure detection

Date: Wed, 2 Aug 2006 17:17:11 +0100
From: "Matt Hope" <matthew.hope@xxxxxxxxx>
Subject: Re: [Condor-users] host failure detection

On 8/2/06, Michael Thomas <thomas@xxxxxxxxxxxxxxx> wrote:

> While disk failures may be the biggest black hole cause, I'm also
> interested in looking at a general solution.
>
> > On the Dell machines this event mechanism is very fast (seconds)
> > whereas on the HP's it can be as much as 5 mins.
>
> Even a 5 minute delay would be preferable to the situation I have now.
> But I see your point in using a system-level error checking script which
> can automatically update the condor classad for that machine.  I got the
> same suggestion on another list.
>
> Additionally, one thing I would really like to see is a way to get these
> per-host statistics into a higher level monitoring infrastructure like
> MonALISA.  I already monitor the cluster load, network IO, and per-VO
> jobs in MonALISA.  If condor provided a way to obtain # jobs completed
> per node, and average time to completion per node, it would help me to
> detect both a black hole and underperforming nodes at the same time.

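On the system-level checking script: a minimal sketch of what such a
probe might look like, assuming the startd is set up to run it
periodically and merge "Name = Value" lines from its stdout into the
machine ad. The attribute names (NodeHealthy, NodeHealthCheckedDir) and
the SCRATCH_DIR environment variable are made up for illustration, and
the probe only tests that a scratch directory is writable, which is a
stand-in for whatever hardware event check you actually use.

```python
#!/usr/bin/env python3
"""Hypothetical node health probe that prints ClassAd-style attributes."""
import os
import tempfile


def scratch_is_writable(directory):
    """Return True if a small file can be written and read back in directory."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
            os.fsync(handle.fileno())  # force the write to reach the disk
            handle.seek(0)
            return handle.read() == b"probe"
    except OSError:
        # Missing directory, read-only filesystem, I/O error, disk full, ...
        return False


def classad_lines(directory):
    """Format the probe result as 'Name = Value' lines (names are illustrative)."""
    healthy = scratch_is_writable(directory)
    return [
        "NodeHealthy = %s" % ("True" if healthy else "False"),
        'NodeHealthCheckedDir = "%s"' % directory,
    ]


if __name__ == "__main__":
    # SCRATCH_DIR is an assumed environment variable, not a Condor setting.
    target = os.environ.get("SCRATCH_DIR", tempfile.gettempdir())
    for line in classad_lines(target):
        print(line)
```

The machine's START expression would still have to refer to the
attribute for a sick node to stop accepting jobs; that part depends on
your configuration and is not shown here.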

Have you looked at Hawkeye?

http://www.cs.wisc.edu/condor/hawkeye/

I don't use it myself but a lot of others on this list do...
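
Failing that, the two per-node numbers you mention (jobs completed and
average time to completion) are easy to compute yourself once you have
one record per finished job. A rough sketch, assuming you can already
extract (host, runtime in seconds, succeeded) tuples from your job
logs; the function names and the thresholds are arbitrary choices for
illustration, not anything Condor provides.

```python
from collections import defaultdict


def per_node_stats(jobs):
    """Aggregate (host, runtime_seconds, succeeded) tuples per host."""
    totals = defaultdict(lambda: {"completed": 0, "failed": 0, "runtime": 0.0})
    for host, runtime, succeeded in jobs:
        entry = totals[host]
        entry["completed" if succeeded else "failed"] += 1
        entry["runtime"] += runtime
    stats = {}
    for host, entry in totals.items():
        count = entry["completed"] + entry["failed"]
        stats[host] = {
            "completed": entry["completed"],
            "failed": entry["failed"],
            "mean_runtime": entry["runtime"] / count,
        }
    return stats


def suspected_black_holes(stats, min_jobs=10, max_mean_runtime=60.0):
    """Hosts that ran many jobs, finished them very quickly, and mostly failed.

    The default thresholds are guesses and need tuning for a real workload.
    """
    return sorted(
        host
        for host, s in stats.items()
        if s["completed"] + s["failed"] >= min_jobs
        and s["mean_runtime"] <= max_mean_runtime
        and s["failed"] > s["completed"]
    )


if __name__ == "__main__":
    # Invented sample data: node02 fails every job within a few seconds.
    sample = [("node01", 3600, True)] * 12 + [("node02", 5, False)] * 20
    print(suspected_black_holes(per_node_stats(sample)))
```

The same per-host dictionary could be pushed to MonALISA, and an
unusually high mean_runtime would point at an underperforming node
rather than a black hole.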

Matt
