While disk failures may be the most common cause of black holes, I'm
also interested in a general solution.
> On the Dell machines this event mechanism is very fast (seconds)
> whereas on the HP's it can be as much as 5 mins.
Even a 5 minute delay would be preferable to the situation I have now.
But I see your point about using a system-level error-checking script
that can automatically update the Condor ClassAd for that machine. I got
the same suggestion on another list.
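
A minimal sketch of what such a check script might look like, assuming
its output is picked up by some hook that merges "Name = Value" lines
into the machine ClassAd (Condor's startd cron facility is the kind of
thing I have in mind). The attribute name NODE_HEALTHY and the choice of
directory to probe are hypothetical, and a real check would need to
cover more than a write test:

```python
import os
import tempfile


def scratch_is_writable(path):
    """Return True if a small file can be created and written in path."""
    try:
        # The file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"probe")
            handle.flush()
            os.fsync(handle.fileno())
        return True
    except OSError:
        # Missing directory, read-only filesystem, I/O error, and so on.
        return False


def health_attribute(path, name="NODE_HEALTHY"):
    """Format the check result as one 'Name = Value' line.

    NODE_HEALTHY is a made-up attribute name used for illustration.
    """
    value = "True" if scratch_is_writable(path) else "False"
    return "%s = %s" % (name, value)


if __name__ == "__main__":
    # Probe the system temp directory; a real node would probably
    # probe the job scratch/execute directory instead.
    print(health_attribute(tempfile.gettempdir()))
```

A machine could then refuse jobs with a START expression that requires
the attribute to be true, again assuming the attribute actually reaches
the ClassAd.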
Additionally, one thing I would really like to see is a way to get these
per-host statistics into a higher level monitoring infrastructure like
MonALISA. I already monitor the cluster load, network IO, and per-VO
jobs in MonALISA. If Condor provided a way to obtain the number of jobs
completed per node and the average time to completion per node, it would
help me detect both black holes and underperforming nodes at the same
time.
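
To make the idea concrete, here is a rough sketch of the per-node
statistics I mean, assuming the completed-job records are already
available as (host, runtime in seconds) pairs from some source. How to
extract them from Condor is left open, and the two threshold factors
are arbitrary illustrative values, not tuned ones:

```python
from collections import defaultdict
from statistics import median


def per_node_stats(records):
    """Return {host: (jobs_completed, mean_runtime_seconds)}.

    records is an iterable of (host, runtime_seconds) pairs, one per
    completed job.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for host, runtime in records:
        totals[host][0] += 1
        totals[host][1] += runtime
    return {host: (n, total / n) for host, (n, total) in totals.items()}


def classify_nodes(stats, fast_factor=0.1, slow_factor=2.0):
    """Label each node by comparing its mean runtime to the median
    of all per-node means.

    A node finishing jobs far faster than its peers is a black hole
    suspect (jobs fail almost immediately); one far slower is
    underperforming. Both factors are hypothetical defaults.
    """
    typical = median(mean for _, mean in stats.values())
    labels = {}
    for host, (_, mean) in stats.items():
        if mean < fast_factor * typical:
            labels[host] = "black-hole-suspect"
        elif mean > slow_factor * typical:
            labels[host] = "slow"
        else:
            labels[host] = "ok"
    return labels
```

A node that completes many jobs in a few seconds each while its peers
take an hour would be flagged as a suspect, and the same numbers could
be pushed into MonALISA as ordinary per-host metrics.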