[Condor-users] host failure detection

Date: Tue, 01 Aug 2006 12:36:33 -0700
From: Michael Thomas <thomas@xxxxxxxxxxxxxxx>
Subject: [Condor-users] host failure detection

We recently had a disk problem on one of the 60 machines in our Condor
pool that caused jobs to fail quickly.  Because the failing node became
free again so soon, most jobs ended up landing on it, and a large share
of our total job submissions failed.  Unfortunately, we were not aware
of the failing node until we took a long look at the job output logs.

What kind of tools does Condor provide for monitoring things like:
* Average job time to completion per node
* Number of jobs processed per node

Any sort of host-level monitoring information that we can get from
Condor would be useful to plug into a higher-level monitoring system
like MonALISA, and would allow us to detect such problems as they occur
rather than days after the fact.
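
To illustrate the kind of per-node summary meant here, a minimal sketch
in Python follows.  It is not a Condor tool or API: it assumes the job
records have already been extracted somehow (for example from the job
logs), and the names JobRecord, per_host_stats, suspect_hosts and the
thresholds are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class JobRecord(NamedTuple):
    """One finished job; all fields are assumed to come from our own log parsing."""
    host: str          # execute node the job ran on
    runtime_s: float   # wall-clock seconds from start to exit
    succeeded: bool    # True if the job exited cleanly


def per_host_stats(records: Iterable[JobRecord]) -> dict:
    """Return job count, average runtime and failure rate for each host."""
    totals = defaultdict(lambda: {"jobs": 0, "failed": 0, "runtime_s": 0.0})
    for r in records:
        t = totals[r.host]
        t["jobs"] += 1
        t["failed"] += 0 if r.succeeded else 1
        t["runtime_s"] += r.runtime_s
    return {
        host: {
            "jobs": t["jobs"],
            "avg_runtime_s": t["runtime_s"] / t["jobs"],
            "failure_rate": t["failed"] / t["jobs"],
        }
        for host, t in totals.items()
    }


def suspect_hosts(stats: dict, min_jobs: int = 20,
                  max_failure_rate: float = 0.5) -> list:
    """List hosts with enough jobs to judge and a failure rate above the limit."""
    return sorted(
        host for host, s in stats.items()
        if s["jobs"] >= min_jobs and s["failure_rate"] > max_failure_rate
    )
```

A node that fails jobs quickly would show up with an unusually high job
count, a very short average runtime, and a high failure rate, which is
the pattern a higher-level monitor could alert on.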

--Mike


Follow-Ups:
- Re: [Condor-users] host failure detection
  - From: Matt Hope