Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Matching to not responding machines

Date: Wed, 28 Mar 2012 12:14:32 +0200
From: Rob de Graaf <r.degraaf@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Matching to not responding machines

Hi Hermann,

On 03/28/2012 11:32 AM, Hermann Fuchs wrote:

> However, I would like to implement some kind of a failure detection for
> the running grid as network problems will and do occur.
> Is there a query which is only answered when the machines do
> communicate?
> condor_status seems to be misleading, the machines listed there which
> stopped communicating remain there in some cases (e.g. the mentioned
> case).

You could use INVALIDATE_STARTD_ADS (man condor_advertise) to make the
collector forget about specific machines. You would need to know which
machines to invalidate. The only way I can think of right now is to ask
them directly (condor_status -direct or maybe condor_config_val) and
check the exit status of those commands. The downside of this approach
is that you will have to endure a timeout for every machine that has the
problem. If you have hundreds or thousands of machines, it will quickly
become unfeasible.
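
A minimal sketch of the direct-query idea, in Python. It assumes that
"condor_status -direct <host>" exits non-zero (or hangs until the timeout)
when the startd does not answer; that behaviour should be verified on your
own pool first. The function names, the 20-second timeout and the worker
count are illustrative choices, not anything Condor defines. Probing in
parallel only softens the per-machine timeout cost, it does not remove it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def probe(host, timeout=20, cmd=("condor_status", "-direct")):
    """Return True if the direct query to `host` exits with status 0.

    `cmd` is the command prefix; the host name is appended to it.
    A timeout or a missing executable counts as "not responding".
    """
    try:
        result = subprocess.run(
            [*cmd, host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def find_unresponsive(hosts, workers=32, probe_fn=probe):
    """Probe all hosts concurrently and return those that did not answer."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        alive = list(pool.map(probe_fn, hosts))
    return [h for h, ok in zip(hosts, alive) if not ok]
```

The returned host names are the candidates for INVALIDATE_STARTD_ADS; how
the invalidation ad is written is described in the condor_advertise man
page and is deliberately left out here.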

Alternatively, you could tweak CLASSAD_LIFETIME on the collector to make
it forget about unresponsive machines more quickly, but it might also
accidentally invalidate working machines if any updates get lost on the
network. See:
http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#SECTION004316000000000000000
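
As a rough illustration of that trade-off, here is a collector
configuration fragment. The value 600 is only an example, and the
reasoning assumes the startds send updates every 300 seconds (which, as
far as I know, is the default UPDATE_INTERVAL; check your own
configuration). With a 900-second lifetime an ad survives about two
consecutive lost updates; with 600 seconds it survives about one.

```
# On the collector host (example value, not a recommendation).
# Ads that are not refreshed within this many seconds are dropped.
CLASSAD_LIFETIME = 600
```

The collector has to pick up the new value (e.g. via condor_reconfig)
before it takes effect.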


Regards,

Rob
