Hi Hermann,

On 03/28/2012 11:32 AM, Hermann Fuchs wrote:
> However, I would like to implement some kind of a failure detection for the running grid as network problems will and do occur. Is there a query which is only answered when the machines do communicate? condor_status seems to be misleading, the machines listed there which stopped communicating remain there in some cases (e.g. the mentioned case).
You could use INVALIDATE_STARTD_ADS (see man condor_advertise) to make the collector forget about specific machines. You would need to know which machines to invalidate, though. The only way I can think of right now is to ask each machine directly (condor_status -direct or maybe condor_config_val) and check the exit status of the command. The downside of this approach is that you have to wait out a timeout for every machine that has the problem. If you probe hundreds or thousands of machines one after another, this quickly becomes impractical.
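To make the probing idea concrete, here is a rough Python sketch. It runs condor_status -direct against each host with a per-host timeout and uses a thread pool so the timeouts overlap instead of adding up. The function names, the 10-second timeout and the worker count are my own placeholders, and it assumes condor_status exits with a non-zero status when the machine cannot be reached, which you should verify on your version before relying on it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def probe(host, timeout=10.0, run=subprocess.run):
    """Return True if `condor_status -direct <host>` exits with status 0 in time.

    Assumption: a non-zero exit status (or a timeout) means the machine
    did not answer. `run` is injectable so the logic can be tested
    without a Condor installation.
    """
    try:
        result = run(
            ["condor_status", "-direct", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the condor_status binary could not be started.
        return False
    return result.returncode == 0


def find_unresponsive(hosts, timeout=10.0, workers=32, run=subprocess.run):
    """Probe all hosts concurrently; return those that failed, in input order."""
    hosts = list(hosts)
    if not hosts:
        return []
    with ThreadPoolExecutor(max_workers=min(workers, len(hosts))) as pool:
        alive = list(pool.map(lambda h: probe(h, timeout, run), hosts))
    return [h for h, ok in zip(hosts, alive) if not ok]
```

Calling find_unresponsive() with the machine names taken from a normal condor_status run would give you the list of candidates to pass to condor_advertise INVALIDATE_STARTD_ADS. With the probes running in parallel, the total wait is closer to one timeout per batch of workers than one timeout per dead machine.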
Alternatively, you could lower CLASSAD_LIFETIME on the collector to make it forget about unresponsive machines more quickly. The trade-off is that a shorter lifetime may also expire the ads of working machines if some of their updates get lost on the network. See: http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#SECTION004316000000000000000
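As an illustration only, the setting would go into the configuration of the machine running the collector, roughly like this. The value 600 is an arbitrary example and not a recommendation; as far as I know the default is 900 seconds, but please check the manual page linked above for your version.

```
# Collector-side configuration (illustrative value, in seconds).
# Ads that have not been refreshed within this time are dropped.
CLASSAD_LIFETIME = 600
```

Whatever value you pick should stay comfortably above the interval at which the machines send their updates, so that a single lost update does not make a healthy machine disappear.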
Regards,
Rob