Re: [HTCondor-users] [condor-igwn] STARTD_CRON module for node health check?
- Date: Thu, 6 Feb 2025 09:20:01 +0100
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] [condor-igwn] STARTD_CRON module for node health check?
On Tue, 2025-01-21 at 13:10:36 +0100, Steffen Grunewald wrote:
> On Tue, 2025-01-21 at 05:29:35 -0600, Tim Theisen wrote:
> > Hello Steffen,
> >
> > I recommend the "errors=panic" mount option. That way, when the kernel
> > detects a disk error, it takes the whole system down immediately. I found
> > the technique useful in a previous job. Having a node take itself down was
> > much better than having a disk go read-only and having the node start
> > behaving badly.
>
> Thanks Tim,
>
> I've also considered this, but this ("axe type") method would deprive me of
> any information still available on the node. More often than not I'd be able
> to log in, because only a subset of the local filesystems had been remounted
> read-only, and I've seen cases where even HTCondor continued writing to its
> log files (still in cache, just not synced back to disk).
Hi all,
while still searching for the "ultimate" way of detecting "black holes" in the
pool, I found another piece of information that might be useful.
Nodes that failed to run jobs in the way we observed would keep advertising
their slot(s). As a result, multiple jobs get assigned to the same slot - a
situation that can be detected by running something like

  condor_q -glo -run -all | grep @ | cut -d@ -f2 | sort | uniq -c

- nodes affected by the problem show up multiple times in the output.
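To single out the affected hosts automatically, something like the following
should do (the awk step and its threshold are a sketch of mine and may need
tuning for EPs that legitimately run several jobs at once):

  condor_q -glo -run -all | grep @ | cut -d@ -f2 | sort | uniq -c \
    | awk '$1 > 1 { print $2 }'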
Since this check runs on the master (without logging into each and every EP),
it's also faster, and the EP list it yields can be used in multiple ways - in
the worst case to fire off an "ipmitool power off" over the network. And it's
cron'able.
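A rough, cron-able sketch of that last resort, fed from the pipeline above
(the BMC naming scheme and credentials are placeholders, and the ipmitool call
is deliberately commented out - I'd dry-run this first):

  #!/bin/sh
  # Read suspect EP hostnames (one per line) on stdin, log them, and
  # optionally power them off via IPMI. Dry-run by default.
  while read -r host; do
      logger -t blackhole-check "suspect black hole EP: $host"
      # Enable once confident; the "-bmc" suffix and credentials are guesses.
      # ipmitool -I lanplus -H "${host}-bmc" -U ADMIN -P secret power off
  done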
Still, it will take some time for the collector to notice that its information
about the affected nodes and jobs is stale, but that's a matter of minutes.
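To see which machine ads have gone stale in the meantime, something along
these lines should work (attribute names quoted from memory, worth
double-checking):

  # machine ads the collector hasn't heard from in the last 10 minutes
  condor_status -constraint 'time() - LastHeardFrom > 600' -af Machine LastHeardFrom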
Maybe this helps someone - it certainly helps me to keep my users off the
phone.
Best,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Phone: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~