
Re: [HTCondor-users] [condor-igwn] STARTD_CRON module for node health check?



On Tue, 2025-01-21 at 05:29:35 -0600, Tim Theisen wrote:
> Hello Steffen,
> 
> I recommend the "errors=panic" mount option. That way, when the kernel
> detects a disk error, it takes the whole system down immediately. I found
> the technique useful in a previous job. Having a node take itself down was
> much better than having a disk go read-only and having the node start
> behaving badly.

Thanks Tim,

I've also considered this, but this ("axe-type") method would deprive me of
whatever information is still available on the node. More often than not
I'd still be able to log in, because only a subset of the local filesystems
has been remounted read-only, and I've seen cases where even HTCondor kept
writing to its log files (in the page cache, just never synced back to disk).
Also, a kernel panic would not switch off the failed node; it deliberately
stays up so that the panic message can still be read...
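
(For the record, errors=panic is just a mount option, so on an ext4 scratch
partition the fstab entry would look something like the line below, with
device and mount point as placeholders:

  /dev/sdb1  /local  ext4  defaults,errors=panic  0  2

ext4 also knows errors=remount-ro and errors=continue; the remount-ro
behaviour is presumably what my nodes are doing right now.)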

The disk issue is also only one kind of health problem I'd like to handle
better, so I'll wait for more suggestions, which might be along the lines
of "peaceful" service refusal (via a START expression / ClassAds).

Of course the ultimate solution would be hardware that doesn't fail at the
rate we're currently seeing. Unfortunately, sending the nodes to the vendor
for inspection and repair would very likely not help, as that would leave
the other half of the culprit (the chassis, still nicely hosting 3 more
nodes) with us. We've seen more than once that re-seating the nodes has a
good chance of fixing the issue...

Thanks,
 Steffen

> > it seems I'm in dire need of a health checker that can take execution nodes
> > out of HTCondor service (or completely down) faster, and more reliably, than
> > a human admin can.
> > I'm sure such a module already exists, so before I start to write one myself
> > (one that has to be somewhat resilient against sudden disk disconnects) I'm
> > asking here first.
> > I'd appreciate any pointers - this is driving me crazy, in particular as
> > I'm currently "grounded" by a virus infection and can't perform any manual
> > (as in hands-on, literally) maintenance to tame the misbehaving connectors.

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Phone: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~