Good morning,
It seems I'm in dire need of a health checker that can take execution nodes
out of HTCondor service (or power them down completely) faster, and more
reliably, than a human admin can.
In particular, I've been facing read-only disks caused by ageing connectors
failing due to thermal/mechanical stress.
These nodes are 2U4N nodes which add extra connectors along the data path
to the disks kept in the common enclosure.
I'm sure such a module already exists, so before I start writing one myself
(one that has to be somewhat resilient against sudden disk disconnects) I'm
asking here first.
Fortunately, it seems that executables invoked often enough may be kept in
the page cache and would be run from there _even if the disk is gone_. Such
a module should avoid writing to $TMPDIR etc. for obvious reasons, while
still changing a crucial attribute (one that would go into the START
expression?) or using other means ("ipmitool power off" would be the
axe-type option) to disconnect/disable the "black hole" node.
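
To make the idea concrete, here is a minimal, untested sketch of the kind of
probe I have in mind. It only reads /proc/mounts (which lives in kernel
memory, so nothing is read from or written to the suspect disk) and prints
ClassAd-style "Name = Value" lines on stdout. The attribute names
NODE_IS_HEALTHY and NODE_HEALTH_REASON and the list of watched mount points
are my own placeholders, not existing HTCondor attributes. I assume it could
be run periodically through the startd cron mechanism (STARTD_CRON_JOBLIST
and friends) and referenced from START, but I have not verified that setup.

```python
#!/usr/bin/env python3
"""Sketch of a read-only-disk probe that prints ClassAd-style attributes."""
import sys

# Hypothetical mount points the jobs depend on; adjust per node.
WATCHED_MOUNTS = ("/", "/var/lib/condor")


def read_only_mounts(mounts_text, watched=WATCHED_MOUNTS):
    """Return the watched mount points currently mounted read-only.

    mounts_text is the content of /proc/mounts. If a mount point is listed
    more than once, the last entry wins (it is the one on top).
    """
    state = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if mount_point in watched:
            # Exact match on "ro", so "errors=remount-ro" is not a hit.
            state[mount_point] = "ro" in options
    return sorted(m for m, is_ro in state.items() if is_ro)


def classad_lines(bad_mounts):
    """Format the result as 'Name = Value' lines (placeholder names)."""
    healthy = "False" if bad_mounts else "True"
    reason = ",".join(bad_mounts)
    return [
        "NODE_IS_HEALTHY = %s" % healthy,
        'NODE_HEALTH_REASON = "%s"' % reason,
    ]


def main():
    try:
        with open("/proc/mounts") as fh:
            bad = read_only_mounts(fh.read())
    except OSError:
        # If even procfs is unreadable, report the node as unhealthy.
        bad = ["/proc/mounts unreadable"]
    # Only write to stdout; never touch $TMPDIR or any on-disk file.
    sys.stdout.write("\n".join(classad_lines(bad)) + "\n")


if __name__ == "__main__":
    main()
```

This only catches the case where the kernel has already remounted the
filesystem read-only (e.g. ext4 with errors=remount-ro). A disk that has
vanished without a remount would need a different test, and the interpreter
itself still has to be available from page cache for the probe to run at all.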
I'd appreciate any pointers - this is driving me crazy, in particular as
I'm currently "grounded" by a virus infection and can't perform any manual
(as in hands-on, literally) maintenance to tame the misbehaving connectors.
Thanks so far,
keep safe,
Steffen