Re: [HTCondor-users] [External] Re: STARTD_CRON module for node health check?
- Date: Tue, 21 Jan 2025 15:36:54 +0000
- From: "Pelletier, Michael V " <Michael.V.Pelletier@xxxxxxx>
- Subject: Re: [HTCondor-users] [External] Re: STARTD_CRON module for node health check?
Hi, Steffen,
I set up a start expression mechanism that handles a variety of health checks on the nodes. The gist was to set up a "StartError" attribute, attach it to the STARTD_ATTRS list, and have that expression query other attributes to arrive at the proper value for StartError.
Most of the system health attributes it queries are produced by STARTD_CRON jobs, which run periodically to update the machine ad.
For example, one of the tests I set up was to run "ipmitool chassis status" to check for any power/cooling or disk faults. Back in the RHEL 5 days this required setuidperl, but nowadays you'd probably use sudo for it, or set up /dev/ipmi0 to be readable by the Condor service account using a udev rule.
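For instance, a udev rule along these lines should do it; the "condor" group name here is an assumption, so substitute whatever group your startd's service account belongs to (the sudo route is sketched in the comment):
-----
# Hypothetical /etc/udev/rules.d/99-ipmi-condor.rules:
# let the service account's group open the IPMI device
KERNEL=="ipmi*", GROUP="condor", MODE="0660"

# Or, a sudoers entry instead (account name assumed):
# condor ALL=(root) NOPASSWD: /usr/bin/ipmitool chassis status
-----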
Here's the "condor_chassis" Perl script I wrote in 2013:
----------
#!/usr/bin/perl
$ENV{PATH} = '';

my @OUT;
if ($> == 0) { # Need root access to read IPMI sensor data
    $< = 0;
    open(IPMI, "/usr/bin/ipmitool chassis status 2>/dev/null |");
    @OUT = <IPMI>;
    close(IPMI);
}
$< = 65534; $> = 65534; # Shed root privileges (run as "nobody")

my $found = 0;
my $fault = "UNDEFINED";
my $msg = (@OUT) ? "No error" : "No output from /usr/bin/ipmitool";

for (@OUT) {
    # Warn about disk faults
    if ( m{(Disk|Drive)\s.*Fault\s*:\s*(\w+)}i ) {
        print "DiskFault = $2\nDiskFaultMsg = \"",
            (($2 =~ m{true}i) ? "Disk fault reported" : "No error"),
            "\"\n";
        next;
    }
    # Set power/cooling fault message to the first one found to be true
    if ( !$found && m{(.* Fault|Overload)\s*:\s*(\w+)}i ) {
        $fault = lc($2);
        if ($fault eq "true") {
            $msg = "$1 reported";
            $found = 1;
        }
    }
}
print "PowerAndCoolingFault = $fault\nPowerAndCoolingFaultMsg = \"$msg\"\n--\n";
----------
As you can see, this looks for disk faults and for power overload or other faults in the PSUs as reported by ipmitool. Note that it sheds root as quickly as possible and switches to the "nobody" ID before parsing the output. Run as a startd cron job, its output puts DiskFault and PowerAndCoolingFault (true or false) into the machine ad, along with their corresponding "Msg" attributes.
Here's the basic cron job setup:
-----
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) chassis
STARTD_CRON_CHASSIS_EXECUTABLE = $(SITE_LIBEXEC)/condor_chassis
STARTD_CRON_CHASSIS_MODE = Periodic
STARTD_CRON_CHASSIS_PERIOD = 10m
STARTD_CRON_CHASSIS_KILL = True
STARTD_CRON_CHASSIS_RECONFIG_RERUN = True
-----
An early piece of the config.d initializes the StartError expression:
-----
StartError = ""
STARTD_ATTRS = $(STARTD_ATTRS) StartError
-----
Then, to apply this to the start expression, in the "chassis_status" Condor config.d file:
-----
START = $(START) && (DiskFault =!= True)
StartError = ifThenElse(DiskFault =?= True, DiskFaultMsg, $(StartError))
START = $(START) && (PowerAndCoolingFault =!= True)
StartError = ifThenElse(PowerAndCoolingFault =?= True, \
PowerAndCoolingFaultMsg, $(StartError))
-----
As more tests are applied, StartError becomes a deeply nested ifThenElse() expression that resolves to the appropriate message based on which fault is raised. I wrote this before the ternary operator was added to the ClassAd language, so feel free to streamline it. I also wrote a condor_status query to show the StartError message of any affected nodes.
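That query isn't reproduced here, but something along these lines will do it, assuming every node defines StartError as above:
-----
condor_status -constraint 'StartError != ""' -autoformat:h Machine StartError
-----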
The result: the ipmitool status is checked once every 10 minutes, and if a fault is detected, the node's StartError attribute reports the first listed reason (based on the sequence in which the StartError expression was built) and the START expression goes false to reject new matches.
I also set up cron scripts for other system health checks.
For example, to track disk free space on key volumes: running condor_diskfree /proj/radar as a startd cron job would produce DiskFree_proj_radar and DiskFreePct_proj_radar attributes, and a DiskFreeFault Boolean can apply different free-space criteria to different volumes, e.g. DiskFreeFault = (DiskFreePct_proj_radar <= 5). That attribute then goes into the start expression to halt matches if the volume drops below 5% free. Note that if it's an NFS volume, this would halt all matching across the entire pool unless you got more specific about it, so it's usually best to restrict this to local-disk filesystems.
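Wiring that into the machine ad and the start expression follows the same pattern as the chassis check; here's a sketch, assuming the hypothetical condor_diskfree script and the attribute names above:
-----
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) diskfree
STARTD_CRON_DISKFREE_EXECUTABLE = $(SITE_LIBEXEC)/condor_diskfree
STARTD_CRON_DISKFREE_ARGS = /proj/radar
STARTD_CRON_DISKFREE_MODE = Periodic
STARTD_CRON_DISKFREE_PERIOD = 10m

DiskFreeFault = (DiskFreePct_proj_radar <= 5)
DiskFreeFaultMsg = "Less than 5% free space on /proj/radar"
STARTD_ATTRS = $(STARTD_ATTRS) DiskFreeFault

START = $(START) && (DiskFreeFault =!= True)
StartError = ifThenElse(DiskFreeFault =?= True, $(DiskFreeFaultMsg), $(StartError))
-----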
Another one was condor_checkproc, which confirms that specified processes (in my case automount, auditd, and the Centrify adclient) are in a good state.
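A minimal sketch of what such a check might look like, in the same style as condor_chassis (the pgrep approach and the ProcessFault attribute names are illustrative, not the original script):
-----
#!/usr/bin/perl
# Illustrative sketch only: report a ProcessFault attribute if any of the
# listed daemons is not running.
use strict;
use warnings;
$ENV{PATH} = '';

my @daemons = qw(automount auditd adclient);   # processes to verify
my @missing;
for my $d (@daemons) {
    # pgrep exits non-zero if no process with that exact name is running;
    # discard its stdout so it doesn't leak into the machine ad output
    push @missing, $d if system("/usr/bin/pgrep -x $d >/dev/null 2>&1") != 0;
}

my $fault = @missing ? "true" : "false";
my $msg   = @missing ? "Process not running: @missing" : "No error";
print "ProcessFault = $fault\n";
print "ProcessFaultMsg = \"$msg\"\n";
print "--\n";
-----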
Not all checks need a cron job. For example, you can stop accepting new jobs if TotalLoadAvg runs 20% over the number of available CPUs, whether because an NFS server stops responding or because jobs claim all the CPUs on the system in spite of their RequestCPUs. It uses TotalLoadAvg rather than CondorLoadAvg so that it catches external problems as well as job-related ones.
-----
CpuIsOversubscribed = ( TotalLoadAvg > (TotalCpus * 1.20) )
CpuIsOversubscribedMsg = \
"CPU is oversubscribed: TotalLoadAvg is over 120% of TotalCpus"
STARTD_ATTRS = $(STARTD_ATTRS) CpuIsOversubscribed
# Reject jobs when oversubscribed
START = $(START) && ( CpuIsOversubscribed =!= True )
StartError = ifThenElse( CpuIsOversubscribed, $(CpuIsOversubscribedMsg), \
$(StartError) )
-----
I hope this proves useful to you!
Michael Pelletier
Principal Technologist
High Performance Computing
Classified Infrastructure Services
C: +1 339.293.9149
michael.v.pelletier@xxxxxxx
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Tim Theisen via HTCondor-users
Sent: Tuesday, January 21, 2025 6:30 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>; condor-igwn@xxxxxxxxxxxxxx
Cc: Tim Theisen <tim@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] STARTD_CRON module for node health check?
Hello Steffen,
I recommend the "errors=panic" mount option. That way, when the kernel detects a disk error, it takes the whole system down immediately. I found the technique useful in a previous job. Having a node take itself down was much better than having a disk go read-only and having the node start behaving badly.
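For reference, the option is set per filesystem in /etc/fstab; the device and mount point below are just placeholders:
-----
/dev/sda3  /scratch  ext4  defaults,errors=panic  0  2
-----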
...Tim
On 1/21/25 02:07, Steffen Grunewald wrote:
> Good morning,
>
> it seems I'm in dire need of a health checker that can take execution
> nodes out of HTCondor service (or completely down) faster, and more
> reliably, than a human admin can.
>
> In particular, I've been facing read-only disks caused by ageing
> connectors failing due to thermal/mechanical stress.
> These nodes are 2U4N nodes which add extra connectors along the data
> path to the disks kept in the common enclosure.
>
> I'm sure such a module already exists, so before I start to write one
> myself (that's got to be somewhat resilient against sudden disk
> disconnects) I'm asking here first.
> Fortunately it seems that executables invoked often enough may be kept
> in page cache and would be accessed from there _even if the disk is
> gone_, but such a module should avoid writing to $TMPDIR etc for
> obvious reasons while still changing a crucial attribute (that would
> go into the START expression?) or using other means ("ipmitool power
> off" would be one of the axe type) to disconnect/disable the "black hole" node.
>
> I'd appreciate any pointers - this is driving me crazy, in particular
> as I'm currently "grounded" by a virus infection and can't perform any
> manual (as in hands-on, literally) maintenance to tame the misbehaving connectors.
>
> Thanks so far,
> keep safe,
> Steffen
>
--
Tim Theisen (he, him, his)
Release Manager
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736