HPC Cluster queue restored; users advised to proceed with caution

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Date:	Wed, 2 Dec 2020 17:22:26 -0600
From:	chtc-users@xxxxxxxxxxx
Subject:	HPC Cluster queue restored; users advised to proceed with caution

Hello again,

The cluster queue and Slurm functions have been restored; the incident was traced to a malfunctioning InfiniBand switch. Some jobs continued to run during the downtime and others have begun running again, but some may have failed and left the queue. Users are advised to review their error/output files and the queue to determine whether any jobs need to be resubmitted.
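For users with many jobs, the queue check can be scripted. The following is a minimal sketch, not an official CHTC tool: it assumes you kept a list of the job IDs you submitted, and it compares that list against the output of `squeue -h -u <user> -o "%i %T"`. The function names and the sample job IDs are hypothetical.

```python
import subprocess


def parse_squeue(text):
    """Map job ID -> state from `squeue -h -o "%i %T"` output."""
    jobs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            jobs[parts[0]] = parts[1]
    return jobs


def missing_jobs(submitted_ids, squeue_text):
    """Return the submitted job IDs that no longer appear in the queue."""
    in_queue = parse_squeue(squeue_text)
    return [job_id for job_id in submitted_ids if job_id not in in_queue]


def current_queue_text(user):
    """Fetch the live queue for one user (needs the Slurm client tools)."""
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-o", "%i %T"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample standing in for real squeue output.
sample = "1001 RUNNING\n1003 PENDING\n"
print(missing_jobs(["1001", "1002", "1003"], sample))  # prints ['1002']
```

A job that is missing from the queue may have finished normally or may have failed, so its error/output files still need to be checked before resubmitting.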

While we believe full network capabilities are restored, the cluster is at reduced capacity while we work to return some nodes to service (marked as 'down' in Slurm's 'sinfo' command output). Additionally, we would like to caution users that the cluster may be less reliable than usual, at least until we can observe stable behavior of the affected hardware over the coming hours and days.
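To see which nodes are currently marked down, the `sinfo` output can be filtered. This is a rough sketch that assumes node-oriented output from `sinfo -h -N -o "%N %t"` (node name, then compact state); the node names in the sample are made up and do not reflect the actual cluster.

```python
def down_nodes(sinfo_text):
    """Return sorted, de-duplicated names of nodes whose state starts with 'down'.

    Expects lines of the form "<node> <state>", as produced by
    `sinfo -h -N -o "%N %t"`. A state such as "down*" (node not
    responding) is also counted as down.
    """
    nodes = set()
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].startswith("down"):
            nodes.add(parts[0])
    return sorted(nodes)


# Hypothetical sample; a node in two partitions is listed twice by `sinfo -N`.
sample = "hpc001 idle\nhpc002 down\nhpc003 down*\nhpc002 down\n"
print(down_nodes(sample))  # prints ['hpc002', 'hpc003']
```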

As always, if you notice any errors that you're unsure of how to address, please send an email to chtc@xxxxxxxxxxx with details.

Thank you,
Your CHTC Team

On Wed, Dec 2, 2020 at 1:34 PM chtc-users--- via CHTC-users <chtc-users@xxxxxxxxxxx> wrote:

Greetings,

This message is for users of CHTC's HPC Cluster. Users of only the HTC System can ignore it.

We are currently working to understand and fix a networking issue affecting many of the execute nodes in the HPC Cluster, as well as the server that operates the queue. As a result of this outage, the cluster's queue and all Slurm commands are failing, though users are still able to log into the main head node (hpclogin1.chtc.wisc.edu). The full extent of the impact on queued jobs is not yet clear.

We are still investigating on-site and do not yet know how long it will take to diagnose and fix the issue and restore the cluster to functionality. We appreciate your patience, and will provide updates with any changes to functionality or timeline.

Thank you,
Your CHTC Team
_______________________________________________
CHTC-users mailing list
CHTC-users@xxxxxxxxxxx
To unsubscribe send an email to:
chtc@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/chtc-users
