Date: | Wed, 2 Dec 2020 17:22:26 -0600 |
---|---|
From: | chtc-users@xxxxxxxxxxx |
Subject: | HPC Cluster queue restored; users advised to proceed with caution |
Hello again,

The cluster queue and Slurm functions have been restored; the incident was traced to a malfunctioning Infiniband switch. While some jobs continued to run during the downtime and others have begun running again, others may have failed and left the queue. Users are advised to review their error/output files and the queue to determine whether any jobs will need to be resubmitted.

While we believe full network capabilities are restored, the cluster is at reduced capacity while we work to reinstate some nodes (marked as 'down' in Slurm's 'sinfo' command output). Additionally, we would like to caution users that cluster functionality is at risk for lower reliability, at least until we can observe stable behavior of the affected hardware over the coming hours and days.

As always, if you notice any errors that you're unsure of how to address, please send an email to chtc@xxxxxxxxxxx with details.

Thank you,
Your CHTC Team

On Wed, Dec 2, 2020 at 1:34 PM chtc-users--- via CHTC-users <chtc-users@xxxxxxxxxxx> wrote: