Greetings,
We are happy to report that the HPC Cluster and the affected HTC System execute servers are back up and appear to be working normally. All of the affected servers are in a server room that is now on a backup chiller
(the reason for the Nov 18 planned outage) until the main chiller is replaced in a few months, as a matter of regular maintenance. It was the backup chiller that experienced an unforeseen component failure on Friday night, causing this past weekend’s outage.
Given the unusual rate of recent outages, we would like to remind all users to plan ahead for deadlines, maintain copies of essential data on non-CHTC systems, and copy output off of CHTC as soon as
possible after jobs complete, so that it can be accessed even in the event of an outage. Servers in the HPC Cluster and HTC System are the first to be automatically powered down when power and cooling are limited, as their function is less critically
dependent on 24x7 operation than other servers and services in the server rooms we share.
While we don’t anticipate any additional near-term outages, outages do take longer to recover from during holidays, when staffing may be limited. Please keep this in mind as we approach
the winter break and its associated deadlines.
And as usual, continue to bug us with anything at all via
chtc@xxxxxxxxxxx. We hope you had an enjoyable Thanksgiving holiday!
Cheers,
Lauren Michael
(on behalf of the CHTC Team)
lmichael@xxxxxxxx,
go.wisc.edu/lmcal, Discovery 2262, (608)316-4430, she/her
From: CHTC-users <chtc-users-bounces@xxxxxxxxxxx> on behalf of chtc-users--- via CHTC-users <chtc-users@xxxxxxxxxxx>
Date: Monday, November 29, 2021 at 9:14 AM
To: CHTC Users <chtc-users@xxxxxxxxxxx>
Cc: chtc-users@xxxxxxxxxxx <chtc-users@xxxxxxxxxxx>
Subject: Outage since late Friday, Nov 26; HPC Cluster and parts of the HTC System are still down
Greetings,
As some of you noticed over the holiday weekend, the HPC Cluster and parts of the HTC System are down after new complications following the previous weekend’s planned maintenance. We’re currently working to
get things back online as soon as we think they can be stably supported and will provide updates as we have them.
Affected systems are the same as for the recent planned maintenance (all in the same server room in the Discovery building):
- HPC Cluster (entire; no login possible)
- a portion of the HTC execute servers (including some researcher-owned and GPU hardware)
The HTC submit servers and the majority of the HTC execute capacity are still up (nearly all are in another building). We are not yet certain of the state of the HPC Cluster queue.
We hope you had a nice Thanksgiving, and we understand the frustration of coming back to downed components. Thank you for your patience, and please contact us with any questions or issues at
chtc@xxxxxxxxxxx
Regards,
Your CHTC Team