Re: Outage since late Friday, Nov 26; HPC Cluster and parts of the HTC System are still down


Date: Tue, 30 Nov 2021 01:22:52 +0000
From: chtc-users@xxxxxxxxxxx
Subject: Re: Outage since late Friday, Nov 26; HPC Cluster and parts of the HTC System are still down

Greetings,

 

We are happy to report that the HPC Cluster and affected HTC System execute servers are up and appear to be working appropriately. All of the affected servers are in a server room that is now on a backup chiller (the reason for the Nov 18 planned outage) until the main chiller is replaced in a few months, as a matter of regular maintenance. It was the backup chiller that experienced an unforeseen component failure on Friday night, causing this past weekend’s outage.

 

Given the unusual rate of recent outages, we would like to remind all users to plan ahead for deadlines, maintain copies of essential data on non-CHTC systems, and copy output off of CHTC as soon as possible after jobs complete, so that it can be accessed even in the event of an outage. Servers in the HPC Cluster and HTC System are the first to be automatically powered down in the event of limited power and cooling, as their function is less critically dependent on 24x7 operation than other servers and services in the server rooms we share.

 

While we don’t anticipate any additional near-term outages, outages do take longer to recover from during holidays, when staffing may be limited. Please keep this in mind as we approach the winter break and its associated deadlines.

 

And as usual, continue to bug us with anything at all via chtc@xxxxxxxxxxx. We hope you had an enjoyable Thanksgiving holiday!

 

Cheers,

 

Lauren Michael

(on behalf of the CHTC Team)

 

Lauren Michael - Research Computing Facilitator, Center for High Throughput Computing, University of Wisconsin - Madison 

Research Facilitation Lead, Open Science Grid; co-PI, PATh; co-PI, CaRCC

lmichael@xxxxxxxx, go.wisc.edu/lmcal, Discovery 2262, (608)316-4430, she/her

 

From: CHTC-users <chtc-users-bounces@xxxxxxxxxxx> on behalf of chtc-users--- via CHTC-users <chtc-users@xxxxxxxxxxx>
Date: Monday, November 29, 2021 at 9:14 AM
To: CHTC Users <chtc-users@xxxxxxxxxxx>
Cc: chtc-users@xxxxxxxxxxx <chtc-users@xxxxxxxxxxx>
Subject: Outage since late Friday, Nov 26; HPC Cluster and parts of the HTC System are still down

Greetings,

 

As some of you noticed over the holiday weekend, the HPC Cluster and parts of the HTC System are down after new complications following the previous weekend’s planned maintenance. We’re currently working to get things back online as soon as we think they can be stably supported and will provide updates as we have them.

 

Affected systems are the same as for the recent planned maintenance (all in the same server room in the Discovery building):

  • HPC Cluster (entire; no login possible)
  • a portion of the HTC execute servers (including some researcher-owned and GPU hardware)

 

The HTC submit servers and the majority of the HTC execute capacity are still up (nearly all are in another building). We are not yet certain of the state of the HPC Cluster queue.

 

We hope you had a nice Thanksgiving, and we understand the frustration of coming back to downed components. Thank you for your patience, and please contact us with any questions or issues at chtc@xxxxxxxxxxx.

 

Regards,

Your CHTC Team
