Additional HPC Cluster and HTC System servers down


Date: Thu, 02 Apr 2020 15:39:27 -0500
From: chtc-users@xxxxxxxxxxx
Subject: Additional HPC Cluster and HTC System servers down
Greetings CHTC users,

Additional CHTC services have been turned off due to an unexpected failure in the backup cooling system for the server room currently undergoing maintenance. In addition to our previously communicated outages (described in our original email, below), the following services are impacted:

  • High Performance Cluster
  • High Throughput Computing System
We don't yet know if the situation will improve to the point where we can turn certain key services back on. If any additional servers go down, or we're able to bring other servers back up, we will let you know via the chtc-users mailing list.

Again, please get in touch at chtc@xxxxxxxxxxx with any questions or concerns, especially if this outage means that you won't make a hard deadline.

Best,
Your CHTC team

---------- Forwarded message ---------
From: chtc-users--- via CHTC-users <chtc-users@xxxxxxxxxxx>
Date: Wed, Apr 1, 2020 at 5:41 PM
Subject: Immediate: HPC Cluster and portions of HTC System down April 1 - 5
To: chtc-users <chtc-users@xxxxxxxxxxx>
Cc: <chtc-users@xxxxxxxxxxx>

Greetings CHTC users,

Due to campus chilled-water maintenance announced this afternoon, CHTC needs to turn off major components of our computing services for the next 4 days (our server rooms depend heavily on chilled water for server cooling). We've already begun powering down a number of servers, with more to come as described in the categories below.

The HPC Cluster will be down:
  • All cluster execute servers will be turned off; no jobs will be able to run.
  • Jobs that were running on the cluster as of this afternoon (April 1) will be interrupted and re-queued to run again after the downtime.
  • HPC cluster users will still be able to log into the cluster head node (which will remain up) in order to access data. As a reminder, users should NEVER run computational work on the head nodes. Restrict all head node operations to data perusal and transfers to/from your non-CHTC storage. Users violating this policy will have their HPC Cluster login access deactivated for the remaining duration of the downtime.
Portions of the HTC System will be down:
  • Roughly  of the execute servers will be turned off. Any jobs running on these servers this afternoon (April 1) will be evicted, but will remain in the queue to re-run.
  • Many of our researcher-owned GPU servers are included in the group of execute servers that will be shut down.
  • The HTC System will otherwise continue to function normally (including SQUID, /staging, and the transfer server), albeit with a smaller number of execute servers. Users of the HTC system may see fewer jobs running than they would normally, with full throughput returning following the downtime.
There is a chance that we will need to turn off more servers; we will endeavor to provide immediate (or advance) notice if this becomes necessary.

The campus maintenance is expected to conclude by Sunday, April 5 at 8pm CDT. We will send an email via this address (chtc-users@xxxxxxxxxxx) confirming when our systems are back online.

Please get in touch at chtc@xxxxxxxxxxx with any questions or concerns, especially if this outage means that you won't make a hard deadline. We'll do our best to help you with potential alternative solutions.

Best,
Your CHTC Team
