Immediate: HPC Cluster and portions of HTC System down for 3-4 days


Date: Tue, 07 May 2019 10:33:19 -0500
From: chtc-users@xxxxxxxxxxx
Subject: Immediate: HPC Cluster and portions of HTC System down for 3-4 days

Greetings CHTC users,


Due to campus chilled-water maintenance announced just this morning, CHTC needs to turn off major components of our computing services for the next 3-4 days (our server rooms depend heavily on chilled water for server cooling). We've already begun powering down a number of servers, with more to come as described in the categories below.


The HPC Cluster will be down:

  • All cluster execute servers will be turned off; no jobs will be able to run.

  • Jobs that were running on the cluster as of this morning will be interrupted and re-queued to run again after the downtime.

  • HPC cluster users will still be able to log into the cluster head node (which will remain up) in order to access data. As a reminder, users should NEVER run computational work on the head nodes. Restrict all head node operations to data perusal and transfers to/from your non-CHTC storage. Users violating this policy will have their HPC Cluster login access deactivated for the remaining duration of the downtime.


Portions of the HTC System will be down:

  • Roughly  of the execute servers will be turned off. Any jobs running on these servers this morning will be evicted, but will remain in the queue to re-run.

  • The HTC System will otherwise continue to function normally (including SQUID, Gluster, and the transfer server), albeit with a smaller number of execute servers. Users of the HTC system may see fewer jobs running than they would normally, with full throughput returning following the downtime.


There is a chance that we will need to turn off more servers; we will endeavor to provide immediate (or advance) notice if this becomes necessary.


The campus maintenance is expected to conclude by Saturday. We will send an email via this address (chtc-users@xxxxxxxxxxx) confirming when our systems are back online.


Please get in touch at chtc@xxxxxxxxxxx with any questions or concerns, especially if this outage means that you won't make a hard deadline. We'll do our best to help you with potential alternative solutions.


Best,

Your CHTC Team
