Re: Immediate: HPC Cluster and portions of HTC System down for 3-4 days


Date: Fri, 10 May 2019 21:15:47 -0500
From: chtc-users@xxxxxxxxxxx
Subject: Re: Immediate: HPC Cluster and portions of HTC System down for 3-4 days
Greetings,

The HPC Cluster is back up and running queued jobs again.

The components of the HTC System that were taken down are being restored more gradually, but we expect all of them to be up by sometime Monday. As a reminder, a brief downtime is still scheduled for May 16 for some components of the HTC System. (See email from yesterday.)

As always, please get in touch with any questions or concerns.

Thank you,
Your CHTC Team

On Tue, May 7, 2019 at 15:36 <chtc-users@xxxxxxxxxxx> wrote:
Greetings CHTC users,

Our apologies if you are getting this email twice, but our initial notification sent out this morning wasn't delivered to everyone.

In short: we turned off the HPC cluster and some of our HTC execute nodes this morning due to cooling-related issues. See our original email below (or http://chtc.cs.wisc.edu/user-news.shtml) for details.

Cheers,
Your CHTC team

---------- Forwarded message ---------
From: Christina Koch <ckoch5@xxxxxxxx>
Date: Tue, May 7, 2019 at 10:33 AM
Subject: Immediate: HPC Cluster and portions of HTC System down for 3-4 days
To: chtc-users <chtc-users@xxxxxxxxxxx>


Greetings CHTC users,


Due to campus chilled-water maintenance announced just this morning, CHTC needs to turn off major components of our computing services for the next 3-4 days (our server rooms depend heavily on chilled water for server cooling). We've already begun powering down a number of servers, with more to come as described in the categories below.


The HPC Cluster will be down:

  • All cluster execute servers will be turned off; no jobs will be able to run.

  • Jobs that were running on the cluster as of this morning will be interrupted and re-queued to run again after the downtime.

  • HPC cluster users will still be able to log into the cluster head node (which will remain up) in order to access data. As a reminder, users should NEVER run computational work on the head nodes. Restrict all head node operations to data perusal and transfers to/from your non-CHTC storage. Users violating this policy will have their HPC Cluster login access deactivated for the remaining duration of the downtime.


Portions of the HTC System will be down:

  • Roughly  of the execute servers will be turned off. Any jobs running on these servers this morning will be evicted, but will remain in the queue to re-run.

  • The HTC System will otherwise continue to function normally (including SQUID, Gluster, and the transfer server), albeit with a smaller number of execute servers. Users of the HTC system may see fewer jobs running than they would normally, with full throughput returning following the downtime.


There is a chance that we will need to turn off more servers; we will endeavor to provide immediate (or advance) notice if this becomes necessary.


The campus maintenance is expected to conclude by Saturday. We will send an email via this address (chtc-users@xxxxxxxxxxx) confirming when our systems are back online.


Please get in touch at chtc@xxxxxxxxxxx with any questions or concerns, especially if this outage means that you won't make a hard deadline. We'll do our best to help you with potential alternative solutions.


Best,

Your CHTC Team

_______________________________________________
CHTC-users mailing list
CHTC-users@xxxxxxxxxxx
To unsubscribe send an email to:
chtc@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/chtc-users