HPC cluster update: additional downtime next week (3/26)


Date: Fri, 23 Mar 2018 15:30:06 -0500
From: chtc-users@xxxxxxxxxxx
Subject: HPC cluster update: additional downtime next week (3/26)

Greetings HPC Cluster Users,
(those who only use the HTC System can ignore the below)

The following announcement is for users of our HPC cluster (logging in to aci-service-1.chtc.wisc.edu or aci-service-2.chtc.wisc.edu). Due to persistent file system issues on the HPC cluster, we are planning a short downtime for the cluster next week (starting Tuesday or later) to upgrade the file system again.

As announced on Monday, there is an underlying file system issue that is causing the cluster nodes to be unstable. Our administrator has been investigating throughout the week, but has found no obvious cause (such as user behavior or quota enforcement) for this particular problem. The best short-term option for improving the cluster's performance is to update to a newer and better-supported version of our file system software.

What you can expect / how to prepare:
  • The downtime will be next week, as early as Tuesday (3/27). We will send an email on Monday with exact dates.
  • We will maintain all data that is already on the cluster over the downtime so there won't be a need to transfer your files on and off the cluster, either before or after the downtime.
  • If you have already requested a quota increase, the changed quota has been recorded and will be automatically applied on the upgraded file system.
  • Jobs that are running on the cluster when the downtime starts will be removed and will need to be resubmitted when the cluster is back up.
  • Do not transfer any new data to the cluster over the weekend; the less we have on the system at the start of the downtime, the faster the whole process will go.

Thank you for your patience as we work through these unanticipated issues. If you have any additional questions or concerns, please email chtc@xxxxxxxxxxx

Best,
Your CHTC Team