Greetings,
Due to just-confirmed maintenance on the cooling infrastructure in one of CHTC’s server rooms, we will experience a full HPC Cluster outage and a partial HTC System outage beginning in the afternoon on Thursday, November 18, with service restored by Monday, November 22.
Impacts to the HPC Cluster
All hardware (head nodes, execute nodes, storage) in the HPC cluster will be powered down during the planned outage.
To prevent HPC Cluster jobs from being interrupted by the downtime, we will begin draining the nodes one week prior to the downtime.
Jobs whose requested time would extend past the start of the November 18 downtime will be accepted into the queue, but will not run until after the cluster is back up. Jobs can still run on the cluster in the week before the downtime IF their time request (“--time=” in the submit file) indicates that they will complete before the morning of November 18.
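As a rough illustration of the time-request rule above, the following sketch computes the longest “--time=” value that would still finish before the downtime. It is not a CHTC tool. The 8:00 am cutoff and the year 2021 are assumptions (the announcement only says "the morning of November 18"), the function name is hypothetical, and the D-HH:MM:SS output assumes the scheduler accepts Slurm-style time strings.

```python
from datetime import datetime, timedelta

# Assumption: jobs must finish by 08:00 on November 18. The announcement
# only says "the morning", so adjust this to the real cutoff.
# Assumption: the year is 2021 (the announcement does not state it).
DOWNTIME_CUTOFF = datetime(2021, 11, 18, 8, 0)


def max_time_request(now, cutoff=DOWNTIME_CUTOFF):
    """Return the longest whole-minute time request, formatted as
    D-HH:MM:SS, that ends at or before the cutoff. Returns None if
    less than one minute remains."""
    remaining = cutoff - now
    total_minutes = int(remaining.total_seconds() // 60)
    if remaining <= timedelta(0) or total_minutes == 0:
        return None
    days, rest = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:00"


if __name__ == "__main__":
    limit = max_time_request(datetime.now())
    if limit is None:
        print("No time remains before the assumed cutoff.")
    else:
        print(f"Longest request that fits: --time={limit}")
```

A job submitted with a time request at or below the printed value should be eligible to start before the downtime, subject to normal queue wait. A job that waits in the queue has less time left when it starts, so a shorter request is safer.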
Impacts to the HTC System
The following components of the HTC system will be powered down during the outage:
- a subset of HTC execute nodes
- possibly the following submit servers, which we hope to keep up but which would likely be inaccessible through Nov 22 if they do go down: submit2.chtc.wisc.edu, submit3.chtc.wisc.edu, learn.chtc.wisc.edu
While jobs on the affected submit servers and execute servers will be interrupted when they go down, they will remain in the queue to run again once the submit servers are back up.
Otherwise, HTC users should not be impacted by this outage.
It is possible that the exact dates of the outage will shift, and we realize this is somewhat short notice, but we plan to provide a reminder or update at least one day prior to the start of the downtime.
Please contact us at chtc@xxxxxxxxxxx with any questions or concerns.
Best,
Your CHTC team
lmichael@xxxxxxxx,
go.wisc.edu/lmcal, Discovery 2262, (608)316-4430, she/her