Numerous storm-induced CHTC outages


Date: Mon, 13 Jun 2022 20:28:47 +0000
From: chtc-users@xxxxxxxxxxx
Subject: Numerous storm-induced CHTC outages

Greetings,

 

Numerous servers in CHTC’s HTC System and the entirety of the HPC Cluster went down in a fairly disruptive way because of power outages to campus buildings during this afternoon’s storm. We are working to restore functionality and will provide updates as we can. In the meantime, here are our general expectations:

 

All jobs running on the HPC Cluster and most running in the HTC System will have been interrupted. While queued jobs on the HTC System will remain in the queue to run again, interrupted jobs on the HPC Cluster may need to be resubmitted.

 

As we bring up HTC submit nodes and the HPC Cluster head nodes, users are welcome to log in and clean up incomplete data and remove jobs. However, please know that there may be additional interruptions (especially on the HPC Cluster) or missing functionality as we ensure that servers are rebooted in a proper state. Additionally, it may take time to restore the full capacity of down execute nodes.

 

More updates to come.

 

Thank you,

Your CHTC Team

 

Lauren Michael - Research Computing Facilitator, Center for High Throughput Computing, University of Wisconsin - Madison 

Research Facilitation Lead, Open Science Grid; co-PI, PATh; co-PI, CaRCC

lmichael@xxxxxxxx, go.wisc.edu/lmcal, Discovery 2262, (608)316-4430, she/her
