Datacenter outage overnight: HPC Cluster and some of the HTC System affected


Date: Thu, 01 Nov 2018 09:46:35 -0500
From: chtc-users@xxxxxxxxxxx
Subject: Datacenter outage overnight: HPC Cluster and some of the HTC System affected
Greetings CHTC Users,

One of CHTC's datacenters experienced a power outage overnight that shut down many servers in the HPC Cluster and some of the HTC System's execute servers. We are in the process of ensuring that everything is powered back on properly, but users should be prepared to take action as described below:

HTC System:
  • Any jobs interrupted when running on an affected execute server will have remained in the queue to be re-run on another execute server.
  • All jobs running on UW-Madison's OSG pool (via WantGlidein) were interrupted, and it may not be until later today that we can restore OSG availability (users of OSG may see fewer jobs running until then).
HPC Cluster:
  • All running jobs will have been interrupted, though we believe all queued jobs should still be in the queue (re-queued) once the HPC Cluster has been properly rebooted.
  • We'll send another email when we're sure the cluster is fully functional.

Also, we apologize for the coincidence of multiple emails this week. Thank you for your patience, and get in touch with any questions via chtc@xxxxxxxxxxx

Sincerely,
Your CHTC Team

Lauren Michael - Research Computing Facilitator, Center for High Throughput Computing
University of Wisconsin - Madison
lmichael@xxxxxxxx
tinyurl.com/LMichaelCalendar
Discovery 2262, (608) 316-4430