Re: Discovery network outage affecting HTC System and HPC Cluster


Date: Thu, 28 Mar 2019 06:54:38 -0500
From: chtc-users@xxxxxxxxxxx
Subject: Re: Discovery network outage affecting HTC System and HPC Cluster
An update:

The Discovery building network was restored overnight. We will be assessing any remaining issues this morning, but the HPC Cluster is accessible again, and HTC execute servers in Discovery are accepting jobs again.

It is likely that some HPC Cluster jobs will need to be resubmitted. Additionally, HTC jobs that were using Gluster when the outage first occurred likely failed and will need to be resubmitted.

As always, please email chtc@xxxxxxxxxxx if there are issues that you are not sure how to address.

Thank you,
CHTC

On Wed, Mar 27, 2019 at 17:16 <chtc-users@xxxxxxxxxxx> wrote:
For All CHTC Users:

Since roughly 2:15pm today, there has been an unplanned network outage in the Discovery building, where the HPC Cluster and portions of the HTC System are located. Though we are not readily able to log into some servers to confirm the full extent of interruptions to CHTC services, the following (at least) will be affected:
  • The HPC Cluster is still inaccessible for user login during the outage, but may continue to run jobs; however, jobs making connections outside the cluster (moving data to/from other campus or off-campus locations) will likely fail and will have to be resubmitted.
  • Jobs running on some HTC execute servers will have been interrupted, though HTCondor will re-queue them (in the Idle state) and run them on another server when possible. Execute servers in the Discovery building will not be able to accept new jobs during the outage, and jobs already running on most HTC System servers will continue normally.
Generally, the HTC System submit servers and transfer servers are still accessible via SSH because they are not in the Discovery building, and new jobs can still be submitted to the HTC System.

There may be additional issues introduced when the Discovery network comes back online, and we do not yet have information on when that will happen. We'll provide an update after we've had a chance to understand additional impacts and/or any necessary user actions following the outage.

Thank you for your patience, and continue to let us know if you have questions or issues.

Your CHTC Team
_______________________________________________
CHTC-users mailing list
CHTC-users@xxxxxxxxxxx
To unsubscribe send an email to:
chtc@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/chtc-users