[Chtc-users] All CHTC functionality restored!!


Date: Thu, 29 Jan 2015 16:30:09 -0600
From: chtc-users@xxxxxxxxxxx
Subject: [Chtc-users] All CHTC functionality restored!!
Greetings!

The HTC submit node, submit-3, and all other CHTC compute systems are once again fully functional!!
Users may now return to all regular computational activity via the CHTC's HTC submit nodes and HPC cluster head nodes.

Thank you for your patience during the urgent reboot of all of CHTC's servers within the last day, and especially for the patience of those using submit-3.chtc.wisc.edu, which required some additional testing this afternoon.

Please continue to let CHTC staff know if/when you experience difficulties in using our compute systems. Hopefully, we won't have to bother you with emails for a while ...

Happy Computing,
Your CHTC Team


On Wed, Jan 28, 2015 at 4:29 PM, Lauren Michael <lmichael@xxxxxxxx> wrote:
The HPC Cluster is back online after the necessary reboot (see further below).

As a reminder, ALL jobs from the HPC Cluster will need to be resubmitted, as they are lost from the SLURM queue upon reboot.

Thank you,
Your CHTC Team


On Wed, Jan 28, 2015 at 2:06 PM, <chtc-users@xxxxxxxxxxx> wrote:
ATTENTION CHTC Users!!

Due to very recent information on a critical vulnerability in the operating systems we use for CHTC compute servers,
ALL CHTC SERVERS NEED TO BE REBOOTED TODAY (see below)


For CHTC's HTC System (HTCondor Pool via submit nodes):
The process to reboot all servers has already begun, and will take place over the next 24 hours due to the large number of servers.

What HTC users can expect:
  • Temporary delays in access to submit servers during their reboot (planned for early tomorrow).
  • Interruption of running jobs as execute servers are automatically rebooted over the next 24 hours. Interrupted jobs WILL continue to be tracked and will be re-run by HTCondor.
  • Delays in the running of newly-submitted jobs until all reboots are complete.

For CHTC's HPC Cluster (via head node: aci-service-1.chtc.wisc.edu):
The HPC Cluster will be rebooted at 3pm today, and brought back ASAP after that point.

What HPC users can expect:
  • Loss of SSH access to cluster head nodes (aci-service-1/2) during the reboot.
  • JOBS WILL BE LOST AND NEED TO BE RESUBMITTED, as SLURM cannot recover jobs upon reboot.

CHTC staff will send emails when the reboot processes have completed and compute system functionality is restored. The security vulnerability applies to all RedHat-based Linux operating systems, including the Scientific Linux operating system we use in CHTC. The security of your work is of utmost importance to CHTC, and this specific vulnerability requires immediate action.

The security vulnerability and the CHTC-wide reboot are completely unrelated to the previously described downtime for /mnt/gluster and high-memory servers in the HTC System that was necessary this morning. We apologize for any interruption to your CHTC research!

Thank you,
Your CHTC Team


(care of)
Lauren Michael - Research Computing Facilitator, University of Wisconsin - Madison



