Outage extended to submit2/3; flocking/gliding jobs with HTTP transfer may go on hold


Date: Fri, 19 Nov 2021 15:57:12 +0000
From: chtc-users@xxxxxxxxxxx
Subject: Outage extended to submit2/3; flocking/gliding jobs with HTTP transfer may go on hold

Hello Again,

 

While the outage of the HPC Cluster has gone as planned, additional servers in the HTC System had to be taken down just this morning to preserve server and room temperatures during the cooling maintenance:

 

  • The submit2.chtc.wisc.edu and submit3.chtc.wisc.edu servers are now down, likely through the weekend. Jobs that were running from these servers will be interrupted, but the queue will be preserved, so queued and interrupted jobs will run again after the downtime.
  • HTC jobs submitted with WantFlocking and/or WantGlidein that include HTTP file transfer may go on hold when running outside of CHTC’s pool (due to failed transfer of HTTP data). We are currently working to prevent jobs from continuing to run outside of CHTC, and users can otherwise release jobs held for this reason. Please contact us if you are unsure of how to check hold reasons or how to release your jobs to run again.
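
As a rough illustration of the check-and-release step mentioned above: on an HTCondor submit server, `condor_q -hold` lists held jobs with their hold reasons, and `condor_release` followed by job IDs releases them. The sketch below is not an official CHTC tool. It assumes job ads are already available as Python dictionaries (for example, parsed from `condor_q -json` output), and the hold-reason text and the "transfer" search term are hypothetical; check the actual wording in your own hold reasons.

```python
HELD = 5  # HTCondor JobStatus code for a held job


def transfer_held_ids(job_ads, needle="transfer"):
    """Return 'cluster.proc' IDs of held jobs whose HoldReason mentions `needle`.

    `needle` is an assumed search term, not a confirmed hold-reason string.
    """
    ids = []
    for ad in job_ads:
        reason = ad.get("HoldReason", "")
        if ad.get("JobStatus") == HELD and needle.lower() in reason.lower():
            ids.append(f"{ad['ClusterId']}.{ad['ProcId']}")
    return ids


def release_command(job_ids):
    """Build a condor_release command line for the given IDs (None if empty)."""
    if not job_ids:
        return None
    return " ".join(["condor_release", *job_ids])


# Hypothetical job ads for illustration only; the HoldReason text is invented.
example_ads = [
    {"ClusterId": 1234, "ProcId": 0, "JobStatus": 5,
     "HoldReason": "Failed to transfer input files over HTTP"},
    {"ClusterId": 1234, "ProcId": 1, "JobStatus": 2},  # running, not held
    {"ClusterId": 1300, "ProcId": 0, "JobStatus": 5,
     "HoldReason": "Job exceeded its memory request"},  # held for another reason
]

held = transfer_held_ids(example_ads)
print(release_command(held))
```

The printed command can be reviewed before it is run by hand, which keeps jobs held for unrelated reasons (such as the memory example) from being released by accident.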

 

We will provide further updates as we have them, and we apologize for the additional inconvenience. We still anticipate restored service by Monday, November 22. As usual, please send any questions or urgent issues to chtc@xxxxxxxxxxxx

 

Thank you,

Your CHTC Team

 

From: CHTC-users <chtc-users-bounces@xxxxxxxxxxx> on behalf of chtc-users--- via CHTC-users <chtc-users@xxxxxxxxxxx>
Date: Wednesday, November 17, 2021 at 12:37 PM
To: CHTC Users <chtc-users@xxxxxxxxxxx>
Cc: chtc-users@xxxxxxxxxxx <chtc-users@xxxxxxxxxxx>
Subject: Reminder: Full HPC and Partial HTC Outages Nov 18 - Nov 22

Hello Again,

 

This is just a reminder that the HPC Cluster and some HTC System components will go down starting tomorrow, November 18, with service to be fully restored by Monday. Full details are further below, in our original announcement.

 

Thank you,

Your CHTC Team

 

From: CHTC-users <chtc-users-bounces@xxxxxxxxxxx> on behalf of chtc-users--- via CHTC-users <chtc-users@xxxxxxxxxxx>
Date: Wednesday, November 10, 2021 at 9:45 AM
To: CHTC Users <chtc-users@xxxxxxxxxxx>
Cc: chtc-users@xxxxxxxxxxx <chtc-users@xxxxxxxxxxx>
Subject: Full HPC and Partial HTC Outages Nov 18 - Nov 22

Greetings,

 

Due to just-confirmed maintenance for the cooling infrastructure in one of CHTC’s server rooms, we will experience full HPC Cluster and partial HTC System outages beginning on Thursday, November 18, with service being restored by Monday, November 22.

 

Impacts to the HPC Cluster

All hardware (head nodes, execute nodes, storage) in the HPC cluster will be powered down during the planned outage.

To prevent HPC Cluster jobs from being interrupted by the downtime, we will begin draining the nodes one week prior to the downtime. Jobs submitted requesting time that would exceed the November 18 downtime will not run until after the cluster is back up, but will be accepted into the queue. Jobs can still run on the cluster within the week before the downtime, IF their time request (“--time=” in the submit file) indicates that they will complete before the morning of November 18.
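
As a rough illustration of the rule above, the sketch below checks whether a job starting at a given moment with a given “--time=” request would finish before the downtime. The 08:00 cutoff is an assumption (the announcement says only "the morning of November 18"), the function names are hypothetical, and only the `D-HH:MM:SS` and `HH:MM:SS` time formats are handled. The scheduler makes the real decision; this only shows the arithmetic.

```python
from datetime import datetime, timedelta

# Assumed cutoff: the announcement gives no exact hour, so 08:00 is a placeholder.
DOWNTIME_START = datetime(2021, 11, 18, 8, 0)


def parse_time_limit(spec):
    """Parse a time request of the form 'D-HH:MM:SS' or 'HH:MM:SS'."""
    days = 0
    if "-" in spec:
        day_part, spec = spec.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in spec.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def finishes_before_downtime(start, time_limit, cutoff=DOWNTIME_START):
    """True if a job starting at `start` with `time_limit` ends by `cutoff`."""
    return start + parse_time_limit(time_limit) <= cutoff


start = datetime(2021, 11, 15, 12, 0)
print(finishes_before_downtime(start, "2-00:00:00"))  # ends Nov 17, 12:00
print(finishes_before_downtime(start, "3-00:00:00"))  # ends Nov 18, 12:00
```

Under these assumptions, the two-day request fits before the cutoff and the three-day request does not, so the second job would wait in the queue until the cluster is back up.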

 

Impacts to the HTC System

The following components of the HTC system will be powered down during the outage: 

  • a subset of HTC execute nodes
  • the following submit servers may go down (and would likely be inaccessible through Nov 22), but we hope to keep them up: submit2.chtc.wisc.edu, submit3.chtc.wisc.edu, learn.chtc.wisc.edu

While jobs on the affected submit servers and execute servers will be interrupted when they go down, they will remain in the queue to run again once the submit servers are back up. Otherwise, HTC users should not be impacted by this outage. 

 

The exact dates of the outage may shift. We realize this is somewhat short notice, and we plan to provide a reminder or update at least one day before the downtime begins.

 

Please contact us at chtc@xxxxxxxxxxx with any questions or concerns. 

 

Best, 

Your CHTC team

 

 

CARE OF:

Lauren Michael - Research Computing Facilitator, Center for High Throughput Computing, University of Wisconsin - Madison 

Research Facilitation Lead, Open Science Grid; co-PI, PATh; co-PI, CaRCC

lmichael@xxxxxxxx, go.wisc.edu/lmcal, Discovery 2262, (608)316-4430, she/her
