Jan 10 Downtime Concluded


Date: Tue, 11 Jan 2022 17:57:11 +0000
From: chtc-users@xxxxxxxxxxx
Subject: Jan 10 Downtime Concluded

Hi Everyone,


We brought all components back up by early yesterday evening, and everything appears stable.


We are otherwise working to address some minor issues with accessing the HTC System's 'staging' filesystem from submit servers, which appear to have arisen over the weekend (prior to the planned outage). The filesystem is otherwise working, just not yet accessible via some CHTC servers; we hope to have this fully addressed by the end of the day.


Please send email to chtc@xxxxxxxxxxx with any issues or questions, and we’ll be happy to answer them.


And thank you, again, for your patience,

Your CHTC Team


From: CHTC-users <chtc-users-bounces@xxxxxxxxxxx> on behalf of chtc-users--- via CHTC-users <chtc-users@xxxxxxxxxxx>
Date: Thursday, January 6, 2022 at 9:17 AM
To: CHTC Users <chtc-users@xxxxxxxxxxx>
Cc: chtc-users@xxxxxxxxxxx <chtc-users@xxxxxxxxxxx>
Subject: Updated: HPC Cluster and some HTC components down for maintenance on Monday (1/10)

An update:


It looks like the maintenance can wait until this Monday, January 10. We will begin taking servers down at 8am that day, and plan for the downtime to last most of the day, updating users with any changes. The HPC Cluster configuration has been updated so that only jobs that will complete before 8am Monday (based upon their 'time' requirement) are allowed to run. All other details below still apply.


Thank you, again,

Your CHTC Team


From: CHTC-users <chtc-users-bounces@xxxxxxxxxxx> on behalf of chtc-users--- via CHTC-users <chtc-users@xxxxxxxxxxx>
Date: Wednesday, January 5, 2022 at 1:55 PM
To: CHTC Users <chtc-users@xxxxxxxxxxx>
Cc: chtc-users@xxxxxxxxxxx <chtc-users@xxxxxxxxxxx>
Subject: HPC Cluster and some HTC components will go down for maintenance Friday or later

Greetings CHTC Users,


To resolve some weather-induced complications affecting the temporary chiller for the Discovery datacenter, we will again need to take down the HPC Cluster and some HTC components, likely at 8am this Friday (January 7). It's possible that the outage will be pushed back to a similar time on Monday (January 10) or even later, so we'll confirm by tomorrow morning. Regardless of which day the outage starts, we expect it to last most of the day, with services hopefully restored by end-of-day.

Affected components will include:

  • the entire HPC Cluster (all: execute nodes, head nodes, filesystem, etc.)
  • a portion of the HTC execute servers
  • several HTC submit servers (including submit2, submit3, learn, and ucsbsubmit)


The HPC Cluster has already been configured to not accept new jobs that won't complete before 8am Friday (this will be adjusted if the downtime is scheduled for later), but jobs already running prior to that configuration change will be interrupted when we take servers down. These and other queued jobs will run (or run again) when the cluster is back up, so users may want to remove any jobs that might not handle this scenario well.
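For context, the 'time' requirement the cluster checks is the wall-clock limit declared in a Slurm submit script. A minimal sketch (job name, resources, and body are illustrative, not CHTC-specific):

```shell
#!/bin/bash
# Minimal Slurm submit-script sketch (hypothetical names, not CHTC-specific).
# The scheduler compares the --time wall-clock limit against the start of the
# maintenance window: e.g. a 2-hour job submitted Thursday evening fits before
# an 8am Friday downtime, so it can still be allowed to start.
#SBATCH --job-name=short-job
#SBATCH --time=02:00:00    # hard wall-clock limit used for the downtime check
#SBATCH --ntasks=1
#SBATCH --mem=1G

msg="job body runs here"   # placeholder for the real workload
echo "$msg"
```

A job whose --time limit would carry it past the start of the downtime would simply remain queued until the cluster returns.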

The HTC System will re-queue and re-match any jobs evicted while running on affected HTC execute capacity (about 10% of the total HTC capacity). All jobs in queue on the affected submit servers will be interrupted, as well, but will remain in queue to re-match once the submit servers are booted at the conclusion of the downtime.


As always, we appreciate your patience and are happy to answer questions sent to chtc@xxxxxxxxxxx. Unplanned downtimes aren’t exactly a happy start to the new year, and we’re doing what we can to provide notice and minimize impacts.


Regards,

Your CHTC Team
