HPC cluster back up; certain HTC execute servers down

Date:	Wed, 28 Apr 2021 11:53:41 -0500
From:	chtc-users@xxxxxxxxxxx
Subject:	HPC cluster back up; certain HTC execute servers down

Greetings CHTC users,

We have two quick announcements for your Wednesday morning.

For high performance computing (HPC) cluster users:

Yesterday's cluster maintenance took a bit longer than expected, but was completed successfully! As of 7:30 last night, the cluster was back to usual operation and jobs should be running again.

For high throughput computing (HTC) system users:

Due to a cooling issue in one of our server rooms, a subset of execute servers in our HTC system has been down since last night.
Impact to users: jobs that were running on the impacted servers were interrupted, but they stayed in the queue and will be automatically re-run. The loss of this server room reduces our overall capacity somewhat, so you may see fewer jobs running in general.
Users who use SQUID for file transfer should check for any held jobs with a message like "Error: Aborted due to lack of progress using http_proxy=http://squid-cs-b240.chtc.wisc.edu:3128". These jobs can be safely released.
There are no significant changes to the overall operation of the HTC system; users should continue to submit jobs as normal.
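
For SQUID users who want to pick out the affected held jobs, here is a minimal Python sketch, not an official CHTC tool. It assumes you have already collected (job ID, hold reason) pairs yourself, for example by reading the output of `condor_q -hold`. The function name, the job IDs, and the second hold reason are made up for illustration, and the marker text is taken from the message quoted above.

```python
# Marker text copied from the hold message quoted in this announcement.
SQUID_MARKER = "Aborted due to lack of progress using http_proxy="


def squid_held_jobs(held):
    """Return the IDs of held jobs whose hold reason contains the SQUID marker.

    `held` is an iterable of (job_id, hold_reason) pairs. How the pairs are
    collected is left to you; this helper only does the filtering.
    """
    return [job_id for job_id, reason in held if SQUID_MARKER in reason]


# Hypothetical sample data: the job IDs and the second reason are invented.
example = [
    ("1234.0", "Error: Aborted due to lack of progress using "
               "http_proxy=http://squid-cs-b240.chtc.wisc.edu:3128"),
    ("1234.1", "Some unrelated hold reason"),
]

# Print the release command for each match instead of running it,
# so you can review the list first.
for job_id in squid_held_jobs(example):
    print(f"condor_release {job_id}")
```

Printing the commands rather than executing them is deliberate: it lets you confirm that only SQUID-related holds are selected before releasing anything.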

Thanks for your patience with the many emails this week. Some of this was planned, but we have obviously experienced some unexpected issues as well. As always, contact us at chtc@xxxxxxxxxxx with any questions or concerns.

Cheers,
Your CHTC team
