HPC cluster back up; certain HTC execute servers down


Date: Wed, 28 Apr 2021 11:53:41 -0500
From: chtc-users@xxxxxxxxxxx
Subject: HPC cluster back up; certain HTC execute servers down
Greetings CHTC users,

We have two quick announcements for your Wednesday morning.
  • For high performance computing (HPC) cluster users:
    • Yesterday's cluster maintenance took a bit longer than expected, but was completed successfully! As of 7:30 last night, the cluster was back to usual operation and jobs should be running again.
  • For high throughput computing (HTC) system users:
    • Due to a cooling issue in one of our server rooms, a subset of execute servers in our HTC system has been down since last night.
    • Impact to users: jobs that were running on the impacted servers were interrupted but stayed in the queue and will be automatically re-run. The loss of this server room diminishes our overall capacity somewhat, so you may see fewer jobs running in general.
    • Users who use SQUID for file transfer should check for any jobs held with a message like "Error: Aborted due to lack of progress using http_proxy=http://squid-cs-b240.chtc.wisc.edu:3128"; these jobs can be safely released.
    • There are no significant changes to the overall operation of the HTC system; users should continue to submit jobs as normal.
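
For SQUID users checking their held jobs, here is a minimal sketch of how the affected jobs could be picked out. It assumes you have already collected job IDs and hold reasons yourself (for example, by reading the output of condor_q -hold). The helper names and the sample data are hypothetical, and the script only builds a condor_release command; it does not run it.

```python
# Hypothetical helpers; the job IDs and the second hold reason are made up.
SQUID_MARKER = "Aborted due to lack of progress using http_proxy=http://squid"


def squid_held_jobs(held):
    """Return IDs of held jobs whose hold reason contains the SQUID message."""
    return [job_id for job_id, reason in held if SQUID_MARKER in reason]


def release_command(job_ids):
    """Build (but do not run) a condor_release command for the given job IDs."""
    return ["condor_release", *job_ids] if job_ids else []


# Illustrative input: (job ID, hold reason) pairs gathered by the user.
held = [
    ("1234.0", "Error: Aborted due to lack of progress using "
               "http_proxy=http://squid-cs-b240.chtc.wisc.edu:3128"),
    ("1234.1", "Job exceeded its memory request"),  # unrelated, made-up reason
]

ids = squid_held_jobs(held)
print(release_command(ids))
```

Only jobs whose hold reason matches the SQUID message are selected, so jobs held for other reasons are left for you to review separately.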
Thanks for your patience with the many emails this week. Some of this work was planned, but we have obviously experienced some unexpected issues as well. As always, contact us at chtc@xxxxxxxxxxx with any questions or concerns.

Cheers,
Your CHTC team