Ongoing unplanned outage of the HTC /staging location, some GPU and researcher-owned execute servers

Date:	Fri, 12 Nov 2021 15:52:41 +0000
From:	chtc-users@xxxxxxxxxxx
Subject:	Ongoing unplanned outage of the HTC /staging location, some GPU and researcher-owned execute servers

Greetings,

-This message pertains only to CHTC’s HTC System, and not to the HPC Cluster.-

We are currently investigating an unplanned outage of the servers that make up the HTC system’s /staging location, as well as several GPU and researcher-owned execute servers in the same physical location.

We expect that jobs already running when the /staging location went down will have failed to read or write data within it, or may still be hanging. It also appears that jobs requiring servers with ‘HasCHTCStaging’ will continue to match to servers, but will fail or hang when attempting to access the /staging location.

We are actively working to address the issue, and to at least stop staging-dependent jobs from matching while we restore the /staging location. In the meantime, users may choose to hold jobs that depend on the /staging location, which stops running jobs and keeps idle jobs from matching until the issue has been resolved. Example commands below:

By your username (replace with your own): condor_hold lmichael

By job cluster: condor_hold 1234567

By individual job ID: condor_hold 1234567.0
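
Held jobs stay in the queue until they are released. HTCondor's condor_release command accepts the same kinds of arguments as condor_hold (username, job cluster, or individual job ID), and condor_q -hold lists the jobs currently on hold. As a rough sketch only: the staging_cmd function and the RUN variable below are hypothetical helpers written for this illustration, not CHTC or HTCondor tools. By default the function only prints the command it would run, so you can check it before acting.

```shell
#!/bin/sh
# Sketch only: staging_cmd is a hypothetical helper, not a CHTC or HTCondor tool.
# It prints the condor_hold / condor_release command it would run.
# Set RUN=1 in the environment to actually execute the command.
staging_cmd() {
    action="$1"   # "hold" or "release"
    target="$2"   # username, job cluster, or individual job ID

    # Accept only the two actions this sketch is meant for.
    case "$action" in
        hold|release) ;;
        *) echo "usage: staging_cmd hold|release TARGET" >&2; return 2 ;;
    esac

    # Refuse to build a command with no target.
    if [ -z "$target" ]; then
        echo "usage: staging_cmd hold|release TARGET" >&2
        return 2
    fi

    if [ "${RUN:-0}" = "1" ]; then
        "condor_${action}" "$target"        # really run condor_hold or condor_release
    else
        echo "condor_${action} ${target}"   # dry run: show the command only
    fi
}

# Example values reused from the commands above; replace them with your own.
staging_cmd hold 1234567        # prints: condor_hold 1234567
staging_cmd release 1234567.0   # prints: condor_release 1234567.0
```

Releasing jobs before the /staging location is confirmed working would likely cause them to fail or hang again, so it is probably best to wait for the follow-up message from CHTC staff.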

CHTC staff will provide an update as soon as we have one. As usual, please send any questions to chtc@xxxxxxxxxxx.

Thank you,

Your CHTC Team

Care of:

Lauren Michael - Research Computing Facilitator, Center for High Throughput Computing, University of Wisconsin - Madison

Research Facilitation Lead, Open Science Grid; co-PI, PATh; co-PI, CaRCC

lmichael@xxxxxxxx, go.wisc.edu/lmcal, Discovery 2262, (608)316-4430, she/her
