Ongoing unplanned outage of the HTC /staging location, some GPU and researcher-owned execute servers


Date: Fri, 12 Nov 2021 15:52:41 +0000
From: chtc-users@xxxxxxxxxxx
Subject: Ongoing unplanned outage of the HTC /staging location, some GPU and researcher-owned execute servers

Greetings,

 

-This message pertains only to CHTC’s HTC System, and not to the HPC Cluster.-

 

We are currently investigating an unplanned outage of the servers that make up the HTC system’s /staging location, as well as several GPU and researcher-owned execute servers in the same physical location.

 

We expect that jobs already running when the /staging location went down will have failed to read or write data within it, or may still be hanging. It also appears that jobs requiring servers with ‘HasCHTCStaging’ will continue to match to servers, but will fail or hang when attempting to access the /staging location.

 

We are actively working to address the issue, and to at least stop staging-dependent jobs from matching while we restore the /staging location. In the meantime, users may choose to hold jobs that depend on the /staging location, which stops running jobs and keeps idle jobs from matching until the issue has been resolved. Example commands below:

 

By your username (replace with your own):  condor_hold lmichael

 

By job cluster:  condor_hold 1234567

 

By individual job ID:  condor_hold 1234567.0
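
Once the issue has been resolved, held jobs can be returned to the queue with condor_release, which accepts the same kinds of arguments as condor_hold. The following is a minimal sketch rather than an official procedure: the username and job IDs are the same placeholders as in the hold examples, and the run helper and DRY_RUN variable are hypothetical additions that only print each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only. DRY_RUN defaults to 1, so commands are printed, not executed.
# Set DRY_RUN=0 to actually run them once /staging is available again.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List your held jobs along with the reason each one is held
# (replace lmichael with your own username).
run condor_q lmichael -hold

# Release by username, by job cluster, or by individual job ID.
run condor_release lmichael
run condor_release 1234567
run condor_release 1234567.0
```

Releasing before the /staging location is restored would let the jobs match again and fail or hang, so the dry-run default is deliberate.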

 

CHTC staff will provide an update as soon as we have one. As usual, please send any questions to chtc@xxxxxxxxxxx.

 

Thank you,

Your CHTC Team

 

Care of:

Lauren Michael - Research Computing Facilitator, Center for High Throughput Computing, University of Wisconsin - Madison 

Research Facilitation Lead, Open Science Grid; co-PI, PATh; co-PI, CaRCC

lmichael@xxxxxxxx, go.wisc.edu/lmcal, Discovery 2262, (608)316-4430, she/her

 
