All HTC services restored


Date: Fri, 12 Nov 2021 15:03:27 -0600
From: chtc-users@xxxxxxxxxxx
Subject: All HTC services restored

Greetings CHTC users,


-This message pertains only to CHTC's HTC System, and not to the HPC Cluster.-

All HTC system services are back up after this morning's unexpected outage of certain services (see details from our previous email below).


If your jobs use files in /staging, we strongly encourage you to check on submitted jobs in your /staging folder to confirm that the interrupted jobs restarted cleanly.


Email us with any questions at chtc@xxxxxxxxxxx.


Have a great weekend!


Your CHTC team



---------- Forwarded message ---------
From: chtc-users--- via CHTC-users <chtc-users@xxxxxxxxxxx>
Date: Fri, Nov 12, 2021 at 9:54 AM
Subject: Ongoing unplanned outage of the HTC /staging location, some GPU and researcher-owned execute servers
To: CHTC Users <chtc-users@xxxxxxxxxxx>
Cc: <chtc-users@xxxxxxxxxxx>


Greetings,


-This message pertains only to CHTC's HTC System, and not to the HPC Cluster.-


We are currently investigating an unplanned outage of the servers that make up the HTC system's /staging location, as well as several GPU and researcher-owned execute servers in the same physical location.


We expect that jobs already running when the /staging location went down will have failed to read or write data within it, or may still be hanging. It also appears that jobs requiring servers with "HasCHTCStaging" will continue to match to servers, but will fail or hang when attempting to access the /staging location.


We are actively working to address the issue and, at a minimum, to stop staging-dependent jobs from matching while we restore the /staging location. In the meantime, users may choose to hold jobs that depend on the /staging location to stop them from running and/or keep them from matching until the issue has been resolved. Example commands below:


By your username (replace with your own): condor_hold lmichael

By job cluster: condor_hold 1234567

By individual job ID: condor_hold 1234567.0
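
As a rough illustration of the three forms above, the sketch below prints the command for a given target instead of running it, so it can be reviewed first. The helper names target_kind and condor_cmd are hypothetical and not part of HTCondor or any CHTC tooling; the sketch also assumes that condor_release accepts the same three target forms as condor_hold, for use once /staging is back.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not provided by CHTC or HTCondor).
# It only prints commands; nothing is submitted to the queue.

# Classify a target as "user", "cluster" (digits only), or "job" (cluster.proc).
target_kind() {
    case "$1" in
        "")            echo "invalid" ;;
        *[!0-9.]*)     echo "user" ;;      # anything besides digits and dots
        *.*.*|.*|*.)   echo "invalid" ;;   # malformed numeric ID
        *.*)           echo "job" ;;
        *)             echo "cluster" ;;
    esac
}

# Print the condor_hold or condor_release command for the target.
condor_cmd() {
    action="$1"   # "hold" or "release"
    target="$2"
    case "$action" in
        hold|release) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    if [ "$(target_kind "$target")" = "invalid" ]; then
        echo "invalid target: $target" >&2
        return 1
    fi
    echo "condor_$action $target"
}

condor_cmd hold lmichael       # all jobs for a username (replace with your own)
condor_cmd hold 1234567        # every job in a cluster
condor_cmd hold 1234567.0      # a single job
```

After reviewing the printed command, you would run it yourself. Held jobs stay in the queue until they are released, and condor_q -hold should list them along with the hold reason.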


CHTC staff will provide an update as soon as we have one. As usual, please send any questions to chtc@xxxxxxxxxxx.


Thank you,

Your CHTC Team

_______________________________________________
CHTC-users mailing list
CHTC-users@xxxxxxxxxxx
To unsubscribe send an email to:
chtc@xxxxxxxxxxx