Greetings,
-This message pertains only to CHTC’s HTC System, and not to the HPC Cluster.-
We are currently investigating an unplanned outage of the servers that make up the HTC system’s /staging location, as well as several GPU and researcher-owned execute servers in the same physical location.
We expect that jobs already running when the /staging location went down will have failed to read or write data within it, or may still be hanging. It also appears that jobs requiring servers with ‘HasCHTCStaging’ will continue to match to servers, but will fail or hang when attempting to access the /staging location.
We are actively working to address the issue, and to at least stop staging-dependent jobs from matching while we restore the /staging location. In the meantime, users may choose to hold jobs that depend on the /staging location, either to stop them from running or to keep them from matching until the issue has been resolved. Example commands below:
By your username (replace with your own): condor_hold lmichael
By job cluster: condor_hold 1234567
By individual job ID: condor_hold 1234567.0
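For anyone who wants to see the full hold-and-release cycle in one place, here is a minimal dry-run sketch. It only prints commands and does not run them. The function name print_hold_plan is hypothetical, and the condor_q -hold and condor_release steps are not part of the instructions above; they are an assumption about standard HTCondor tooling, so please wait for the all-clear from CHTC staff before releasing anything.

```shell
#!/bin/sh
# Dry-run sketch: prints HTCondor commands instead of executing them.
# print_hold_plan is a hypothetical helper, not a CHTC-provided tool.
# Its argument may be a username, a job cluster, or an individual job ID.
print_hold_plan() {
    target="$1"
    echo "condor_hold $target"      # hold the matching jobs now
    echo "condor_q -hold $target"   # assumption: list held jobs and hold reasons
    echo "condor_release $target"   # assumption: release once /staging is restored
}

print_hold_plan 1234567
```

Copy the printed lines into your terminal one at a time, substituting your own username or job ID for the example value.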
CHTC staff will provide an update as soon as we have one. As usual, please send any questions to chtc@xxxxxxxxxxx.
Thank you,
Your CHTC Team
Care of:
Lauren Michael - Research Computing Facilitator, Center for High Throughput Computing, University of Wisconsin - Madison
Research Facilitation Lead, Open Science Grid; co-PI, PATh; co-PI, CaRCC
lmichael@xxxxxxxx, go.wisc.edu/lmcal, Discovery 2262, (608)316-4430, she/her