Hello CHTC users,
This message is for users of our HPC cluster (logging into spark-login.chtc.wisc.edu).
A combination of issues is causing the shared filesystem (backing /home, /scratch, /software) to disconnect unexpectedly from machines in the cluster. As a user, you may see a "permission denied" error when attempting to access files in a disconnected directory.
- Starting last Friday, a piece of network hardware began behaving incorrectly, causing the disconnects to happen MUCH more frequently. There were multiple disconnects on Friday, the symptoms persisted through the weekend into this week, and there have been multiple disconnects again today.
- There are likely additional system issues behind the disconnects we saw earlier in the summer (about once a month).
In the meantime, we are trying to mitigate the symptoms as best we can. Users can try reducing the runtime of their jobs to minimize the chance that a disconnect occurs while the job is running.
Best,
The CHTC Facilitation Team