Issues with HPC cluster file systems


Date: Tue, 27 Aug 2024 22:04:55 +0000
From: chtc-users@xxxxxxxxxxx
Subject: Issues with HPC cluster file systems

Hello CHTC users,

This message is for users of our HPC cluster (logging into spark-login.chtc.wisc.edu).

A combination of issues is causing the shared filesystem (backing /home, /scratch, /software) to disconnect unexpectedly from machines in the cluster. As a user, you may see a "permission denied" error when attempting to access files in a disconnected directory.

  • Starting last Friday, a piece of network hardware began malfunctioning, causing the disconnects to happen MUCH more frequently: there were multiple disconnects on Friday, the symptoms persisted through the weekend into this week, and there were multiple disconnects again today.
    • We are working with the vendor to fix the network hardware, which should address the current high rate of disconnects. 
  • There are likely additional system issues behind the less frequent disconnects we saw earlier in the summer (about once a month).
    • An upcoming downtime is planned to address the cause of the infrequent disconnects.

In the meantime, we are trying to mitigate the symptoms as best we can. Users can try reducing the runtime of their jobs to lower the chance that a disconnect occurs while a job is running.

We are also providing updates regarding the incident on our status page at https://status.chtc.wisc.edu/.

Questions and concerns can be emailed to chtc@xxxxxxxxxxx.

Best, 
The CHTC Facilitation Team

