Resending: HPC Cluster Downtime March 13-16; ALL HPC CLUSTER USER DATA WILL BE DELETED


Date: Fri, 16 Feb 2018 10:34:41 -0600
From: chtc-users@xxxxxxxxxxx
Subject: Resending: HPC Cluster Downtime March 13-16; ALL HPC CLUSTER USER DATA WILL BE DELETED
Apologies to those who are receiving this information twice (see below), but it has come to our attention that Office 365 may have sent it to spam for some of you.

Please add chtc-users@xxxxxxxxxxx to your address book.

We've also posted the below information to our User News page and added notices to the HPC Cluster Basic Use Guide.

Thank you,
Your CHTC Team

On Wed, Feb 14, 2018 at 1:07 PM, <chtc-users@xxxxxxxxxxx> wrote:
Greetings HPC Cluster Users,
(those who only use the HTC System can ignore the below)

The HPC Cluster will be taken down on March 13 for a major upgrade of the filesystem.
We will be rebuilding the /home location with a new version of the filesystem, which will come with user-level quota limits on total data and file counts. These changes are intended to improve performance and address bugs in the filesystem software. See our prior email below, in which we first announced plans for the downtime.

THE FILESYSTEM CONTENTS MUST BE COMPLETELY DELETED FOR THE REBUILD AND CANNOT BE SAVED FOR YOU BY CHTC.
  1. TAKE ACTION NOW to transfer ALL of your data and software within /home to a non-CHTC location (a single-transfer example is sketched just after this list). Users who wait until the last minute risk losing data when we clear and rebuild the filesystem, especially if they have not been backing up to a non-CHTC location all along. We cannot postpone the downtime for such circumstances. See also #2-3.
  2. CHTC is not able to back up or otherwise reinstate ANY user data from the current filesystem and is not responsible for loss of user data when we have to delete it for the rebuild process.
  3. As a reminder, you should have NO DATA on the HPC Cluster that you have not already backed up elsewhere and that you are not ACTIVELY running jobs with (including software). This expected practice has been in CHTC's stated policies since the HPC Cluster was first introduced in 2013, and following it will let you keep working right up until the downtime, needing only to remove software and data from your most recently completed jobs in the days beforehand.
  4. Do not run more than one data transfer process through the head nodes (aci-service-1 or aci-service-2) at a time. Too many simultaneous transfer or deletion ('rm') processes create network and filesystem performance issues for all users. See also #1.
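
If it helps, below is a minimal sketch (in Python; not a CHTC-provided tool) of driving a single rsync copy of your home directory to a non-CHTC destination, in keeping with #4 above. The username, destination host, and path are placeholders you would replace with your own.

    #!/usr/bin/env python3
    # Hypothetical example: run ONE rsync transfer of your HPC home directory
    # to a non-CHTC location. The source and destination below are placeholders.
    import subprocess
    import sys

    SOURCE = "/home/your_username/"                       # your HPC home directory
    DEST = "you@your-department-server:/path/to/backup/"  # example non-CHTC destination

    # rsync -a preserves permissions/timestamps, -v lists files as they copy;
    # rsync only re-sends changed files, so re-running after an interruption is safe.
    result = subprocess.run(["rsync", "-av", SOURCE, DEST])
    sys.exit(result.returncode)

Running rsync (or scp) directly from the command line works just as well; the point is to keep it to a single transfer process at a time.
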
USERS WILL BE ABLE TO REINSTATE THEIR DATA (WITH NEW QUOTAS) AFTER THE DOWNTIME.
  • The new filesystem build will include an initial per-user quota of 100 GB of space and 1000 files/directories (combined). Researchers needing more than that for concurrently-running jobs and/or software files will need to consult with a Research Computing Facilitator via chtc@xxxxxxxxxxx after the downtime, to ensure proper data practices. (A rough script for checking your usage against these limits is sketched after this list.)
  • All CHTC-supported software modules (compilers, MPI versions, and licensed software) will be preserved and reinstated for identical use after the downtime.
  • Software in your home directory that you've copied off of the cluster before the downtime can be reinstated within the same directory location (/home/username/etc).
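
To gauge how your current data compares to the new limits, here is an illustrative sketch (our assumption of how you might check, not an official quota tool); it treats 100 GB as 100 GiB and counts files and directories together:

    #!/usr/bin/env python3
    # Rough usage check against the announced initial quota of 100 GB and
    # 1000 files/directories. Illustrative script only, not a CHTC tool.
    import os

    HOME = os.path.expanduser("~")
    QUOTA_BYTES = 100 * 1024**3   # 100 GB (treated here as GiB)
    QUOTA_COUNT = 1000            # combined file + directory count

    total_bytes = 0
    total_count = 0
    for root, dirs, files in os.walk(HOME):
        total_count += len(dirs) + len(files)
        for name in files:
            path = os.path.join(root, name)
            if not os.path.islink(path):
                total_bytes += os.path.getsize(path)

    print(f"{HOME}: {total_bytes / 1024**3:.1f} GB in {total_count} files/directories")
    if total_bytes > QUOTA_BYTES or total_count > QUOTA_COUNT:
        print("Over the initial quota -- consult a facilitator via chtc@xxxxxxxxxxx")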

PLAN ACCORDINGLY FOR A DELAY IN YOUR COMPUTATIONAL WORK
We intend to complete the necessary work by March 16, but it may finish 1-2 days earlier or later than that; we'll announce any timeline changes as soon as we know them. Please plan accordingly for a delay in your computational work, including time after the downtime to re-establish your data and directory structures.

We appreciate your action and planning in support of our work to minimize interruptions for all users of the HPC Cluster; planned downtimes like this one are occasionally necessary to improve cluster components.


As always, please send any questions to chtc@xxxxxxxxxxx (rather than replying to this email list for all CHTC users).

Thank you!
Your CHTC Team, care of Lauren Michael

On Fri, Dec 15, 2017 at 12:08 PM, <chtc-users@xxxxxxxxxxx> wrote:
Greetings HPC Cluster Users,

(those who only use the HTC System can ignore the below)


We are writing for two reasons, see below:

1. Begin removing ALL data from the HPC Cluster
Users need to begin removing ALL data (from their /home/user space) on the HPC Cluster so that we can rebuild the HPC Cluster's filesystem during a downtime planned for late February or early March of 2018.
  • The entire HPC Cluster filesystem will need to be deleted for this downtime, and CHTC will not be able to keep any copies of user data to restore after the downtime.
  • We will follow up in the coming weeks to elaborate on the downtime, and how users can expect to copy data (and software) back to the HPC Cluster after the downtime.
  • TAKE ACTION NOW TO ENSURE THAT YOUR DATA EXISTS ELSEWHERE AND THAT ALL OLD DATA IS REMOVED FROM THE CLUSTER ASAP.
As a reminder of CHTC data policies, which all users are responsible for:
  • ONLY data that is being used or produced by actively-queued jobs should ever exist on the cluster.
  • Data from completed work should be copied to another non-CHTC project location accessible to you as soon as possible after jobs complete. Data left to accumulate reduces filesystem performance for you and all other users.
  • CHTC data locations are NOT backed up, so you should always keep copies of essential data (scripts, submit files, etc.) in alternative, non-CHTC locations where you keep other research project data, and copy it back to the HPC Cluster when you need it again.
In the coming weeks, we will contact specific users who are obviously violating the above policies by accumulating large amounts of data that have been left on the cluster for some time.
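
One way to spot such data yourself is sketched below (an illustrative Python example, not a required tool); it lists files in your home directory that have not been modified in the last 30 days, with the 30-day cutoff being an arbitrary choice on our part:

    #!/usr/bin/env python3
    # Illustrative only: list files not modified in the last 30 days, which are
    # likely candidates for data that should be copied elsewhere and removed.
    import os
    import time

    HOME = os.path.expanduser("~")
    CUTOFF = time.time() - 30 * 24 * 3600   # 30 days; adjust as you see fit

    for root, dirs, files in os.walk(HOME):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getmtime(path) < CUTOFF:
                    print(path)
            except OSError:
                pass   # broken symlink or file removed while scanning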

2. Reminder: Never run computational work on HPC Cluster head nodes
We have noticed a recent increase in the number of users running processes on the head nodes, which contributes to performance and filesystem issues/failures for ALL users.

As a reminder of CHTC policies, which all users are responsible for:
  • ALL computational work should ONLY be run within a SLURM-scheduled interactive or batch job session, and never on the head nodes (see the guard sketched below). This especially includes any scripts, software, or other processes that perform data manipulation/creation, and long-running scripts for data management (including cron tasks).
  • Only simple commands for file and directory management are appropriate to run on the head nodes (e.g. file transfers, compression/decompression of transferred data, directory creation, etc.).
  • CHTC staff will deactivate the login access of users who violate the above policies, as compute-intensive tasks almost always create issues for other users by slowing or crashing the head nodes. We may not be able to immediately notify users that their accounts have been deactivated.
Repeat offenders may be required to involve their faculty sponsors to reinstate their login access, or may lose all access to CHTC resources. If you think you truly need to run something on the head nodes that may violate the above policies, please don't hesitate to get in touch so that we can help explore practices that will not cause issues for other users.
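
For scripted workflows, one simple safeguard (a sketch under our assumptions, not something CHTC requires or provides) is to have your script refuse to start heavy work unless it is inside a SLURM job, which SLURM indicates by setting the SLURM_JOB_ID environment variable:

    #!/usr/bin/env python3
    # Minimal guard: exit immediately if not running inside a SLURM-scheduled job,
    # so accidentally launching this on a head node does no harm.
    import os
    import sys

    if "SLURM_JOB_ID" not in os.environ:
        sys.exit("Not inside a SLURM job -- submit this work with sbatch or srun "
                 "rather than running it on aci-service-1/aci-service-2.")

    print(f"Running inside SLURM job {os.environ['SLURM_JOB_ID']}")
    # ... computational work would go here ...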


Thank you, as always, for helping us to uphold and improve upon CHTC systems and their performance for all users. Please send any questions or concerns to chtc@xxxxxxxxxxx.

Regards,
Your CHTC Team

_______________________________________________
CHTC-users mailing list
CHTC-users@xxxxxxxxxxx
To unsubscribe send an email to:
chtc@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/chtc-users


