[Chtc-users] Change to HPC Cluster policy, to preserve filesystem performance


Date: Mon, 22 Jun 2015 18:16:28 -0400
From: chtc-users@xxxxxxxxxxx
Subject: [Chtc-users] Change to HPC Cluster policy, to preserve filesystem performance
Greetings, CHTC users,

The message below pertains to users of CHTC's HPC Cluster (accessed via the aci-service-1 headnode) and does not affect users of our HTC System (where users submit jobs via "condor").


We are aware of poor filesystem performance on the HPC Cluster over the past few months, and have determined that it stems both from user behavior and from a need to update the filesystem software to better-performing features. We have a plan to update and upgrade the filesystem software within the next few months, after extensive testing, but additional measures are needed in the meantime to make sure the cluster keeps operating well for everyone who needs it.


Therefore, we're taking the following steps:

1. Asking ALL HPC Cluster users to remove AS MUCH DATA AS POSSIBLE, AS SOON AS POSSIBLE
We will also be emailing researchers using more than 1 TB of total disk space and offering to remove data FOR YOU.
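
To see how much space you're using, a quick check like the following may help (a sketch using standard Linux tools; the home-directory path is an example, so point it at your actual directories on the cluster):

    # Total size of your home directory:
    du -sh ~/
    # Largest subdirectories first, to find candidates for removal:
    du -h --max-depth=1 ~/ | sort -rh | head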


2. We are implementing a limit of 12 jobs that any user can have running.
While you'll be able to submit multiple jobs to the queue, only 12 of them will run at the same time, counting both interactive and non-interactive jobs, and the existing limit of 400 total occupied cores per user still applies. This change in policy will help to minimize the negative impact of simultaneous output "writes" by so many jobs (a major contributor to the filesystem issues), without reducing the usefulness of the cluster for jobs requiring Infiniband to access many more CPUs on a per-job basis. The policy is also subject to change, in case the reduction to 12 running jobs is not enough.
This policy will take effect sometime tomorrow, June 23. In the meantime, some users with large batches of smaller jobs have been temporarily given a 100-core running limit, just to keep the filesystem in usable condition, until we can fully implement the limit on the number of running jobs.
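
If you'd like to check how many of your jobs are currently running, something like the following should work (a sketch assuming the cluster's SLURM scheduler; substitute your username if $USER is not set in your shell):

    # Count your currently running jobs:
    squeue --noheader -u $USER -t RUNNING | wc -l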

Alternatively, we would like to assist users of multi-core, single-server jobs in transitioning to our HTC System,
which has much more single-server capacity, and which we are working to better optimize for multi-core work. Currently, users can get up to dozens of multi-core jobs running on our HTC System without adverse effects for other users. Furthermore, large batches of single-server jobs are really a form of high-throughput computing, and do not need the Infiniband networking of the HPC Cluster.
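
For reference, a multi-core, single-server job on the HTC System is requested with "request_cpus" in the HTCondor submit file. Below is a minimal sketch; the executable, resource amounts, and file names are placeholders to adapt to your own work:

    # example.sub -- minimal HTCondor submit file for a multi-core job
    universe = vanilla
    executable = my_program
    request_cpus = 8
    request_memory = 16GB
    request_disk = 10GB
    log = job.log
    output = job.out
    error = job.err
    queue

Submitting is then a matter of running "condor_submit example.sub" on an HTC submit node.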


3. The filesystem will require a complete upgrade in a few months,
after significant design and testing of filesystem features that will perform better under the load of our growing cluster. The cluster will be down for several days during the transition, at some point within the next few months, and we'll give plenty of notice of the exact dates when the time comes.


Please contact us if you'd like to meet to discuss running single-server multi-core work on the HTC System!


Regards,
Your CHTC Team