[Chtc-users] HPC Cluster and HTC Pool restored after weekend power outages


Date: Mon, 28 Sep 2015 16:43:28 -0500
From: chtc-users@xxxxxxxxxxx
Subject: [Chtc-users] HPC Cluster and HTC Pool restored after weekend power outages
Greetings CHTC Users,

We appreciate your patience and assure you that we're working as quickly as possible to restore all functionality to CHTC compute systems. The power failure on Saturday affected multiple campus buildings and many campus services, including ALL of CHTC's compute servers. Please see further details below.


THE HPC CLUSTER IS RUNNING!
Important notes as you begin using the cluster again:
  • All jobs that were in the queue as of the power failure on Saturday will need to be resubmitted and re-run.
  • Usage-based user priorities are still in effect. (Read more here.)

The HTC Pool and CHTC submit servers are restored!
Regarding submit servers:
  • submit-3, submit-4, and submit-5 are restored and ready for new jobs.
  • Submit servers that are not administered or hosted by CHTC should be unaffected, other than interruptions to jobs that were running in the CHTC Pool when the power failure occurred on Saturday. Such jobs should have been automatically re-queued on your submit server and are likely already running again.
Regarding the recovery of jobs:
  • CHTC's HTCondor pool of execute servers has been restored since Sunday, so jobs in the queue should be automatically re-run, and you should not need to re-queue jobs. Let us know if any of your jobs have run and failed in a way that you don't expect.
  • Jobs requiring the HTC Gluster may experience initial delays in running until Gluster availability has been fully restored across the pool. Continue to include "HasGluster" as a job requirement to make sure your jobs don't run on execute servers that don't have Gluster access restored yet.
  • Jobs identified with "WantGlidein" will again be able to run on the OSG, now that we've restored local OSG capabilities.
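
For anyone unsure where the two attributes above belong in a submit file, here is a rough sketch in Python that prints a minimal HTCondor submit description. The attribute names "HasGluster" and "WantGlidein" come from this message; the exact expression forms shown (`requirements = (HasGluster == true)` and `+WantGlidein = true`), the helper name `submit_description`, and the file names are assumptions for illustration, so please check them against your existing submit files and the CHTC guides before relying on them.

```python
def submit_description(executable, needs_gluster=False, want_glidein=False):
    """Return the text of a minimal HTCondor submit file (illustrative only)."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "log = job.log",
        "output = job.out",
        "error = job.err",
    ]
    if needs_gluster:
        # Assumed form: only match execute servers that advertise Gluster access.
        lines.append("requirements = (HasGluster == true)")
    if want_glidein:
        # Assumed form: a leading "+" adds a custom attribute to the job ClassAd.
        lines.append("+WantGlidein = true")
    lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical executable name, used only for this example.
    print(submit_description("run_analysis.sh", needs_gluster=True), end="")
```

If your submit file already has a requirements line, the Gluster condition would need to be combined with it (for example with `&&`) rather than added as a second line.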

As always, please let us know if you have any questions or difficulties, especially if you see behavior that is inconsistent with what we have described above. You can always email us at chtc@xxxxxxxxxxx, and should expect a first response within a few business hours.


Thank you,
Your CHTC Team