Re: [Chtc-users] submit-1.chtc.wisc.edu is back UP


Date: Mon, 9 Jul 2012 17:16:08 +0000
From: "Gore, Brooklin" <BGore@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [Chtc-users] submit-1.chtc.wisc.edu is back UP
We are happy to report that submit-1 is back online.

We were able to recover all data except for a single 2k block. Unfortunately, we
don't know where that block was allocated: it could have been part of one of
your files, or it could have been unallocated space. To be sure, please check
the consistency and accuracy of your files.

Because Condor kept running jobs for a while after the filesystem reverted
to read-only mode last week, you should verify the status of any runs that
were in progress at that time, and the validity of any related files.

UW CHTC Team

On 7/5/12 5:46 PM, "Gore, Brooklin" <BGore@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>After working on this issue today, it looks like we may be able to recover
>much of the data in /home.
>
>Our plan is to:
>
>1. Copy /home data to known-good storage.
>2. Get submit-1 back online (with empty home directories).
>3. Make the original data available.
>
>We expect to have submit-1 back in production (with empty home
>directories) on Monday, July 9.
>
>We hope to be able to provide access to pre-July 4th home directory data
>on Tuesday, July 10.
>
>If you have a critical need to submit jobs before Monday, please
>contact us to arrange access to an alternate submit node.
>
>We apologize for this inconvenience.
>UW CHTC Team
>
>On 7/5/12 11:43 AM, "Todd Tannenbaum" <tannenba@xxxxxxxxxxx> wrote:
>
>>
>>Hello,
>>
>>On July 4th the general-purpose CHTC submit node
>>    submit-1.chtc.wisc.edu
>>suffered a hard drive failure.  As this machine uses RAID (redundant
>>disk arrays), any one hard drive failure is not a problem.
>>Unfortunately, while submit-1 was in the process of automatically
>>recovering from the failed hard drive, a second hard drive failed.
>>The submit-1 machine was not designed with enough redundancy to tolerate
>>two hard drives failing within five minutes of each other.
>>
>>As a result, submit-1.chtc is currently unavailable and will likely
>>remain unavailable for the bulk of today. We are currently working on
>>it, and will send out further news as it develops this afternoon.  It is
>>too early to say how much data stored on submit-1 is salvageable.
>>
>>Groups that submit jobs from a server other than submit-1.chtc should
>>not be impacted.
>>
>>Thank you for your patience,
>>UW CHTC Staff
>>
>>_______________________________________________
>>Chtc-users mailing list
>>Chtc-users@xxxxxxxxxxx
>>To unsubscribe send an email to:
>>chtc@xxxxxxxxxxx
>>https://lists.cs.wisc.edu/mailman/listinfo/chtc-users
>>
>
