It Actually Paid Off

This past Sunday, I had one of those experiences that pretty much everyone dreads – I had a hard drive fail. However, it was also one of those days where all the prior preparations actually paid off. When I set up the archive disk on my server, I used a software RAID (level 1), so the data was duplicated across a second drive. Also, I had purchased a couple of other drives beforehand as “cold spares”. Even with all the preparations, it was still non-trivial to replace everything and get it running, so I thought I would document my experiences here.

The Setup

First, I’ll start with setting up the RAID in the first place. I basically set up my RAID using this guide, with the main difference that I wasn’t putting my entire setup on a RAID, just my archival space (that area I use to keep all of the stuff I’ve collected since 1996). I tend to avoid overly complex configuration (and anything that is complex I keep backed up), so I tend to believe everything else is replaceable.

At the end of the day, my configuration looks like this (abridged):

root@plaidscape /
# fdisk -l

Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Device    Boot Start        End    Blocks  Id System
/dev/sdb1       2048 1953525167 976761560  fd Linux raid autodetect

Disk /dev/sdc: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Device    Boot Start        End    Blocks  Id System
/dev/sdc1       2048 1953525167 976761560  fd Linux raid autodetect

Disk /dev/sda: 372.6 GiB, 400088457216 bytes, 781422768 sectors
Device    Boot     Start       End    Blocks  Id System
/dev/sda1 *           63     80324     40131  83 Linux
/dev/sda2          80325  16868249   8393962+ 82 Linux swap / Solaris
/dev/sda3       16868250 781417664 382274707+ 83 Linux

Disk /dev/md1: 931.5 GiB, 1000202649600 bytes, 1953520800 sectors

The Event

So, on Sunday, there I was, browsing my archive over samba, when I noticed a distinct lack of response. Where I would normally chalk this up to network lag or something else between my basement and my loft (typically my router), this seemed more notable so I logged in to my server just as my phone buzzed:

From: mdadm monitoring
To: greenethorne@gmail.com
Date: Sun, Aug 10, 2014 at 3:30 PM
Subject: Fail event on /dev/md1:plaidscape

This is an automatically generated mail message from mdadm
running on plaidscape

A Fail event had been detected on md device /dev/md1.

It could be related to component device /dev/sdc1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid1 sdb1[0] sdc1[1](F)
      976760400 blocks super 1.2 [2/1] [U_]

unused devices: <none>

When I took a look, I noticed that sdc was, in fact, nowhere to be found. I suspect the drive just died and that was that. Now I just had to recover. Since both drives were essentially identical, I felt compelled to act quickly so that I could hopefully avoid actually losing data if the second drive decided to do the same.
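The part of that email worth learning to read is the status line: [2/1] means two members are expected and one is active, and the underscore in [U_] marks the missing one. As a rough sketch (the function name degraded_arrays is made up for this example, and it only understands the simple layout shown above), a small filter can pick degraded arrays out of /proc/mdstat-style text:

```shell
# Print the name of each md array whose status line shows a missing member.
# Reads /proc/mdstat-style text on stdin, so it can be tried on a saved copy.
# "degraded_arrays" is a hypothetical helper name, not part of mdadm.
degraded_arrays() {
    awk '
        # Remember which array the following lines belong to.
        /^md[0-9]+ :/ { name = $1 }
        # A member map such as [U_] or [_U] contains an underscore for each missing disk.
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

A healthy mirror shows [UU] and produces no output, so the function could sit in a cron job alongside the mdadm email alerts rather than replace them.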

The Resolution (Part I)

At first, I assumed there were tricks I would need to learn to make this easy, so I didn’t rush into fixing it. I started by simply unmounting the drives and doing some research online. A couple of stackoverflow questions made me wonder if this really was going to be easy, but I found a helpful manual to get me through all the details.

There were still a couple of challenges, though. The first was figuring out which drive had actually failed. Since sdb (the one good drive left in the RAID) and sdc (the failed drive) were identical models, I needed to figure out which drive was which so that I could pull the correct one out (to make room for the new drive). After googling around, I found a utility called hdparm, which was able to tell me the serial number of the drive by reading the drive metadata: very handy! Since sdc was effectively gone, I couldn’t query that one, but I could get the serial number from sdb and (by process of elimination) pull the other one.

root@plaidscape /
# hdparm -I /dev/sdb

/dev/sdb:

ATA device, with non-removable media
        Model Number:       ST31000528AS
        Serial Number:      6VPBN12Y
        Firmware Revision:  CC3E
        Transport:          Serial
...more...
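If all you want is the serial number to compare against the sticker on the drive, the rest of that output is noise. A minimal sketch, assuming the "Serial Number:" line looks like the one above (serial_from_hdparm is a hypothetical helper name):

```shell
# Extract the "Serial Number" field from `hdparm -I` output read on stdin.
# "serial_from_hdparm" is a made-up helper name for this example.
serial_from_hdparm() {
    awk -F: '
        /Serial Number/ {
            # Strip the padding hdparm puts around the value.
            gsub(/^[ \t]+|[ \t]+$/, "", $2)
            print $2
            exit
        }
    '
}

# Usage (needs root, like hdparm itself):
#   hdparm -I /dev/sdb | serial_from_hdparm
```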

With that out of the way, I put the new drive in (and accidentally played musical SATA cables which changed all my drive designations), synced over the partition table, and added it to the RAID. Pretty soon, I saw the sync start to happen in /proc/mdstat:
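For reference, those last two steps usually boil down to a pair of commands. The sketch below only prints them rather than running them, because pointing either one at the wrong disk is destructive. It assumes MBR partition tables (as in the fdisk listing earlier) and a single RAID partition per disk; print_rebuild_plan and its arguments are hypothetical names for this example, not necessarily the exact commands used here.

```shell
# Print (do not run) the commands for cloning a partition table onto a
# replacement disk and adding its first partition back into a mirror.
# "print_rebuild_plan" is a hypothetical helper; check every device name
# against your own system before running anything it prints.
print_rebuild_plan() {
    good=$1    # surviving disk
    new=$2     # blank replacement disk
    array=$3   # md device to repair
    printf '%s\n' \
        "sfdisk -d $good | sfdisk $new" \
        "mdadm --manage $array --add ${new}1" \
        "cat /proc/mdstat"
}

# Usage, with the device names as they ended up after the cable shuffle:
#   print_rebuild_plan /dev/sdc /dev/sdb /dev/md1
```

Printing first and pasting second is a deliberate choice: after drive letters move around, it gives you one more chance to notice that the "good" and "new" arguments are the wrong way round.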

root@plaidscape ~
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid1 sdb1[2] sdc1[0]
      976760400 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  4.7% (46865344/976760400) finish=121.2min speed=127847K/sec

unused devices: <none>

Being the careful person I am, I decided I would hang out and wait while it synced. It was a good thing too, because I wasn’t quite done. As the sync got close to 65%, it suddenly stopped. At first, I thought I might have just missed the last 35%, but /proc/mdstat told a different story:

root@plaidscape ~
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid1 sdb1[2](S) sdc1[0]
      976760400 blocks super 1.2 [2/1] [U_]

unused devices: <none>

It took some looking to figure out what (S) meant (hot spare, as it happens), but the short version of the story was that my sync had failed. /var/log/messages had some very alarming messages that I wasn’t really happy about:

Aug 10 18:12:28 plaidscape kernel: [ 6320.302877] ata4.00: configured for UDMA/133
Aug 10 18:12:28 plaidscape kernel: [ 6320.302901] sd 3:0:0:0: [sdc] Unhandled sense code
Aug 10 18:12:28 plaidscape kernel: [ 6320.302905] sd 3:0:0:0: [sdc]
Aug 10 18:12:28 plaidscape kernel: [ 6320.302907] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 10 18:12:28 plaidscape kernel: [ 6320.302911] sd 3:0:0:0: [sdc]
Aug 10 18:12:28 plaidscape kernel: [ 6320.302913] Sense Key : Medium Error [current] [descriptor]
Aug 10 18:12:28 plaidscape kernel: [ 6320.302917] Descriptor sense data with sense descriptors (in hex):
Aug 10 18:12:28 plaidscape kernel: [ 6320.302919]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Aug 10 18:12:28 plaidscape kernel: [ 6320.302931]         4c 6c 23 9d
Aug 10 18:12:28 plaidscape kernel: [ 6320.302937] sd 3:0:0:0: [sdc]
Aug 10 18:12:28 plaidscape kernel: [ 6320.302942] Add. Sense: Unrecovered read error - auto reallocate failed
Aug 10 18:12:28 plaidscape kernel: [ 6320.302945] sd 3:0:0:0: [sdc] CDB:
Aug 10 18:12:28 plaidscape kernel: [ 6320.302947] Read(10): 28 00 4c 6c 20 00 00 04 00 00
Aug 10 18:12:28 plaidscape kernel: [ 6320.303014] ata4: EH complete

So, my original drive also had issues (a couple of bad blocks, to be precise), which were causing the whole sync to fail. At least I had the data, but I needed a new solution for making sure that I could get at it.
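The log also hints at where on the disk the bad block sits. As a rough sketch, and assuming that the last four bytes of the sense data (4c 6c 23 9d) are the failing sector address and that the four bytes after "28 00" in the Read(10) line are the first sector of the request, plain shell arithmetic turns them into decimal. hex_bytes_to_dec is a helper made up for this example.

```shell
# Convert big-endian hex bytes, given as separate arguments, to a decimal number.
# "hex_bytes_to_dec" is a hypothetical helper name for this example.
hex_bytes_to_dec() {
    hex=$(printf '%s' "$*" | tr -d ' ')
    echo $(( 0x$hex ))
}

# Usage with the values from the log above:
#   hex_bytes_to_dec 4c 6c 23 9d                  # 1282155421 (reported bad sector)
#   hex_bytes_to_dec 4c 6c 20 00                  # 1282154496 (start of the read)
#   echo $(( 1282155421 * 100 / 1953525168 ))     # 65 (percent of the way across the disk)
```

Ignoring the small offset of the partition and the RAID metadata, sector 1282155421 out of 1953525168 is about 65% of the way across the disk, which lines up with where the sync gave up.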

The Resolution (Part II)

Here’s where I really felt lucky, and not just good at planning ahead – I had actually purchased 2 spare drives, so I just needed to set up a new RAID and copy all the data that I could off the old one. This took a bit more work, but it wasn’t too crazy. I just rsynced all the data over (I lost two files from my subversion revision history), verified all the checksums (I do this all the time at work, so it was really old hat), and called it a day.
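The checksum step is the one that tells you whether the copy can be trusted. A minimal sketch of that kind of check, assuming GNU coreutils' sha256sum and two hypothetical mount points (verify_copy is a made-up helper, not the exact procedure used here):

```shell
# Hash every regular file under SRC and check the same relative paths under DST.
# Prints any file that is missing or differs; returns non-zero if there is one.
# "verify_copy" is a hypothetical helper name for this example.
verify_copy() {
    src=$1
    dst=$2
    ( cd "$src" && find . -type f -exec sha256sum {} + ) |
        ( cd "$dst" && sha256sum --check --quiet - )
}

# Usage, with made-up mount points for the old and new arrays:
#   rsync -a /mnt/old-archive/ /mnt/new-archive/
#   verify_copy /mnt/old-archive /mnt/new-archive && echo "copy verified"
```

Keep in mind that this rereads every file on the source, so on a disk with bad blocks the unreadable files will fail here as well as during the copy.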

2 thoughts on “It Actually Paid Off”

  1. I should really back up my data. I don’t have a lot that is irreplaceable, but that which I do have, it would kill me to lose (I actually lost an almost-complete role-playing game I was writing the last time one of my computers died, and I’m still kicking myself about it today).

    How reliable are those portable hard drives? I’ve been toying with the idea of getting one so I could have my whole steam library installed at once (it’s funny, when I got this computer I thought “I’ll never use up 500 GB”).

    1. I also need to keep better backups – my current offline backup (containing my entire archive) is about a year old, but I have online copies of all my active documents in 4 places about a week old (RAID 1 in the server, a copy on my desktop, and a copy on my laptop synced weekly via subversion). Now that I have a spare drive, I should probably just make it an offline backup…

      I’m generally a fan of portable hard drives – they certainly enable better backups (the write-and-shelve capability means it’s a true offline backup). If you’re looking for something to install games on, however, I would probably discourage you a bit unless you have USB 3.0 capability, as the speed decrease is going to be very noticeable (300 MB/sec for older internal SATA drives and 133 MB/sec for EIDE vs 35 MB/sec over USB 2.0). USB 3.0 promises to be better at 500 MB/sec, which could be pretty interesting.
