RAID is not Redundant.

My year at Rocket Fuel has seen many unique system failures. One kind I want to highlight is the total RAID failure. I've talked before about how RAID is not a backup technology.

Tonight, we rebooted a machine that hung (presumably due to OOM or other funkiness), and it came back with the BIOS saying:

Foreign configuration(s) found on adapter
Our managed hosting support wasn't sure what to make of this, so we decided to build a new home (from backups) for the services on this now-dead machine. Dell won't be helping debug this until tomorrow.

This is one of many total data losses I've observed on RAID sets in recent months, all due to RAID failures. Thankfully, we have backups that get shipped to HDFS. We monitor those backups. We also have puppet and other automation to help move and rebuild services on a new host. We're equipped to handle this kind of failure.
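
The ship-and-check step doesn't need to be elaborate. Here's a small sketch of the idea; the paths, host names, and HDFS layout are made up for illustration, not a description of our actual pipeline. It copies a backup into HDFS and fails loudly if the file never shows up.

    #!/usr/bin/env python3
    # Hypothetical sketch only: paths, naming, and cluster layout are assumptions.
    import subprocess
    import sys
    import time

    def ship_to_hdfs(local_path, hdfs_dir):
        # Date-stamp the copy so old backups don't get overwritten.
        dest = "%s/%s-%s" % (hdfs_dir, time.strftime("%Y%m%d"), local_path.rsplit("/", 1)[-1])
        subprocess.check_call(["hdfs", "dfs", "-put", local_path, dest])
        # 'hdfs dfs -test -e' exits nonzero when the path is missing; treat that as failure.
        if subprocess.call(["hdfs", "dfs", "-test", "-e", dest]) != 0:
            sys.exit("backup never arrived in HDFS: " + dest)
        return dest

    if __name__ == "__main__":
        ship_to_hdfs("/var/backups/mysql-dump.tar.gz", "/backups/dbhost01")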

This leads me to a new conclusion: the 'R' in RAID is a lie. It is not redundant. Treating it that way leads you into the RAID-is-backup fallacy.

Wikipedia has this to say about Redundancy (engineering): "In engineering, redundancy is the duplication of critical components of a system with the intention of increasing reliability of the system, usually in the case of a backup or fail-safe."

Adding more parts (complexity) to a system doesn't often increase its reliability. Even accounting for the disk redundancy you get with mirroring or parity, you're still betting that the RAID card won't die, which it eventually will. Everything fails eventually (see MTBF), and MTTR on RAID failures is usually quite high, so weigh your risk.
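
To make that risk-weighing concrete, here's a toy calculation of steady-state availability (MTBF / (MTBF + MTTR)). The numbers are made up; the only point is that a long repair time hurts far more than the failure rate alone suggests.

    # Toy numbers, purely to illustrate the MTBF/MTTR trade-off described above.
    def availability(mtbf_hours, mttr_hours):
        # Steady-state availability = MTBF / (MTBF + MTTR)
        return mtbf_hours / float(mtbf_hours + mttr_hours)

    # A part that fails every ~2 years but takes an hour to swap:
    print(availability(17520, 1))    # ~0.99994
    # The same failure rate with a day-long rebuild-from-backup:
    print(availability(17520, 24))   # ~0.99863, roughly 24x the downtime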

Back to my conclusion that RAID is not redundant: RAID is not dead; I'm just done viewing it as a continuity-through-drive-failure technology. RAID has other benefits, though, beyond redundancy (when your card doesn't die).

RAID makes multiple drives present as a single drive device to the OS, right? Right. RAID lets you aggregate disk IO to achieve higher read/write rates than a single disk alone. You can aggregate disk space this way, too.
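
As a rough, idealized sketch of how the common levels trade space against streaming read bandwidth: the disk size and per-disk throughput below are invented numbers, and real results depend heavily on workload, stripe size, and the controller.

    # Idealized back-of-the-envelope numbers; not benchmarks.
    def raid_profile(level, disks, disk_tb=2.0, disk_mbps=150.0):
        if level == 0:    # striping: every spindle adds space and read bandwidth
            return disks * disk_tb, disks * disk_mbps
        if level == 1:    # mirroring: usable space of one disk, reads can fan out
            return disk_tb, disks * disk_mbps
        if level == 5:    # striping with one disk's worth of space spent on parity
            return (disks - 1) * disk_tb, (disks - 1) * disk_mbps
        raise ValueError("unhandled RAID level: %d" % level)

    for level in (0, 1, 5):
        tb, mbps = raid_profile(level, disks=4)
        print("RAID%d: %.0f TB usable, ~%.0f MB/s streaming reads" % (level, tb, mbps))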

It's almost 0100 now; I'd much rather be sleeping or playing TF2 than helping rebuild from backups.

RAID is not a backup technology.

I spent some time yesterday making backups of things I care about (code, the content here, etc.) in two remote places, in case anything should happen. Now both places copy down data for me every few days. Backups are easy to ignore, but critical when you lose data.
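
The copy-down jobs don't need to be fancy. Something like the sketch below, run from cron on each remote machine, is the general idea; the host names and paths are made up for illustration.

    #!/usr/bin/env python3
    # Hypothetical pull-backup sketch meant to run from cron on a remote machine
    # every few days. Hosts and paths are invented for illustration.
    import datetime
    import os
    import subprocess

    SOURCES = [
        "me@example.org:/home/me/code",
        "me@example.org:/var/www/site",
    ]
    DEST_ROOT = "/backups/example.org"

    # One dated directory per run, so a bad sync never clobbers the only good copy.
    dest = os.path.join(DEST_ROOT, datetime.date.today().isoformat())
    os.makedirs(dest, exist_ok=True)
    for source in SOURCES:
        subprocess.check_call(["rsync", "-a", source, dest + "/"])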

Today, I heard about journalspace going down because it lost data and didn't have backups. I don't use the service, so it doesn't affect me directly, but the failure they experienced makes for a great case study in false data security and backups. From the front page of journalspace.com:

Here is what happened: the server which held the journalspace data had two large drives in a RAID configuration. As data is written (such as saving an item to the database), it's automatically copied to both drives, as a backup mechanism.

The value of such a setup is that if one drive fails, the server keeps running, using the remaining drive. Since the remaining drive has a copy of the data on the other drive, the data is intact. The administrator simply replaces the drive that's gone bad, and the server is back to operating with two redundant drives.

RAID is not a backup solution. Depending on the configuration, RAID can get you better throughput and/or better data reliability. If you lose a drive in some RAID configurations, the system can continue working normally without that drive. Backups should copy data somewhere other than the machine hosting the original data. The page goes on:

So, after nearly six years, journalspace is no more.

After almost six years, nobody had a cron job that backed up data somewhere offsite (or a more complex backup system)? Ouch! My condolences to journalspace and its users on the loss.
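
The other half of the lesson is noticing when backups stop arriving. A minimal freshness check might look like the sketch below; the path and threshold are made up, and a real setup would wire this into whatever monitoring you already have.

    #!/usr/bin/env python3
    # Hypothetical monitoring sketch: complain if the newest file under the backup
    # directory is older than a few days. Path and threshold are invented.
    import os
    import sys
    import time

    BACKUP_DIR = "/backups/example.org"
    MAX_AGE_DAYS = 4

    newest = max(
        (os.path.getmtime(os.path.join(root, name))
         for root, _, files in os.walk(BACKUP_DIR)
         for name in files),
        default=0,
    )
    age_days = (time.time() - newest) / 86400.0
    if age_days > MAX_AGE_DAYS:
        sys.exit("newest backup is %.1f days old -- the copy job may be broken" % age_days)
    print("newest backup is %.1f days old" % age_days)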

Losing important data unexpectedly will sting badly if you don't have appropriate backups. The only thing to do is learn from this mistake and move on, accepting the consequences of the loss.

This isn't the first website I've heard of having to shut down because it permanently lost data. Learn from their mistakes: keep backups of your stuff!

Booting from SATA on ASUS K8N-DL.

So my new fancy computer is here. It turns out I originally bought the wrong form-factor motherboard because I had a silly moment.

Either way, I've now got the system running, but not without some serious battle scars.

Ubuntu installed happily (though partitioning/newfs was very slow). However, upon reboot, the BIOS clearly couldn't see the boot drive. My SATA drives are plugged into the on-board Silicon Image RAID controller with no RAID configurations set up.

Guessing, I told the RAID controller to create a one-disk concatenation with the disk I wanted to boot from. Voila, the BIOS now sees that disk and I can boot from it. Linux still finds the other two SATA drives when booting.

Sigh..

Also, when Ubuntu says "Computing the new partitions" it really means "I'm creating a new partition right now. Go get something to eat, I'm going to be here for a while." Large partitions, for some reason, take quite some time to create.