
Raid5 Failures HDD MTBF XRaid2 Raid6 Dual Redundancy

MrCyberdude
Tutor

Raid5 Failures HDD MTBF XRaid2 Raid6 Dual Redundancy

Why Raid5(Single Redundancy) is not enough and why Raid6(Dual Redundancy) is better.

I was reading this with interest some time ago: http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html. Afterwards I felt an overwhelming need for a backup of everything on my RAID5 array. Before reading this article I was not really concerned about what I perceived to be a low risk of a RAID5 rebuild failure. I did, of course, have backups of the most important data, but that was only around 25% of my storage, and not all of it was, shall we say, sorted.

In the end I decided I had no real choice but to run in X-RAID Dual Redundancy (RAID6) mode. In fact, these articles are probably the sole reason I went for the 6-disk ReadyNAS Pro and run it in X-RAID2 DR mode (aka RAID6 with expansion). I chose X-RAID2 over plain RAID6 mode because I believe it makes replacing disks easier and allows expansion over time; that in turn lets the ages of the drives vary, which may reduce the risk of all of them dying at around the same time.

My primary reason for this was that even if I did back up 100% of my data (unlikely), it would most likely be spread over a number of places and different media. TIME is my enemy... I cannot make it simpler than this... I do not want to have to make the time to recover all of my data... and my budget is, unfortunately, not unlimited!

These articles also give a real reason to back up and not rely on RAID arrays alone.

Related links.
http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162
http://storagemojo.com/2007/02/19/googles-disk-failure-experience/
http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/
http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/
http://www.tomshardware.com/news/RAID-5-Doomed-2009,6525.html

Unfortunately, most people are never going to read the full articles, so I will quote heavily below; I believe this may change the minds of more than a few people and may save their data.

==== Quoted From Links ====

Without mentioning companies:
that means today we ship more array-based FC & ATA disk capacity than EMC, HP, Sun and our OEM partner IBM, as listed here in StorageMojo’s open letter. That key statistic helps add unmatched credibility to our responses surrounding this issue and the specific points raised below.
1. Failure rates are several times higher than reported by drive companies
2. Actual MTBFs (or AFRs) of “enterprise” and “consumer” drives are pretty much the same
3. SMART is not a reliable predictor of drive failure
4. Drive failure rates rise steadily with age rather than staying flat through some n-year mark
5. Array disk failures are highly correlated, making RAID 5 two to four times less safe than assumed

When the vendor specs a 300,000 MTBF – common for consumer PATA and SATA drives – what they are saying is that for a large population of drives half the drives will fail in the first 300,000 hours of operation.

MTTF (mean time to failure): Manufacturer lies
About 100,000 disks are covered by this data, some for an entire lifetime of five years. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%.
We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems.
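
As an aside (my own arithmetic, not from the papers): the "nominal annual failure rate of at most 0.88%" quoted above is just the datasheet MTTF converted into a yearly rate, assuming a constant failure rate over the drive's life, which is exactly the assumption the field data calls into question. A minimal sketch of the conversion:

```python
# Rough sketch: nominal annual failure rate (AFR) implied by a datasheet MTTF,
# assuming a constant failure rate (the very assumption the field data disputes).

HOURS_PER_YEAR = 24 * 365  # 8760

def nominal_afr(mttf_hours: float) -> float:
    """Fraction of a large drive population expected to fail per year."""
    return HOURS_PER_YEAR / mttf_hours

for mttf in (300_000, 1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} h  ->  nominal AFR {nominal_afr(mttf):.2%}")

# MTTF   300,000 h  ->  nominal AFR 2.92%
# MTTF 1,000,000 h  ->  nominal AFR 0.88%
# MTTF 1,500,000 h  ->  nominal AFR 0.58%
```

Compare that 0.58-0.88% datasheet figure with the 2-4% replacement rates actually observed in the field above.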
Manufacturer Mistakes happen.
A customer had 11,000 SATA drives replaced in Oct. 2006 after observing a high frequency of media errors during writes. Although it took a year to resolve, the customer and vendor agreed that these drives did not meet warranty conditions. The cause was attributed to the breakdown of a lubricant leading to unacceptably high head flying heights. In the data, the replacements of these drives are not recorded as failures.
SMART is not that smart
The Google team found that 36% of the failed drives did not exhibit a single SMART-monitored failure. They concluded that SMART data is almost useless for predicting the failure of a single drive.
So while your disk drive might crash without warning at any time, they did find that there are four SMART parameters where errors are strongly correlated with drive failure:

scan errors, reallocation count, offline reallocation, probational count

For example, after the first scan error, they found a drive was 39 times more likely to fail in the next 60 days than normal drives. The other three correlations are less striking, but still significant.

Googlers found little correlation between disk workload and failure rates

Failure rates do not increase when the average temperature increases. At very high temperatures there is a negative effect, but even that is slight.

To be continued....
Message 1 of 5
MrCyberdude
Tutor

Raid5 Failures Raid6 Raid7 XRaid2 Dual Triple Redundancy

Raid5 Failures vs Raid6 Raid7 XRaid2 Dual Triple Stripe with Parity Redundancy
Does RAID 6 stop working in 2019?

http://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019/
In late 2009, Sun engineer, DTrace co-inventor, flash architect and ZFS developer Adam Leventhal analyzed RAID 6 as a viable data protection strategy.
He lays it out in the Association for Computing Machinery’s Queue magazine, in the article Triple-Parity RAID and Beyond.

The good news: Mr. Leventhal found that RAID 6 protection levels will be as good as RAID 5 was until 2019.

The bad news: Mr. Leventhal focused on enterprise drives, whose unrecoverable read error (URE) spec has improved faster than that of the more common SATA drives.
SATA RAID 6 will stop being reliable sooner unless drive vendors get their game on.

Triple-Parity RAID and Beyond http://queue.acm.org/detail.cfm?id=1670144
Message 2 of 5
StephenB
Guru

Re: Raid5 Failures HDD MTBF XRaid2 Raid6 Dual Redundancy

The analysis as presented is misleading. Note that the scenario graphed is the probability of data loss in a 7-drive array after a single drive failure.

So in the graph the RAID-5 protection is already gone; it is simply attempting to compute the odds that there is another undetected bad sector on one of the remaining drives. There are some assumptions in the data model, particularly that unrecoverable read errors are simply random events that occur every X reads (and are therefore not related to the disk's condition). If that were true, scrubbing a 12 TB RAID-5 array would result in (on average) one read failure every time it was done. That is not my experience.
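
For the record, here is where that "one read failure per scrub" figure comes from, assuming UREs really are independent per-bit events at the usual consumer SATA spec of 1 error per 10^14 bits (a quick sketch of the arithmetic, nothing more):

```python
# Quick sketch: expected unrecoverable read errors (UREs) in one full scrub,
# if UREs were truly independent per-bit events at the datasheet rate.

TB = 10**12               # decimal terabytes, as drive capacities are specified
URE_PER_BIT = 1 / 10**14  # common consumer SATA spec: <1 error per 10^14 bits read

def expected_ures(bytes_read: float, ure_per_bit: float = URE_PER_BIT) -> float:
    """Expected URE count for reading the given number of bytes once."""
    return bytes_read * 8 * ure_per_bit

print(f"Expected UREs per 12 TB scrub: {expected_ures(12 * TB):.2f}")
# Expected UREs per 12 TB scrub: 0.96
```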

Graphing by year is also attention-getting, but rather silly.

The main takeaways are ones already noted in the forum:
    (a) RAID is not a substitute for backup.
    (b) In larger arrays you should consider RAID-6.
Message 3 of 5
MrCyberdude
Tutor

Re: Raid5 Failures HDD MTBF XRaid2 Raid6 Dual Redundancy

What I am saying is that when you read all the white papers out there regarding RAID5 rebuild failures, you should reconsider and use X-RAID2.
The chart was used to show the difference between RAID5 and RAID6, and if you read the white paper you will see that the results in reality are probably worse, as the paper was written with enterprise-grade drives in mind rather than normal consumer HDDs. Enterprise HDDs have far higher reliability, some on the order of <1 non-recoverable read error in 10^16 bits read, whereas most SATA drives are two orders of magnitude worse at <1 in 10^14. So it's a good enough graph to pique some interest in moving to X-RAID2.
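
To show what those two orders of magnitude mean in practice, here is a rough sketch (my own numbers: a hypothetical 6-bay array of 4TB drives, not anything from the white papers) of the chance that a RAID5 rebuild reads all the surviving disks without hitting a single URE:

```python
# Rough sketch: probability that a RAID5 rebuild completes without a single URE,
# assuming UREs are independent per-bit events (the same simplification as the
# white papers). The 6 x 4 TB array is a hypothetical example of my own.

import math

TB = 10**12

def rebuild_no_ure(drives: int, drive_bytes: float, ure_per_bit: float) -> float:
    """P(no URE) while re-reading the (drives - 1) surviving disks in full."""
    bits_read = (drives - 1) * drive_bytes * 8
    # (1 - p)**n with tiny p is numerically awkward; exp(-p * n) is equivalent here.
    return math.exp(-ure_per_bit * bits_read)

for label, ure in (("consumer SATA (1 in 10^14)", 1e-14),
                   ("enterprise   (1 in 10^16)", 1e-16)):
    p = rebuild_no_ure(drives=6, drive_bytes=4 * TB, ure_per_bit=ure)
    print(f"{label}: {p:.1%} chance of a clean rebuild")

# consumer SATA (1 in 10^14): 20.2% chance of a clean rebuild
# enterprise   (1 in 10^16): 98.4% chance of a clean rebuild
```

With dual redundancy (RAID6 / X-RAID2 DR), a URE hit during a rebuild can still be repaired from the second parity, which is the whole point.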

Every RAID5 automatic rebuild I have run has failed at some point. The arrays have been recoverable, but they have all failed to rebuild automatically. This is not good news for the average consumer who plugs in and forgets. It's all about education, and the migration from RAID5 to RAID6 needs to start soon given that HDD sizes are now reaching 4TB.
In one case, multiple restarts of the NAS eventually resulted in a successful rebuild, but that is not the norm. So far (touch wood) I have had no problems with RAID6 aka X-RAID2 DR rebuilds in the ReadyNAS Pro.
I believe this is due to the double distributed parity and block-level striping, which help prevent the bad old days of a single hot drive while rebuilding the array, which I am sure contributed to meltdowns; maybe if my drives had had RVS or better thermal compensation it would not have been a problem.
Message 4 of 5
StephenB
Guru

Re: Raid5 Failures HDD MTBF XRaid2 Raid6 Dual Redundancy

I have read the papers. All we disagree on is the storagemojo article, which I think is misleading.

We agree that people should consider RAID-6 if they are using a large array - because as you say it is quite possible to have a second disk failure before the first disk is replaced and the array is rebuilt. In my view, you also need to have at least one independent copy of your data, and preferably two, which is a point you also make in the first post.
Message 5 of 5