BaJohn
Virtuoso

Failures of various RAID modes.

I'm intrigued by the failures.
dbott67 wrote:
........ and 2 multiple disk failures where I had to replace the drives and restore from backup. In each case, I was able to recover without data loss.
...... in the last few months I have had three different servers at work suffer from major disk/RAID failure (or other hardware issues) and been able to recover .....

What RAID types were the multiple failures on?

I started looking at the RAID scenarios at the back end of last year, and came to the conclusion that, in a large multi-disk system with large disk sizes, I would be a fool NOT to go for RAID10.
Basically, the chances of a second disk failing during the rebuild of a RAID5 array that has already lost one disk are SO HIGH as to be nearer certainty than very unlikely.
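A rough back-of-envelope sketch of why I believe that (my own numbers, not from any vendor: it assumes the commonly quoted consumer-drive spec of one unrecoverable read error per 1e14 bits read, and real drives vary):

```python
import math

def p_ure_during_rebuild(bytes_read, ure_rate_bits=1e14):
    """Probability of hitting >= 1 unrecoverable read error (URE)
    while reading `bytes_read` bytes, via a Poisson approximation."""
    bits = bytes_read * 8
    return 1.0 - math.exp(-bits / ure_rate_bits)

TB = 1e12  # decimal terabytes, as drive vendors count them

# Rebuilding a degraded 6 x 4TB RAID5 array reads all 5 surviving disks:
print(p_ure_during_rebuild(5 * 4 * TB))  # ~0.80
# Rebuilding one side of a RAID10 mirror reads only the 4TB partner disk:
print(p_ure_during_rebuild(4 * TB))      # ~0.27
```

On those assumptions a full RAID5 rebuild is much more likely than not to hit trouble, which is what pushed me towards RAID10.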
So I went for 6 disks of 4TB in a RAID10 in early January 2015 (costing a lot of money).
However, as far as I can see in this forum, very few people have RAID10 configurations, and a lot have RAID5, which I thought was more or less obsolete.
I am still hoping I made the right choice, but it would be nice to know from genuine catastrophic system failures that it was money well spent.
Apologies, as I seem to be trying to justify my choice on the back of others' bad fortune 😞.
Message 1 of 9
StephenB
Guru

Re: Just wondering..

BaJohn wrote:
I'm intrigued by the failures.
dbott67 wrote:
........ and 2 multiple disk failures where I had to replace the drives and restore from backup. In each case, I was able to recover without data loss.
...... in the last few months I have had three different servers at work suffer from major disk/RAID failure (or other hardware issues) and been able to recover .....

What RAID types were the multiple failures on?

I started looking at the RAID scenarios at the back end of last year, and came to the conclusion that, in a large multi-disk system with large disk sizes, I would be a fool NOT to go for RAID10.
Basically, the chances of a second disk failing during the rebuild of a RAID5 array that has already lost one disk are SO HIGH as to be nearer certainty than very unlikely.
So I went for 6 disks of 4TB in a RAID10 in early January 2015 (costing a lot of money).
However, as far as I can see in this forum, very few people have RAID10 configurations, and a lot have RAID5, which I thought was more or less obsolete.
I am still hoping I made the right choice, but it would be nice to know from genuine catastrophic system failures that it was money well spent.
Apologies, as I seem to be trying to justify my choice on the back of others' bad fortune 😞.
Off topic, but RAID-5 also requires 2 disks to fail to kill the array. The odds of this happening are not as high as you seem to think, though there are many posts here from people who had it happen to them.

RAID-10 is marginally more reliable than RAID-5 because it will survive some 2-disk failures. There are 15 possible 2-disk failures in a 6-disk system: RAID-5 always fails, while RAID-10 survives 6 of them (all cases where both disks fail in the same RAID-0 half; a stripe-of-mirrors layout does even better, surviving the 12 cases where the failures land in different mirror pairs). RAID-6 survives them all, with less disk space overhead than RAID-10.

RAID-10 has potentially higher performance than RAID-5 or RAID-6, which is why enterprises often deploy it. However, since a high-end NAS will saturate a gigabit network with RAID-5 or RAID-6, you only see those performance gains if you are (a) running a 10 gigabit network or (b) running applications on the NAS itself (not over the network).
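Those counts are easy to verify by brute force. A quick sketch (mine, with the pair/half groupings as stated assumptions):

```python
# Enumerate all 2-disk failures in a 6-disk array and count survivors
# for the two common RAID-10 layouts.
from itertools import combinations

def survives_stripe_of_mirrors(failed):
    # RAID 1+0: mirrored pairs (0,1), (2,3), (4,5); dies only if a whole pair fails.
    return not any(set(p) <= set(failed) for p in [(0, 1), (2, 3), (4, 5)])

def survives_mirror_of_stripes(failed):
    # RAID 0+1: two mirrored RAID-0 halves; survives only if one half stays intact.
    return any(h.isdisjoint(failed) for h in [{0, 1, 2}, {3, 4, 5}])

failures = list(combinations(range(6), 2))
print(len(failures))                                         # 15 combinations
print(sum(survives_stripe_of_mirrors(f) for f in failures))  # 12 survived
print(sum(survives_mirror_of_stripes(f) for f in failures))  # 6 survived
# RAID-5 survives 0 of the 15; RAID-6 survives all 15.
```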

I run RAID-5 on my main NAS to keep the data reasonably available, but rely on backups to keep my data safe. That is also expensive, but in my opinion is safer than relying on RAID in any form.
Message 2 of 9
BaJohn
Virtuoso

Re: Just wondering..

@StephenB
I was trying to avoid a discussion about RAID10 versus RAID5/RAID6. It's all covered elsewhere.
Unfortunately you have taken into account the probability of failure but NOT how long the system is at risk once 1 disk has failed.
RAID10 is very quick to rebuild, as it has a direct mirror. Other RAIDs require significantly more time to rebuild, and are hence at risk for longer.
The suggestion is that on large systems, it is ALMOST a certainty that you will have a second failure whilst rebuilding a RAID6 system.
I am not an expert, and do not propose that I know enough about it to argue persuasively with you.
I did look into it for many days and came to the conclusion that for medium to large systems, RAID5 should be dead, RAID6 dying and RAID10 the (current) way to go. (My words)
I will try and find some articles that I read about this, and post these tomorrow.
It's almost my bedtime 🙂
Message 3 of 9
dbott67
Guide

Re: Failures of various RAID modes.

Hi BaJohn (and StephenB),

With respect to the ReadyNASes, the one dual-disk failure was on a ReadyNAS Pro4 with 3 x 1 TB drives, IIRC configured in X-RAID2 (essentially an expandable RAID5 array). When the owner replaced the suspect drive (it had thrown a bunch of re-allocated sector errors), a second drive barfed during the resync and the unit went into life support mode.

I came onsite and did a bit of troubleshooting, but determined that it would just be faster and easier to restore from backup. I had configured his offsite backup unit (kept at home) to be exactly the same as his work unit and we were doing nightly RSYNC backups, so we dropped it on his work network and they were back in business in about 5 minutes. Then, I reconfigured the other NAS with replacement drives and restored the data and configuration. After the backup job restored the data, he took his home unit back home and everything has been working smoothly since (a couple of years now).

The other failure was probably recoverable, but I was at the proverbial cross-roads. During RAIDiator development, there was a version released that required a factory default in order to take advantage of new features, and I suffered a glitch that put my NAS in life support mode. Seeing as I had a complete backup, I decided that maybe the powers above wanted me to do a factory reset and now was as good a time as any. Not a big issue for me, as it was on my home unit.

With respect to the failed servers, we had a few different problems.

1. Mail Server - Dell PowerEdge 2850 with PERC controller, RAID5 (3x146 GB) - Looks like the RAID controller died and seeing as I had just ordered the new integrated blade/SAN server (see below), I decided to restore onto a backup server and then migrate to VM upon arrival of new equipment. Re-installing the OS, mail software and restoring the data on the backup server took a few hours, but it was a minor inconvenience. When I migrated to the new VM hardware, I did fresh installs of the OS (Win2012-R2-DCE) and again restored from backup, and then cutover from old server to new VM with just a couple minutes of downtime to capture the last delta, restore and then swap IP addresses.

2. Phone PBX - last week, the solid state drive in our Mitel 3300 PBX died. I sent a copy of the backup to our vendor, who staged a new PBX for us and couriered it back. I dropped the replacement on the network and we were back in business.

3. SIP Server - last fall, one of our other Dell servers failed to start after a reboot. Again, I just fired up a backup server and restored the config and we were back online pretty quick.

At my place of work, we have some redundancy and resiliency, but we can't really afford the solutions that offer next-to-zero downtime with duplicate hardware, etc. We can tolerate a few hours of service interruption (our UPSes only offer around 30 minutes of protection) or the occasional hardware failure. In the event our main ILS system goes down, we have provisions to allow us to continue working and then upload the transactional data to the server when it comes back online.

What we can't afford is lost data, so I take great effort in making sure that the data is backed up and replicated off-site.

I recently purchased an integrated 4-bay blade server with 25-bay SAN storage. Currently, I've got it configured with 10 x 900 GB SAS drives in RAID50, with 2 hot spares for a total of 12 drives. This unit is hosting 8 VMs which are backed up to a Dell AppAssure DL-1000 appliance. I also backup just the data from various servers using RSYNC to the ReadyNAS 2100's and then replicate offsite. The DL-1000 provides a complete image of the VM plus snapshots every 3 hours.

The RSYNC backups allow me to quickly recover files at a granular level (with 6 months' worth of snapshots), and would also let us recover from a major catastrophe in our data centre (such as a fire), although that might take some time, as we would have to order new equipment, stage the servers, restore the data and find a datacentre to host everything.

Below is the Dell VRTX with 2 x M620 blades (total 24 cores, 128 GB RAM) and 8 x 900 GB SAS drives (I've since added 4 more). The top unit is the Dell AppAssure DL-1000, followed by the VRTX, the ReadyNAS 2100 and then a few of our Dell PowerEdge 2850/2650/2550 servers (there are 10). Most of the PowerEdge servers have been or will be migrated over to the VRTX.

Message 4 of 9
StephenB
Guru

Re: Failures of various RAID modes.

"Should be dead" are charged words. It's probably more constructive to talk about our actual experiences with RAID and perhaps other NAS failures. I'll add my experience to dbott's.

I'm a home NAS user, and I've had a ReadyNAS since 2010 (and have had more than one for most of that time). At the moment I have 5 (which does seem a bit silly).

Altogether I've had 9 RAID-years with RAID-5: 5 years with 4x2TB in an NV+, and four more with 6x3TB in a Pro-6. The NV+ is still using its original disks (after 5 years, with the head parking thresholds not tweaked). The Pro array was expanded from 4x1.5TB, and has gone through at least 6 disk replacements (as the entire array was upgraded to WD30EFRX drives). I've never had a RAID-5 failure. Though I don't rely on RAID to keep my data safe, I think these stats do indicate that it is not even close to a certainty that you will have a second failure while rebuilding a RAID-5 system. Does it happen sometimes? Yes. Is it more likely to fail than not? In my experience, clearly no.

The NV+ did fail last month btw (a PSU failure). It still had 3 months of warranty left, and I received a new PSU from Netgear which I installed yesterday. The array remained intact.

I also had a Duo which began its life in RAID-1. That array did fail (the OS partition was corrupted in an unexpected power failure). As a result I purchased two UPSes (and all my NAS are now connected to a UPS).

I intended to rebuild the Duo as jbod, but accidentally ended up in spanning RAID-0, and left it. It failed earlier this summer (disk failure), and has been rebuilt again as jbod.

About 6 drives have failed in the various NAS over their lifetime, with a combined operation of about 54 disk-years. Four of these were 1.5 TB Seagates (which had been used in Windows systems before being put into the NAS). One was a 2 TB Seagate, and one was a 3 TB Western Digital.

There are 16 drives currently installed in my various NAS - one Seagate and 15 Western Digital. The oldest drives are about 5 years old; the newest was installed yesterday. Presently there are six 2 TB drives, eight 3 TB drives, and two 6 TB drives. Over most of the past 5 years I've had 12 disks installed.

So
(a) 9 years of operation with RAID-5, and so far no array failures and at least 6 disk replacements
(b) ~2 years of operation with RAID-1 with one failure (system related)
(c) ~2 years of operation with spanning RAID-0 with one failure (disk drive related)

(d) about 6 drives failed in the various NAS, over a combined operation of about 54 disk-years. So far, this is about 1.2 failures per calendar year with 12-16 disks installed (quick arithmetic check below)
(e) 1 NAS failure over about 15 NAS-years of operation, which was fixed under warranty
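
The arithmetic behind (d), for anyone who wants to check it (my sketch, using only the numbers above):

```python
# Annualized failure rate from the drive history quoted above.
failures = 6
disk_years = 54.0

afr = failures / disk_years              # per-disk annual failure rate
print(f"{afr:.1%} per disk-year")        # ~11.1%
print(f"{afr * 12:.1f} to {afr * 16:.1f} expected failures/year with 12-16 disks")
```

That gives ~1.3-1.8 expected failures per year, so "about 1.2 per calendar year" is in the right ballpark.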

I plan to stay with RAID-5 for the main NAS, and jbod for backups (though I'll keep the NV+ running RAID-5, probably until it fails) - keeping at least 3 copies of all data for safety. If I see a pattern of RAID-5 breakdowns here (correlated with larger drives) that could change my mind. I also expect that progress with built-in redundancy in btrfs will stabilize over the next year or so, and that will create some new options worth considering.
Message 5 of 9
BaJohn
Virtuoso

Re: Failures of various RAID modes.

Thanks 'dbott67' and 'StephenB' for the info.
I was going to quote various statements made in another forum, as to how I came to a decision about picking RAID10 (and my comments on RAID5).
As it happens, the main person I would be quoting produced a document a few days ago which covers this (maybe as a result of my interaction with him).
So if anybody is interested, this is a covering link http://www.smbitjournal.com/?s=raid
with the first item being the one of interest http://www.smbitjournal.com/2015/03/practical-raid-choices-for-spindle-based-arrays.
Other articles of his are also of interest.
Please note, I am not trying to promote Mr Miller; I am not an associate of his and know nothing about him except that I have read a lot of his articles and interacted with him via a forum.
The articles appear to be well written, knowledgeable and informative, and as such I used his expertise to help me make my decision on what to buy.
Message 6 of 9
StephenB
Guru

Re: Failures of various RAID modes.

He just gives his recommendations without much underlying analysis. So it's a bit hard to know what to do with it (other than decide to follow it blindly or not).

Also, when he compares RAID-6 to RAID-10 safety he is talking about very large arrays ("above roughly 40 TB when consumer drives are used"). That is, with 4 TB drives he is talking about 12 drives or more with RAID-6. He specifically sets a 25-disk threshold (for using a storage consultant) further down. Lots of things change when you reach that scale. I'd agree that it is risky to create a RAID-5 or RAID-6 array with that many disks. Resyncing a 12x4TB RAID-6 volume, for instance, requires reading/writing 48 TB. RAID-10 would require 20x4TB (a lot more drives), but when you replace a single drive the resync only requires reading/writing 8 TB (copying the associated 4 TB mirror drive).
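To make those resync numbers concrete (a sketch of my own arithmetic, assuming a rebuild reads every surviving disk once and writes the replacement once):

```python
# Total data touched when resyncing one replaced drive.

def parity_resync_tb(n_disks, disk_tb):
    # RAID-5/6: read all surviving disks, write the replacement.
    return (n_disks - 1) * disk_tb + disk_tb

def mirror_resync_tb(disk_tb):
    # RAID-10: read the surviving mirror partner, write the replacement.
    return disk_tb + disk_tb

print(parity_resync_tb(12, 4))  # 48 TB for a 12 x 4TB RAID-6
print(mirror_resync_tb(4))      # 8 TB for RAID-10, regardless of array width
```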

When he talks specifically about small business applications at the end (probably similar to most home scenarios) he is clear that RAID-6 is fine for smaller arrays. I've had good luck with RAID-5, and it offers better capacity and performance than RAID-6, though obviously it is not as available. I don't say "safe" because RAID will fail sometimes, and I don't view RAID-5 (or RAID-10) as "safe". RAID is convenient for volume aggregation, and usually keeps data available during disk replacements and expansion. But it is not safe enough to rely on 100%.

My overall impression from posts in this forum is that NAS failures (power glitches or hard failures) create data loss at least as often as disk drive failures. My own experience reflects this (one of each), perhaps other people can chime in with their history.
Message 7 of 9
dbott67
Guide

Re: Failures of various RAID modes.

I'll chime in with my advice (take it for what it's worth... not much! 😉 ).

The level of RAID that one chooses is a balance between price, performance, capacity and availability/resiliency. If you need higher availability/resiliency (especially on larger arrays) then you should use a level of RAID that can tolerate multiple disk failures. Of course, this decreases capacity or increases price (or both), but it generally increases performance. There are other factors that can cause data loss that RAID does not protect against (e.g. some other sort of hardware failure, accidental or intentional deletion, malware such as CryptoLocker, fire/flood and theft). Additionally, there are other factors that would need to be addressed if high availability were paramount (redundant power supplies, redundant network links, NICs, UPSes, etc.).

My big sermon is always to maintain multiple backups in multiple locations. Keeping only one copy of your data on a single device is not a backup. Having another copy of your data stored elsewhere will always allow you to recover in the event of a disaster.

For the average home user, where there are going to be 4 drives or fewer, RAID 5 is most likely the best trade-off between price, performance and storage capacity. Again, this is for the average Joe.

For those people/organizations that require larger arrays, higher availability and/or increased performance, RAID 5 is not the recommended option, as the risk of failure increases with the number of disks. Of course, the cost per GB increases, but that generally goes with the territory. In my recent upgrade, I chose a balance between redundancy, capacity and price and decided to go with RAID 50 (I was unaware that it was deprecated). A rough sketch of the usable-capacity trade-offs is below.
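Here's a quick sketch of the capacity side of that trade-off (my assumptions: RAID 10 as mirrored pairs, RAID 50 as two equal RAID-5 spans, hot spares counted outside the array):

```python
# Usable capacity for common RAID levels, given n_disks of disk_tb each.

def usable_tb(level, n_disks, disk_tb):
    if level == "raid5":
        return (n_disks - 1) * disk_tb   # one disk of parity
    if level == "raid6":
        return (n_disks - 2) * disk_tb   # two disks of parity
    if level == "raid10":
        return (n_disks // 2) * disk_tb  # half lost to mirrors
    if level == "raid50":
        span = n_disks // 2              # two equal RAID-5 spans
        return 2 * (span - 1) * disk_tb
    raise ValueError(level)

for level in ("raid5", "raid6", "raid10", "raid50"):
    print(level, usable_tb(level, 6, 4), "TB usable from 6 x 4 TB")

# And my own 10 x 900 GB RAID 50 (hot spares excluded):
print(usable_tb("raid50", 10, 0.9))  # ~7.2 TB
```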

As mentioned previously, I can tolerate *some* downtime to recover from a hardware failure. I *could* increase availability by buying redundant equipment and having some sort of heartbeat flip over to the backup hardware, but it's not within our financial means to do so. We would then also have to address the other areas of failure (network links, etc.) to be truly redundant. It's not something that our board would likely want to sign off on, as the price to purchase and maintain the equipment and links would be too great.

-Dave
Message 8 of 9
StephenB
Guru

Re: Failures of various RAID modes.

dbott67 wrote:
The level of RAID that one chooses is a balance between price, performance, capacity and availability/resiliency. If you need higher availability/resiliency (especially on larger arrays) then you should use a level of RAID that can tolerate multiple disk failures. Of course, this decreases capacity or increases price (or both), but it generally increases performance. There are other factors that can cause data loss that RAID does not protect against (e.g. some other sort of hardware failure, accidental or intentional deletion, malware such as CryptoLocker, fire/flood and theft). Additionally, there are other factors that would need to be addressed if high availability were paramount (redundant power supplies, redundant network links, NICs, UPSes, etc.).

My big sermon is always to maintain multiple backups in multiple locations. Keeping only one copy of your data on a single device is not a backup. Having another copy of your data stored elsewhere will always allow you to recover in the event of a disaster.
I totally agree.

I'd also add that high-availability is usually much more important to business users than it is to home users (since taking the business systems off-line will reduce both productivity and sales).
Message 9 of 9