Forum Discussion

Retired_Member

Sep 02, 2016

Solved

Is XRaid2 dual-redundancy more reliable than RAID6?

1. Or is it basically just RAID6 under the covers? I read somewhere that ReadyNAS used its own scripts to handle expansion of capacity and a few other things for XRaid2, but I didn't know if that mea...

Tips from other users

mdgm-ntgr
Sep 02, 2016
X-RAID2 dual-redundancy uses RAID-6. You would disable X-RAID, delete the default volume, create a RAID-6 volume (preferably called "data" - no quotes) then turn X-RAID back on.

We use mdadm RAID. We have bit-rot protection (enabled at a per-share level), so if a checksum fails during normal use of the NAS, we will attempt to repair the problem by examining the mdadm layer. If the checksum is good for the mdadm layer, we will then use that to repair the problem. There is also scheduled volume maintenance that can be run, e.g. scrubbing. With bit-rot protection enabled, scrubbing checks that the data is good and tries to fix any problems it finds.

UREs can mess up a RAID rebuild. RAID-6 can withstand problems with two disks only. So if you are rebuilding it could only survive problems with one of the other disks that was already in the NAS before adding the replacement disk. It might be possible to try to attempt data recovery in a situation where a URE brings down the RAID but this may be partially or wholly unsuccessful. Backups are important.

I think at this point with RAID-6 you're unlikely to need to do a full restore from backup but best to be prepared just in case.

Retired_Member

Sep 02, 2016

Perfect, I was eventually able to get some data from some colleagues who work for a fairly well-known storage company. Apparently even RAID5 is still used in the enterprise today with larger capacity drives (despite what some articles have claimed) since they often have teams that can recover from a low level failure.

Though that may not be provided directly through the ReadyNAS interface, it sounds like the RAID solution here is handled with mdadm and even offers scrubbing, so I'm feeling pretty good about RAID6 then. Thanks for the insight.

Retired_Member

Sep 02, 2016

Apparently some drives have a feature called TLER, which helps keep RAID controllers from failing an entire drive if it encounters a single URE during a rebuild. Surely this would help against the 14TB failure probability that is often quoted for consumer drives:

"RAID-specific, time-limited error recovery (TLER) - NASware 3.0 also prevents hard drives from being dropped off the RAID due to extended error recovery. This provides more availability and less down time rebuilding the RAID."
"RAID controllers are designed to drop hard disk drives from a RAID array if they are unresponsive for more than a few seconds. This is done in the assumption that such a drive has either malfunctioned or is no longer reliable. Depending on the type of RAID array, it can cause the array to lose its data redundancy or force a rebuild of the RAID array from parity data."

"To prevent this, NASware 3.0 supports TLER, which is short for Time-Limited Error Recovery. It limits the read error recovery process in the Red drive to just 7 seconds, after which the error recovery process is aborted. This prevents the RAID controller from marking the drive as unreliable and dropping it from the RAID array. The RAID controller can also take over whatever error recovery is left to perform."

This is part of WD's new NASware 3.0 which is in their newer Red drives. Luckily this happens to be one of the drives I picked up. And that's probably a good thing, since Backblaze shows this model as having a 6% annual error rate (the sample of drives they had was relatively small, and inconsistent, so I'm a little skeptical of these stats).

StephenB
Guru - Experienced User
Sep 03, 2016
@loganmu wrote:

Apparently some drives have a feature called TLER, which help keep RAID controllers from failing an entire drive,

TLER just limits the recovery time: "In computing, error recovery control (ERC) (Western Digital: time-limited error recovery (TLER), Samsung/Hitachi: command completion time limit (CCTL)) is a feature of hard disks which allow a system administrator to configure the amount of time a drive's firmware is allowed to spend recovering from a read or write ..." (https://en.wikipedia.org/wiki/Error_recovery_control)

Netgear uses software RAID btw.

@loganmu wrote:

Surely this would help against the 14TB failure probability that is often quoted for consumer drives:

The URE spec is "Non-recoverable read errors per bits read". For consumer drives it is generally <1 in 10**14. That would be 12.5 trillion bytes, not 14. But it's really the whole sector that isn't read, not just one bit.

The people who use this stat generally assume that the drives exactly meet the spec (=1 in 10**14, not < 1 in 10**14). That is why I think they are getting a wrong result.
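To make the arithmetic above concrete, here is a rough Python sketch. It assumes every bit read fails independently at a fixed rate, which is a simplification (as noted, a real URE loses a whole sector). The function names and the 10x-better rate of 1 in 10**15 are illustrative assumptions, not measured values for any drive.

```python
import math

BITS_PER_BYTE = 8


def bytes_per_expected_ure(ure_rate_per_bit: float) -> float:
    """Average number of bytes read per unrecoverable read error."""
    return 1.0 / ure_rate_per_bit / BITS_PER_BYTE


def p_at_least_one_ure(bytes_read: float, ure_rate_per_bit: float) -> float:
    """Chance of at least one URE while reading bytes_read bytes.

    Assumes independent errors per bit: 1 - (1 - p)**n.
    log1p/expm1 keep the result accurate when p is tiny.
    """
    bits = bytes_read * BITS_PER_BYTE
    return -math.expm1(bits * math.log1p(-ure_rate_per_bit))


if __name__ == "__main__":
    spec_rate = 1e-14  # drive exactly at the spec limit (the pessimistic reading)
    tb = bytes_per_expected_ure(spec_rate) / 1e12
    print(f"Bytes per expected URE at the spec limit: {tb:.1f} TB")

    # Hypothetical comparison: a drive that beats the spec by a factor of 10.
    for rate in (1e-14, 1e-15):
        p = p_at_least_one_ure(12.5e12, rate)
        print(f"rate {rate:g}: P(at least one URE in 12.5 TB) = {p:.3f}")
```

Even a modest margin below the spec limit changes the result a lot, which is why treating "<1 in 10**14" as "=1 in 10**14" tends to overstate the risk.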

@loganmu wrote:

having a 6% annual error rate (the sample of drives they had was relatively small, and inconsistent, so I'm a little skeptical of these stats).

I am too. The stats are a useful relative guide (especially if you see a model with huge failure rates), but I'm not sure they measure small differences that accurately. Also, the Backblaze pod is a very different environment from the ReadyNAS.
- Retired_Member
  Sep 03, 2016
  Having TLER or CCTL (yup ERC =) ) makes a HUGE difference here, even with a software RAID solution. mdadm still has a timeout period. A drive that has this feature will be able to encounter a lot more URE's without upsetting the RAID rebuild, since it gives it an opportunity to correct the error at the next level instead of just taking the drive offline.
  
  Otherwise the RAID solution can fail the drive and stop the rebuild with as little as 1 URE (RAID5) or 2 subsequent UREs (RAID6) when they exceed the timeout. So TLER doesn't "just limit the recovery time" ... it "limits the recovery time" =)
  
  14TB was a typo, but yeah, 12.5 vs. 14 makes no difference in the scheme of having a feature like this. Even with the most pessimistic interpretation of the stat (which is what some people go for, given these manufacturer-provided numbers can be pretty optimistic), tolerance and recoverability are still at least possible if it happens.
  - StephenB
    Guru - Experienced User
    Sep 03, 2016
    @loganmu wrote:
    
    Having TLER or CCTL (yup ERC =) ) makes a HUGE difference here, even with a software RAID solution. mdadm still has a timeout period.
    
    I agree it's an important feature, for both read errors (which I guess is covered by UREs) and write errors (which are not). Some desktop drives don't have it; NAS-purposed and enterprise drives will (from all manufacturers).