
How does bitrot protection actually work?

StephenB
Guru

Re: How does bitrot protection actually work?

There are several posts on this topic, mostly speculative on the details.

What we know for sure is that

(a) BTRFS includes a checksum feature, and that is enabled when bitrot protection is on.
(b) if a checksum error occurs, then bitrot is detected, and the Netgear algorithm attempts to repair it from the RAID parity blocks. This is different from normal RAID repair, which is triggered by a read failure (not a checksum error).

One or two users have reported cases where this algorithm failed to recover data with a correct checksum. So far I have not seen any users reporting a case where it did recover data.
Message 26 of 46
BaJohn
Virtuoso

Re: How does bitrot protection actually work?

StephenB wrote:
There are several posts on this topic, mostly speculative on the details.

Hence my comments in this forum about having a definitive technical source.

StephenB wrote:
What we know for sure is that

(a) BTRFS includes a checksum feature, and that is enabled when bitrot protection is on.
(b) if a checksum error occurs, then bitrot is detected, and the Netgear algorithm attempts to repair it from the RAID parity blocks. This is different from normal RAID repair, which is triggered by a read failure (not a checksum error).

One or two users have reported cases where this algorithm failed to recover data with a correct checksum. So far I have not seen any users reporting a case where it did recover data.

BUT what prompts a checksum error to be discovered?
I.e., does BTRFS regularly do checksum testing unprompted? Is it only on a data write? Is it on a data read? Etc.
Message 27 of 46
StephenB
Guru

Re: How does bitrot protection actually work?

BTRFS generates the checksums on writes, and verifies them on reads. This feature is built into BTRFS itself.
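
To make the write/read behavior concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not BTRFS code: the `ChecksummedStore` class is hypothetical, and `zlib.crc32` stands in for whatever checksum the file system actually uses.

```python
import zlib


class ChecksummedStore:
    """Toy block store: checksum recorded on write, verified on read."""

    def __init__(self):
        self.blocks = {}      # block number -> bytes as stored on "disk"
        self.checksums = {}   # block number -> CRC32 recorded at write time

    def write(self, block_no, data):
        self.blocks[block_no] = bytes(data)
        self.checksums[block_no] = zlib.crc32(data)

    def read(self, block_no):
        data = self.blocks[block_no]
        if zlib.crc32(data) != self.checksums[block_no]:
            raise IOError(f"checksum mismatch on block {block_no}")
        return data


store = ChecksummedStore()
store.write(0, b"family photo")
store.write(1, b"tax records")

# Simulate bitrot: flip one bit on "disk" behind the file system's back.
rotted = bytearray(store.blocks[1])
rotted[0] ^= 0x01
store.blocks[1] = bytes(rotted)

print(store.read(0))          # block 0 is untouched and reads normally
try:
    store.read(1)
except IOError as err:
    print("detected:", err)   # the damage is noticed only when block 1 is read
```

Nothing in this sketch repairs anything. Detection and repair are separate steps, and the repair side is the part Netgear adds.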

Netgear's protection algorithm is their own, and they aren't saying much about how it works, other than what I said above. It's a unique feature for Netgear, and my guess is that they want to keep it that way.

There is a similar bitrot protection being built into BTRFS (using a raid-like mode that is integrated into the file system). But that raid-like mode is still experimental, and OS6 is using traditional software raid instead.
Message 28 of 46
BaJohn
Virtuoso

Re: How does bitrot protection actually work?

Thanks Stephen.
So to answer my own question:-
"If I put data on my ReadyNAS (with snapshots) and never updated it for 5 years, would bitrot be detected?"
Yes BUT only when I go to read some data, and more significantly only those blocks that are being read would be checked.
Then the BTRFS passes the error to the RNOS which (in my case - RAID10) would go off and repair with data from the mirror.
Thanks again.
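
A small Python sketch of that point, under the assumption that verification happens only when a block is read. The `disk` dictionary and the `verify` and `scrub` helpers are made up for illustration, and `zlib.crc32` is a stand-in checksum. A scrub (which comes up later in this thread) is modeled here as nothing more than a read of every block.

```python
import zlib

# Hypothetical on-disk state: block number -> (stored bytes, CRC32 recorded at write time).
disk = {
    n: (data, zlib.crc32(data))
    for n, data in enumerate([b"2019 invoices", b"holiday photos", b"old emails"])
}

# Block 2 silently rots while nobody is reading it ('o' becomes 'n').
disk[2] = (b"nld emails", disk[2][1])


def verify(block_no):
    """Recompute the checksum of one block and compare it with the recorded one."""
    data, recorded = disk[block_no]
    return zlib.crc32(data) == recorded


def scrub():
    """Read and verify every block, returning the numbers of the bad ones."""
    return [n for n in sorted(disk) if not verify(n)]


# Normal use only touches blocks 0 and 1, so nothing looks wrong.
print([verify(n) for n in (0, 1)])   # [True, True]
# A full pass over the volume is what finds the damage in block 2.
print(scrub())                       # [2]
```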
Message 29 of 46
anonym
Aspirant

Re: How does bitrot protection actually work?

Hi, I've got an RN102 with only one disk installed.
The NAS initialized with a single volume under XRAID2 JBOD.
Checksums are enabled on the volume because the default set of shares have bitrot protection enabled.
On one of the shares, I disabled bitrot protection (since bitrot can't be fixed without a redundant copy), set snapshots to never, and then restored data onto it over USB.

I presume BTRFS has created checksums for these files? Does that mean that, even with bitrot protection disabled, BTRFS will detect bitrot and alert me to the problem?
Is there any benefit in enabling bitrot protection in this scenario?
Or should I be switching checksums off as well? (although I might add a second disk later...then I might have the option of enabling automatic bitrot protection on the existing data)

Thanks
Paul.
Message 30 of 46
StephenB
Guru

Re: How does bitrot protection actually work?

I believe checksums are enabled/disabled at the volume level only (at least I am not seeing any subvolume controls). If so, then btrfs should alert you to checksum failures.

I run jbod also, but have bitrot protection and snapshots enabled on shares. (if you want CoW on all the time, then you do need bitrot protection enabled, since the features are coupled).
Message 31 of 46
sgogo
Aspirant

Re: How does bitrot protection actually work?

Do you guys know if using non-ECC memory with BTRFS and bitrot protection "on" has the ability to damage good data?

There has always been some discussion with ZFS that a scrub with non-ECC memory could potentially re-write the entire drive with bad data. I do not think that is true based on the method ZFS uses to replace data... it will not overwrite unless the new data is confirmed good.

However, I do not understand the methodology for BTRFS scrubbing... is it the same as ZFS? Could a bad memory module without ECC cause BTRFS to scrub the disk(s) with bad data?
Message 32 of 46
StephenB
Guru

Re: How does bitrot protection actually work?

Bad memory in the NAS can always corrupt the file system. It doesn't matter what the file system is (ZFS, BTRFS, EXT, ...).
Message 33 of 46
sgogo
Aspirant

Re: How does bitrot protection actually work?

Yes, but there has been discussion that non-ecc memory could corrupt an entire volume systematically.

During a scrub, the data is regularly checked for bit-rot and if the memory is bad, it would calculate bit rot (incorrectly)...then replace good data with bad data.

ZFS won't do that, based on the way it replaces the "bad" memory spot (it would have to find something it calculates as "good" to replace the bad, and bad memory would never find something "good" to use).

But how does Netgear's implementation of BTRFS do that?
Message 34 of 46
StephenB
Guru

Re: How does bitrot protection actually work?

Well, I am very skeptical that the bad non-ecc memory would magically fail systematically on the checksum and nothing else.

Having said that, I believe Netgear's implementation repairs bitrot using the RAID protection. So if the checksum fails to validate, it attempts to rebuild the sector that failed from the other RAID blocks in that stripe. If that doesn't result in a checksum that passes, then the bitrot repair fails. That sounds similar to the way you describe ZFS.

Of course, once either approach finds something good to use, that pesky bad memory might just corrupt it before it gets rewritten. So I am sticking with my position that bad memory can corrupt any file system. It seems to me that the simplest way to corrupt the volume with bad memory is on the initial write (the data being corrupted in memory before it is ever written to the disk).
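
The initial-write case is easy to show with a toy Python example. This is only a sketch with made-up data, and `zlib.crc32` is a stand-in for the file system checksum: if the buffer is damaged in RAM before the checksum is computed, the checksum is calculated over the damaged bytes and verifies cleanly from then on.

```python
import zlib

original = b"balance: 1000"

# Hypothetical bad RAM flips one bit in the buffer before the file system sees it.
in_memory = bytearray(original)
in_memory[9] ^= 0x08   # '1' (0x31) becomes '9' (0x39)

# The file system checksums and stores whatever it was handed.
stored_data = bytes(in_memory)
stored_checksum = zlib.crc32(stored_data)

# Every later read verifies fine, although the content is wrong.
print(stored_data)                                   # b'balance: 9000'
print(zlib.crc32(stored_data) == stored_checksum)    # True
print(stored_data == original)                       # False
```

No checksum scheme on the disk side can catch this, which is why it applies to any file system.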

If you are concerned about the impact of non-ecc memory, then perhaps buy a readynas that has ecc (e.g., the RN516)
Message 35 of 46
BaJohn
Virtuoso

Re: How does bitrot protection actually work?

StephenB wrote:
If you are concerned about the impact of non-ecc memory, then perhaps buy a readynas that has ecc (e.g., the RN516)

Just to satisfy my curiosity as I have RN516.
If there was an error in the ECC memory, would I know about it, or would it be hidden from the user?
Message 36 of 46
StephenB
Guru

Re: How does bitrot protection actually work?

I don't know for sure, but I believe corrected errors are logged.
Message 37 of 46
sgogo
Aspirant

Re: How does bitrot protection actually work?

StephenB wrote:
...

If you are concerned about the impact of non-ecc memory, then perhaps buy a readynas that has ecc (e.g., the RN516)


I already have 3 ReadyNAS without ECC, so that is not an option until my next purchase.

I understand that bad memory can corrupt anything it writes, but most of my data is write once, save a long time, read often. Business records, photos, etc.

My concern is turning the bit rot protection on and then, due to bad memory, having every checksum fail during a scrub. This could conceivably cause a re-write of an entire disk with the bad memory. Then in one shot I have corrupted everything.

What do you think?
Message 38 of 46
mdgm-ntgr
NETGEAR Employee Retired

Re: How does bitrot protection actually work?

I don't think that's possible. If the checksums at the filesystem level are all bad then one would expect the checksums at the md level to all be bad as well.

In any case bitrot protection is a great feature, but backups are still important. No important data should be stored on just the one device.
Message 39 of 46
sgogo
Aspirant

Re: How does bitrot protection actually work?

mdgm wrote:
I don't think that's possible. If the checksums at the filesystem level are all bad then one would expect the checksums at the md level to all be bad as well.


I think I understand...

Just so I am clear, the process would be that the bit rot protection routine first checks the primary data, then, if it finds an error, it goes out to the redundant data location and checks THAT data.

It will only write from the redundant location to the original location if it finds a correct checksum at the redundant location.

If it finds an incorrect checksum at both locations (it will find both locations incorrect, since the memory is defective) then no data is written and an error is generated.

This is the way the ZFS system works and inspires confidence. Do I have it correct?
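
Here is how I picture those three steps, as a Python sketch for a simple two-copy mirror. This is my own illustration of the logic described above, not Netgear or ZFS code: `repair_from_mirror` is a hypothetical helper and `zlib.crc32` stands in for the real checksum.

```python
import zlib


def repair_from_mirror(primary, mirror, recorded_crc):
    """Return (data, status) following the three steps described above."""
    if zlib.crc32(primary) == recorded_crc:
        return primary, "primary ok"
    if zlib.crc32(mirror) == recorded_crc:
        # Only a copy that verifies is ever written back over the primary.
        return mirror, "repaired from mirror"
    # Neither copy verifies: write nothing and report the error.
    return None, "unrecoverable"


good = b"scanned receipt"
crc = zlib.crc32(good)
bad = b"scanned receipT"   # one flipped bit in the last byte

print(repair_from_mirror(good, good, crc))   # (b'scanned receipt', 'primary ok')
print(repair_from_mirror(bad, good, crc))    # (b'scanned receipt', 'repaired from mirror')
print(repair_from_mirror(bad, bad, crc))     # (None, 'unrecoverable')
```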


mdgm wrote:
In any case bitrot protection is a great feature, but backups are still important. No important data should be stored on just the one device.


I am with you. Minimum of three (3) copies with at least one off site.

However, you can easily corrupt multiple copies if your primary source gets damaged by the file system and you do not know. As an example:

-On day 1, I have (3) 1TB drives A, B, & C with the same info.

-On day 2, copy A is damaged systematically by the file system without me knowing (but the drive is fine with no SMART errors).

-On day 3, drive B fails in the normal way, so I copy my data from drive A to drive B.
Message 40 of 46
mdgm-ntgr
NETGEAR Employee Retired

Re: How does bitrot protection actually work?

sgogo wrote:

Just so I am clear, the process would be that the bit rot protection routine first checks the primary data, then, if it finds an error, it goes out to the redundant data location and checks THAT data.

It will only write from the redundant location to the original location if it finds a correct checksum at the redundant location.

If it finds an incorrect checksum at both locations (it will find both locations incorrect, since the memory is defective) then no data is written and an error is generated.

This is the way the ZFS system works and inspires confidence. Do I have it correct?

Yes
Message 41 of 46
sgogo
Aspirant

Re: How does bitrot protection actually work?

Thanks mdgm!
Message 42 of 46
mdgm-ntgr
NETGEAR Employee Retired

Re: How does bitrot protection actually work?

Actually I enquired with one of our product engineers and had a clarification:

If there is a filesystem checksum mismatch, we try to re-assemble that RAID stripe in different ways until we get a checksum match. If we never get a checksum match, we give up and inform the user that we detected an error but couldn't correct it, as you've seen reported elsewhere. We never generate data to make the data match the checksum. It's about as safe as it can get.
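
As a rough illustration only (this is not our actual code), the idea can be sketched in Python for a single-parity stripe. The `xor_blocks` and `repair_block` helpers are hypothetical, `zlib.crc32` stands in for the file system checksum, and with single parity there is only one way to rebuild a block, so the "different ways" loop collapses to one attempt. A layout with more redundancy would try each possible reconstruction in turn.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def repair_block(data_blocks, parity, bad_index, recorded_crc):
    """Rebuild one data block from the rest of the stripe; accept it only if it verifies."""
    others = [b for i, b in enumerate(data_blocks) if i != bad_index]
    candidate = xor_blocks(others + [parity])
    if zlib.crc32(candidate) == recorded_crc:
        return candidate
    return None   # give up and report; never invent data to fit the checksum


stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)
crcs = [zlib.crc32(b) for b in stripe]

# Block 1 rots on disk; parity and the other blocks are intact.
damaged = [stripe[0], b"BBxB", stripe[2]]
print(repair_block(damaged, parity, 1, crcs[1]))     # b'BBBB'

# If another block in the stripe is damaged too, the rebuild fails the check.
worse = [b"AAAz", b"BBxB", stripe[2]]
print(repair_block(worse, parity, 1, crcs[1]))       # None
```

The key property is in `repair_block`: a rebuilt block is used only if it matches the checksum recorded at write time, otherwise nothing is written.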
Message 43 of 46
sgogo
Aspirant

Re: How does bitrot protection actually work?

Mdgm-

That is great news! Thanks for following up!

SteveG
Message 44 of 46
ReadyJustin
Guide

Re: How does bitrot protection actually work?

Just got a RN316 and found this great thread. Thanks everyone for your helpful discussion of technical details. I plan on running a memtest any time before scrubbing.

 

Was also wondering about what scenarios bit rot protection would work in, especially the scenarios snakyjake proposed in message 6.

 

Has anyone managed to find out any info about this from Netgear? It would be nice to know specifically if it will ever verify against checksums of data in snapshots, and not just RAID parity.

 

Is there a notification any time there's a checksum mismatch and not just when an error couldn't be corrected, or just a log entry generated in that scenario?

 

I did find a document at an EU Netgear site where it explains that BTRFS corrupt checksum events are sent to the md layer to find the correct data as mdgm said (thanks also for that information about use cases): ReadyNAS_Bit_Rot_Protection_Overview.pdf

 

This sheds a little more light on the mechanism, but not necessarily the technicals. Hope it gives slightly more clarity to the OP's question.

Message 45 of 46
pec967
Luminary

Re: How does bitrot protection actually work?

Can bit rot protection be enabled on Home folders in OS 6.4.x? Unlike shared folders, there is not a check box to enable bit rot protection on Home folders.

 

I replaced a ReadyNAS Duo v1 with a RN312 in a RAID 1 configuration last year for home use. I only use the RN312 for user backups, and these backup files are written to folders in the users' Home directories. I then use ReadyNAS Vault to provide an off-site backup. I would certainly like to enable bit rot protection since many of the photos and music files in these incremental backups are static. Perhaps I need to create shared folders and then adjust the permissions to only allow access by the individual users?

 

I don't understand why ReadyNAS does not share more information on exactly how checksums, bit rot, and RAID reconstruction with checksums work in OS 6 for RAID 1, 5, and 6. While the information in this thread is helpful, I still have a number of questions. In the enterprise storage array space, vendors like NetApp have always published in the open literature the specifics of how their checksums (data and metadata), file identity blocks, RAID scrubbing, and bit rot work. For example, a Usenix paper in 2008 reported data for three years on 1.5 million disk drives in NetApp storage arrays at customer sites. Over this time, they found 400,000 checksum errors, of which 8% were discovered during RAID reconstruction, often leading to data loss. The file identity blocks identified an order of magnitude fewer errors, due to things like lost or misdirected writes. The superior error handling performance of OS 6 is a selling point for ReadyNAS, particularly given the slower performance for the price compared to the competition, and you should step up and let your customers understand exactly how it works.

Message 46 of 46