I understand the concept, but am curious how it is actually implemented on 6.2.0

[quote="http://en.wikipedia.org/wiki/Copy-on-write":1cqyxbmf]Copy-on-write (sometimes referred to as "COW") is an optimization strategy used in computer programming. Copy-on-write stems from the understanding that when multiple separate tasks use initially identical copies of some information (i.e., data stored in computer memory or disk storage), treating it as local data that they may occasionally need to modify, then it is not necessary to immediately create separate copies of that information for each task. Instead they can all be given pointers to the same resource, with the provision that on the first occasion where they need to modify the data, they must first create a local copy on which to perform the modification (the original resource remains unchanged). [/quote:1cqyxbmf]

This is the core idea. When snapshots are taken, the snapshot folder and the main folder both initially have pointers to the same data on the disk. The btrfs wiki calls this "cloning" to distinguish it from Linux hard links. When the file is modified, it becomes fragmented: the unchanged blocks remain referenced by both folders. For the blocks that have been changed, the original block ends up referenced only by the snapshot, and the changed block is referenced by the main folder.

When you have multiple snapshots, the idea is simply extended to cover them all.

CoW is not limited to snapshots; there is a --reflink option in the cp command which has the same properties. Initially the two copies share the same data blocks, but as the files are modified only the unmodified blocks remain in common. The result is fragmentation, but efficient use of disk space.

I'm not sure why Netgear linked bit-rot protection to CoW; it is an odd admixture. From what little has been posted here, bit-rot protection depends on btrfs file checksums and RAID, not CoW.

The obvious use of CoW is to create snapshots, which is a space-efficient mechanism that allows you to roll back to previous versions of the files. If you have large files with a few differences between them, then CoW could be used (e.g. cp --reflink) to reduce disk space. If you have a folder structure that contains source code, CoW is one way to create a development branch, again one that is space efficient. It isn't well suited to files that are being continuously updated (for instance torrent files being downloaded, or databases that are always changing). Snapshots and cp --reflink are very fast operations; the performance hit happens later on when the files are modified.
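To make the block sharing concrete, here is a toy sketch in Python. It is not how btrfs is implemented; the CowVolume class, its methods, and the block ids are all made up for illustration. It only shows the bookkeeping described above: a snapshot copies pointers, and a later write sends the new data to a new block while the snapshot keeps the old one.

```python
class CowVolume:
    """Toy copy-on-write store (hypothetical, not btrfs internals)."""

    def __init__(self):
        self.blocks = {}   # block id -> bytes
        self.refs = {}     # block id -> number of files pointing at it
        self.files = {}    # file name -> list of block ids
        self._next_id = 0

    def _alloc(self, data):
        bid = self._next_id
        self._next_id += 1
        self.blocks[bid] = data
        self.refs[bid] = 1
        return bid

    def create(self, name, chunks):
        self.files[name] = [self._alloc(c) for c in chunks]

    def snapshot(self, src, dst):
        # Only the pointers are copied; no data block is duplicated.
        self.files[dst] = list(self.files[src])
        for bid in self.files[dst]:
            self.refs[bid] += 1

    def write(self, name, index, data):
        old = self.files[name][index]
        self.files[name][index] = self._alloc(data)  # new data goes to a new block
        self.refs[old] -= 1
        if self.refs[old] == 0:  # no snapshot still needs the old block
            del self.blocks[old], self.refs[old]

    def read(self, name):
        return b"".join(self.blocks[b] for b in self.files[name])


vol = CowVolume()
vol.create("main", [b"aaaa", b"bbbb", b"cccc"])
vol.snapshot("main", "snap")       # instant: three pointers copied
vol.write("main", 1, b"BBBB")      # only now is a fourth block allocated
shared = set(vol.files["main"]) & set(vol.files["snap"])
```

After the write, the two unchanged blocks are still shared, the original middle block is referenced only by the snapshot, and the volume holds four blocks instead of six.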

As per the release notes: "Support bitrot data protection. Automatically detect and correct corruption due to media degradation." To me this means that as soon as you enable BitRot protection, the NAS will scan, detect, and fix corruption in the data.

Yea, I guess I was more curious what the mechanism is; is it a BTRFS feature? Something else?

Nhellie wrote:
As per the release notes: "Support bitrot data protection. Automatically detect and correct corruption due to media degradation." To me this means that as soon as you enable BitRot protection, the NAS will scan, detect, and fix corruption in the data.

Yes. But the "how" hasn't been disclosed, and that is what Scott is asking. We know it's not using the btrfs experimental modes, so it appears to be a proprietary technique Netgear implemented that does something similar.

It would be useful to have more information, so people will have a better idea what it can/can't do. It could be quite useful in some circumstances (for instance reducing data loss when disk cloning is needed). But it's hard to know without some explanation.

Yea, thanks. exactly what I was getting at...

How does bitrot protection actually work?

45 Replies


StephenB
Guru - Experienced User
Feb 21, 2015
Well, I am very skeptical that the bad non-ecc memory would magically fail systematically on the checksum and nothing else.

Having said that, I believe Netgear's implementation repairs bitrot using the RAID protection. So if the checksum fails to validate, it attempts to rebuild the sector that failed from the other RAID blocks in that stripe. If that doesn't result in a checksum that passes, then the bitrot repair fails. That sounds similar to the way you describe ZFS.
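As a rough illustration of that flow, here is a Python sketch. Netgear has not published its algorithm, so everything here is an assumption: a single-parity (RAID-5 style) stripe, one stored checksum per data block, and zlib.crc32 standing in for whatever checksum the file system really uses. The function names are hypothetical.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def read_with_repair(stripe, parity, checksums, i):
    """Return block i, rebuilding it from the rest of the stripe if its checksum fails."""
    if zlib.crc32(stripe[i]) == checksums[i]:
        return stripe[i]                      # checksum validates, nothing to do
    others = [b for j, b in enumerate(stripe) if j != i]
    candidate = xor_blocks(others + [parity])  # rebuild from the other blocks + parity
    if zlib.crc32(candidate) != checksums[i]:
        # The rebuilt block does not validate either: the repair fails.
        raise IOError(f"bitrot repair failed for block {i}")
    stripe[i] = candidate                      # rewrite only data that validates
    return candidate


data = [b"disk0-ok", b"disk1-ok", b"disk2-ok"]
parity = xor_blocks(data)
sums = [zlib.crc32(b) for b in data]

stripe = list(data)
stripe[1] = b"disk1-oK"                        # silent bit flip, no read error reported
repaired = read_with_repair(stripe, parity, sums, 1)
```

The point of the sketch is the trigger: the disk returned data without any read error, and only the checksum mismatch started the rebuild.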

Of course, once either approach finds something good to use, that pesky bad memory might just corrupt it before it gets rewritten. So I am sticking with my position that bad memory can corrupt any file system. It seems to me that the simplest way to corrupt the volume with bad memory is on the initial write (the data being corrupted in memory before it is ever written to the disk).
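The initial-write case is easy to sketch in Python. This is only an illustration with made-up names, using zlib.crc32 as a stand-in checksum and a manual bit flip to play the part of a bad RAM cell: if the buffer is damaged before the checksum is computed, the checksum is computed over the damaged data and will validate forever after.

```python
import zlib


def write_block(buffer):
    """Store a block together with a checksum computed at write time."""
    payload = bytes(buffer)
    return {"data": payload, "crc": zlib.crc32(payload)}


def verify(block):
    return zlib.crc32(block["data"]) == block["crc"]


original = b"important data"
in_memory = bytearray(original)
in_memory[0] ^= 0x01               # hypothetical bad RAM cell flips a bit before the write
stored = write_block(in_memory)
still_valid = verify(stored)       # True: the checksum matches the corrupted data
```

No amount of checksum verification or RAID repair can notice this afterwards, because the stored data and its checksum agree with each other.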

If you are concerned about the impact of non-ECC memory, then perhaps buy a ReadyNAS that has ECC (e.g., the RN516).
sgogo
Aspirant
Feb 21, 2015
Yes, but there has been discussion that non-ecc memory could corrupt an entire volume systematically.

During a scrub, the data is regularly checked for bit-rot and if the memory is bad, it would calculate bit rot (incorrectly)...then replace good data with bad data.

ZFS won't do that, based on the way it replaces the "bad" memory spot (it would have to find something it calculates as "good" to replace the bad, and bad memory would never find something "good" to use).

But how does Netgear's implementation of BTRFS do that?
StephenB
Guru - Experienced User
Feb 21, 2015
Bad memory in the NAS can always corrupt the file system. It doesn't matter what the file system is (ZFS, BTRFS, EXT, ...).
sgogo
Aspirant
Feb 20, 2015
Do you guys know if using non-ECC memory with BTRFS and bitrot protection "on" has the ability to damage good data?

There has always been some discussion with ZFS that a scrub with non-ECC memory could potentially re-write the entire drive with bad data. I do not think that is true based on the method ZFS uses to replace data... it will not overwrite unless the new data is confirmed good.

However, I do not understand the methodology for BTRFS scrubbing... is it the same as ZFS? Could a bad memory module without ECC cause BTRFS to scrub the disk(s) with bad data?
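The "will not overwrite unless the new data is confirmed good" rule can be sketched like this in Python. It is not taken from ZFS or BTRFS source; it assumes a simple two-way mirror with one stored checksum per block, uses zlib.crc32 as a stand-in checksum, and the scrub function is hypothetical. It also says nothing about what happens when the memory doing the checking is itself faulty.

```python
import zlib


def scrub(copies, checksums):
    """Scrub mirrored copies; return the indices of blocks that could not be repaired."""
    unrepairable = []
    for i, expected in enumerate(checksums):
        # Look for any copy of block i that validates against the stored checksum.
        good = next((c[i] for c in copies if zlib.crc32(c[i]) == expected), None)
        if good is None:
            unrepairable.append(i)   # nothing validates: leave every copy untouched
            continue
        for c in copies:
            if zlib.crc32(c[i]) != expected:
                c[i] = good          # overwrite only with data confirmed good
    return unrepairable


blocks = [b"alpha", b"bravo", b"delta"]
sums = [zlib.crc32(b) for b in blocks]
mirror_a = list(blocks)
mirror_b = list(blocks)
mirror_a[0] = b"alphA"               # rot on one side only: repairable
mirror_a[2] = b"deltA"               # rot on both sides: not repairable
mirror_b[2] = b"Delta"
failed = scrub([mirror_a, mirror_b], sums)
```

Block 0 is repaired from the good mirror, while block 2 is reported and left alone because no candidate matches its checksum.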
StephenB
Guru - Experienced User
Feb 17, 2015
I believe checksums are enabled/disabled at the volume level only (at least I am not seeing any subvolume controls). If so, then btrfs should alert you to checksum failures.

I run JBOD also, but have bitrot protection and snapshots enabled on shares. (If you want CoW on all the time, then you do need bitrot protection enabled, since the features are coupled.)
anonym
Aspirant
Feb 17, 2015
Hi, I've got an RN102 with only one disk installed.
The NAS initialized with a single volume under XRAID2 JBOD.
Checksums are enabled on the volume because the default set of shares have bitrot protection enabled.
On one of the shares, I disabled bitrot protection (since bitrot can't be fixed without a redundant copy), set snapshots to never, and then restored data onto it over USB.

I presume BTRFS has created checksums for these files? Does that mean that, even with bitrot protection disabled, BTRFS will detect bitrot and alert me to the problem?
Is there any benefit in enabling bitrot protection in this scenario?
Or should I be switching checksums off as well? (although I might add a second disk later...then I might have the option of enabling automatic bitrot protection on the existing data)

Thanks
Paul.
BaJohn
Virtuoso
Feb 16, 2015
Thanks Stephen.
So to answer my own question:-
"If I put data on my ReadyNAS (with snapshots) and never updated it for 5 years, would bitrot be detected?"
Yes BUT only when I go to read some data, and more significantly only those blocks that are being read would be checked.
Then the BTRFS passes the error to the RNOS which (in my case - RAID10) would go off and repair with data from the mirror.
Thanks again.
StephenB
Guru - Experienced User
Feb 16, 2015
BTRFS generates the checksums on writes, and verifies them on reads. This feature is built into BTRFS itself.
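In outline, that write/read behaviour looks like the following Python sketch. It is not BTRFS code: the ChecksummedStore class is made up, and zlib.crc32 is only a stand-in for the file system's checksum.

```python
import zlib


class ChecksummedStore:
    """Toy block store: checksum generated on write, verified on read."""

    def __init__(self):
        self.data = {}   # block number -> bytes as they sit "on disk"
        self.sums = {}   # block number -> checksum recorded at write time

    def write(self, key, payload):
        self.data[key] = payload
        self.sums[key] = zlib.crc32(payload)

    def read(self, key):
        payload = self.data[key]
        if zlib.crc32(payload) != self.sums[key]:
            raise IOError(f"checksum mismatch on block {key!r}")
        return payload


store = ChecksummedStore()
store.write(0, b"five-year-old archive")
store.write(1, b"never read again")
store.data[0] = b"five-year-old archivf"   # simulated rot on disk
store.data[1] = b"never read agaim"        # also rotted, but nobody reads it

try:
    store.read(0)
    detected = False
except IOError:
    detected = True
```

Block 0 is flagged the moment it is read. Block 1 is just as damaged, but nothing is reported until something (a read or a scrub) actually touches it.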

Netgear's protection algorithm is their own, and they aren't saying much about how it works, other than what I said above. It's a unique feature for Netgear, and my guess is that they want to keep it that way.

There is a similar bitrot protection being built into BTRFS (using a raid-like mode that is integrated into the file system). But that raid-like mode is still experimental, and OS6 is using traditional software raid instead.

BaJohn
Virtuoso
Feb 16, 2015
StephenB wrote:
There are several posts on this topic, mostly speculative on the details.

Hence my comments in this forum about having a definitive technical source.

StephenB wrote:
What we know for sure is that

(a) BTRFS includes a checksum feature, and that is enabled when bitrot protection is on.
(b) if a checksum error occurs, then bitrot is detected, and the Netgear algorithm attempts to repair it from the RAID parity blocks. This is different from normal RAID repair, which is triggered by a read failure (not a checksum error).

One or two users have reported cases where this algorithm failed to recover data with a correct checksum. So far I have not seen any users reporting a case where it did recover data.

BUT what prompts a checksum error to be discovered?
i.e. Does BTRFS regularly do checksum testing unprompted? Is it only on a data write? Is it on a data read? etc

StephenB
Guru - Experienced User
Feb 16, 2015
There are several posts on this topic, mostly speculative on the details.

What we know for sure is that

(a) BTRFS includes a checksum feature, and that is enabled when bitrot protection is on.
(b) if a checksum error occurs, then bitrot is detected, and the Netgear algorithm attempts to repair it from the RAID parity blocks. This is different from normal RAID repair, which is triggered by a read failure (not a checksum error).

One or two users have reported cases where this algorithm failed to recover data with a correct checksum. So far I have not seen any users reporting a case where it did recover data.
