Re: How does bitrot protection actually work?

ScottChapman · ‎2014-12-10

I understand the concept, but am curious how it is actually implemented on 6.2.0

Nhellie · ‎2014-12-10

As per release notes, "Support bitrot data protection. Automatically detect and correct corruption due to media degradation.", for me this means that as soon as you enable BitRot protection, the NAS will scan-detect-fix corruptions on the data.

ScottChapman · ‎2014-12-10

Yea, I guess I was more curious what the mechanism is; is it a BTRFS feature? Something else?

StephenB · ‎2014-12-10

Nhellie wrote:
As per release notes, "Support bitrot data protection. Automatically detect and correct corruption due to media degradation.", for me this means that as soon as you enable BitRot protection, the NAS will scan-detect-fix corruptions on the data.

Yes. But the "how" hasn't been disclosed, and that is what Scott is asking. We know it's not using the btrfs experimental modes, so it appears to be a proprietary technique Netgear implemented that does something similar.

It would be useful to have more information, so people will have a better idea what it can/can't do. It could be quite useful in some circumstances (for instance reducing data loss when disk cloning is needed). But it's hard to know without some explanation.

ScottChapman · ‎2014-12-10

Yea, thanks. exactly what I was getting at...

snakyjake · ‎2014-12-20

I'd like to know too. I currently run checksums on all my files, and store the results. Periodically I recalc the checksums, and compare against the stored checksum. This at least tells me there's been bit corruption. The next trick is to restore a good copy. You know you have a good copy when the copy matches the original checksum. What you don't want to have done is backed up the rotted file, and all you have left is that rotted file. ReadyNAS has a versioning backup feature. So as time goes on, and bits are changed, ReadyNAS probably has multiple backups of both the good and bad file. So I'm guessing ReadyNAS either prevented that from happening in the first place, or is able to locate the good file (by comparing checksums).
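
The manual routine described above (store checksums, recalculate later, compare) can be sketched roughly as follows. This is only an illustration of the idea, not anything ReadyNAS does internally; the function names and the choice of SHA-256 are assumptions. Note that a legitimate edit also shows up as a mismatch, so this detects change, not specifically rot.

```python
import hashlib
import os

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest

def find_changed(root, manifest):
    """Return files that are missing or whose checksum no longer matches."""
    changed = []
    for rel, expected in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_sha256(full) != expected:
            changed.append(rel)
    return changed
```

The manifest would be built once and stored somewhere safe; a later run of find_changed against the same folder lists the files to restore from a copy that still matches the stored checksum.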

This is a pretty important topic for me. I'm concerned a lot about silent errors.

It would be great if Netgear would do a good write up demonstrating some scenarios. I don't need to understand how, but I need to see the scenario proven. For example, use a hex editor and change the file (but keep the file size/date the same). ReadyNAS should detect the change, and should restore the file. The other scenario is the reverse...what happens when the stored checksum is corrupted?

Jake

snakyjake · ‎2014-12-20

After looking at the different models, I don't have a lot of confidence in the cheaper non-ECC models. For a prosumer, it gets quite expensive.

StephenB · ‎2014-12-21

snakyjake wrote:
After looking at the different models, I don't have a lot of confidence in the cheaper non-ECC models. For a prosumer, it gets quite expensive.

Well, of course the ECC RAM costs more for Netgear to buy, and on top of that you need other system components that support the RAM. The lower end of their market is price sensitive (it always is), so I think from a business point of view it likely isn't viable for them to include ECC there.

For better or worse, most consumers simply aren't as concerned about silent bit rot as you are, so ECC is not a must-have feature for them. If enterprises started demanding it in all devices, that would likely change - but I haven't seen much that suggests that will happen anytime soon.

That said, I think Netgear should describe what they are doing in more detail.

nsne · ‎2014-12-28

I'm still a bit puzzled about the basics of CoW. I understand it has the potential to be A Good Thing®, but I'm not sure what the optimal operating conditions would be.

For example, is CoW good — or even necessary — for media shares with lots of large (ca. 2GB) video files that don't get modified very often? Is it for documents? How about an iTunes share that contains a little bit of everything and is accessed and updated (with apps, podcasts) daily? How much of a performance hit will it bring to a 314?

I disabled bit-rot protection across the board long ago when I was trying to eke some decent performance out of the 314, but I'd also like to ensure data integrity if the feature set allows it.

mdgm-ntgr · ‎2014-12-28

Features such as CoW and bitrot protection can be very useful for files that are not modified often. Bitrot protection provides good protection against media degradation.

I guess through a bit of trial and error you may find what works for you. Obviously there is a performance hit. So whether you use them does depend on how much you value extra protection for your data compared with performance.

With settings at a per share level you can choose which shares to use these features with.

It would be advisable to make use of scheduled volume maintenance.

StephenB · ‎2014-12-28

Wikipedia (http://en.wikipedia.org/wiki/Copy-on-write) wrote:
Copy-on-write (sometimes referred to as "COW") is an optimization strategy used in computer programming. Copy-on-write stems from the understanding that when multiple separate tasks use initially identical copies of some information (i.e., data stored in computer memory or disk storage), treating it as local data that they may occasionally need to modify, then it is not necessary to immediately create separate copies of that information for each task. Instead they can all be given pointers to the same resource, with the provision that on the first occasion where they need to modify the data, they must first create a local copy on which to perform the modification (the original resource remains unchanged).

This is the core idea. When snapshots are taken, the snapshot folder and the main folder both initially have pointers to the same data on the disk. The btrfs wiki calls this "cloning" to distinguish it from linux hard links.

When the file is modified, then the file is fragmented, so the unchanged blocks remain referenced by both folders. For the blocks that have been changed, the original block ends up referenced only by the snapshot, and the changed block is referenced by the main folder.

When you have multiple snapshots, the idea is simply extended to cover them all.
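
As a toy illustration of the pointer sharing described above (a simplified model only, not how btrfs actually lays out extents; BlockStore and its method are made-up names):

```python
class BlockStore:
    """Toy disk: block id -> bytes. Blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}
        self.next_id = 0

    def write_new(self, data):
        """Store data in a fresh block and return its id."""
        block_id = self.next_id
        self.blocks[block_id] = data
        self.next_id += 1
        return block_id

store = BlockStore()

# The file in the main share is just a list of pointers to blocks.
main = [store.write_new(b"AAAA"), store.write_new(b"BBBB"), store.write_new(b"CCCC")]

# Taking a snapshot copies the pointers, not the data.
snapshot = list(main)

# Modifying the second block writes a new block and repoints only the main copy.
main[1] = store.write_new(b"bbbb")

shared = [i for i, (m, s) in enumerate(zip(main, snapshot)) if m == s]
print("shared block positions:", shared)    # [0, 2]
print("blocks on disk:", len(store.blocks))  # 4
```

Only one extra block is stored for the change; the snapshot still points at the original block, and the main copy's blocks are no longer contiguous, which is the fragmentation mentioned above.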

CoW is not limited to snapshots; the cp command has a --reflink option with the same properties. Initially the two copies share the same data blocks, but as the files are modified only the unmodified blocks remain in common, resulting in fragmentation but efficient use of disk space.

I'm not sure why Netgear linked bit-rot protection to CoW; it is an odd mixture. From what little has been posted here, bit-rot protection depends on btrfs file checksums and RAID, not CoW.

The obvious use of CoW is to create snapshots, which is a space-efficient mechanism that allows you to roll back to previous versions of the files. If you have large files with a few differences between them, then CoW could be used (e.g. cp --reflink) to reduce disk space. If you have a folder structure that contains source code, CoW is one way to create a development branch - again one that is space efficient. It isn't well suited to files that are being continuously updated (for instance torrent files being downloaded, or databases that are always changing). Snapshots and cp --reflink are very fast operations; the performance hit happens later on when the files are modified.

BaJohn · ‎2015-02-01

StephenB:- I get the impression that you would recommend that 'bitrot protection' be turned ON. (Please say YES or NO)
If it is ON, does this cause extra wear and tear on the disks?
If it is OFF, are the snapshots real copies of the data rather than 'links' as for 'bitrot protection' ON?
I currently have 'bitrot protection' OFF and snapshots of 2GB are almost instantaneous.
I liked your comments, but am still not entirely sure what goes on, and the difference between when the facility is ON and OFF.
Confused!

StephenB · ‎2015-02-01

I leave it on. Even if file errors can't be repaired, it will at least warn when it finds something wrong. That might be a false alarm, but still it seems better to get the warning. Though honestly, I think that silent bitrot is quite rare, and it's not something I'm particularly concerned about in my OS 4 NAS. One reason I leave it on is just to see if it ever finds anything...

The cost of bitrot protection itself is that btrfs checksums are enabled. So they are checked during file reads, and updated during file writes. There might be a small performance hit on the RN516. It's easy enough to measure any performance impact, since you can create two shares and enable it on one but not the other.

Creating snapshots is always near-instantaneous. Any performance problems they create happen later on, when files in the main share are modified. Then files in the main share get fragmented - and that fragmentation eventually migrates to the snapshots. I leave daily snapshots enabled on most shares despite that possibility. But most of my files aren't modified in place, and I do have defrag, balance, scrubs, and disk checks scheduled in the maintenance schedule (each is run once every three months on each volume).

But if you do have files that are frequently modified in place (downloading torrents, SQL databases, or something like that), then you will want to turn bitrot protection off on those shares. That's because Netgear links CoW with BTRFS checksum protection; the GUI doesn't control them independently. The problem is the CoW fragmentation, not the checksum overhead.

To your specific questions:
-bitrot protection by itself does not cause extra wear and tear on the disk. If you end up with fragmentation, then the disks will do more seeking when you read the files, and the balance and defrag maintenance tasks will likely take longer to complete. I don't think that hurts disk life, at least I've never seen studies that claim that.

-snapshots always share file datablocks with each other and the parent share. Bitrot protection doesn't change that. If bitrot protection is off, then CoW is turned on before every snapshot is taken, and turned off after the snapshot completes. If bitrot protection is on, then CoW is on all the time.

Scram · ‎2015-02-01

To start with: bitrot protection (knowing it is a feature of btrfs) was one of my selection criteria for a ReadyNas 104.

I was quite disappointed when I realized it wasn't supported in ReadyNAS OS before 6.2, which is what I started with. I fiddled a bit at the system level to see what I could do, and manually rebalanced the btrfs filesystem behind the JBOD X-RAID mode of the ReadyNAS to get BTRFS RAID-1, which supports bit-rot protection.

The bit-rot protection that is now available in ReadyNAS OS 6.2 is very odd and doesn't make sense to me. I currently have a JBOD volume across two 3TB drives.

The UI allows me to turn on bit-rot protection at the share level. I can't understand how this could work at all: it can give me a message if a file is corrupted, but it can't protect me against it and rebuild mismatching blocks, as no redundancy is enabled at the moment.

I tried to peek behind the user interface to find out what is going on, but really, I don't know exactly.

There is an interesting utility and daemon: mdcsrepair and mdcsrepaird. It seems to be tied into the Linux md system somehow. The name could mean "md checksum repair".
This would be basically what bitrot protection stands for, but I don't understand how it can be enabled per subvolume of the btrfs filesystem, when the md volumes are beneath the btrfs filesystem.

Some clarification of the technical point of "bit-rot-protection" in ReadyNas would be greatly appreciated!

mdgm-ntgr · ‎2015-02-01

You need to be using a md raid level that provides redundancy.

Our bit rot protection requires redundancy at the md raid layer whereas BTRFS by itself would require redundancy on the BTRFS layer. We achieve the same thing but in a different way.

We use btrfs checksums and use the md layer to find the correct block.
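
A rough sketch of that idea, for readers trying to picture it: a checksum flags a bad block, and the redundant copy supplies the good one. This is not Netgear's code (the implementation has not been published). The MirroredVolume class, the two-way mirror, and the use of zlib.crc32 as a stand-in checksum are all assumptions made for illustration.

```python
import zlib

class MirroredVolume:
    """Toy two-way mirror with one stored checksum per block."""

    def __init__(self, num_blocks, block_size=4):
        empty = bytes(block_size)
        self.copies = [[empty] * num_blocks for _ in range(2)]
        self.checksums = [zlib.crc32(empty)] * num_blocks

    def write(self, index, data):
        """Write the block to both mirror copies and record its checksum."""
        for copy in self.copies:
            copy[index] = data
        self.checksums[index] = zlib.crc32(data)

    def read(self, index):
        """Return the first copy that matches the checksum and repair the other."""
        expected = self.checksums[index]
        for copy in self.copies:
            if zlib.crc32(copy[index]) == expected:
                good = copy[index]
                for other in self.copies:
                    if other[index] != good:
                        other[index] = good  # overwrite the rotted copy
                return good
        raise OSError(f"block {index}: no copy matches the stored checksum")

vol = MirroredVolume(num_blocks=2)
vol.write(0, b"data")
vol.copies[0][0] = b"dbta"       # simulate silent corruption on one disk
assert vol.read(0) == b"data"    # the read succeeds from the good copy
assert vol.copies[0][0] == b"data"  # and the bad copy has been rewritten
```

With no redundant copy (JBOD, for example), the same check can still detect the mismatch, but there is nothing to repair from, so the read in this sketch would end in the error.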

Bitrot protection is suitable for some use cases and can be quite resource intensive. So it is best to use it only on shares where it is appropriate.

Bitrot protection works best with data that has minimal in-place modifications. It is there to protect against degradation.

StephenB · ‎2015-02-02

mdgm wrote:
Bitrot protection is suitable for some use cases and can be quite resource intensive. So it is best to use it only on shares where it is appropriate.

Overall, I think it's frustrating for RN100 series users to be told repeatedly "hey, all these features in the GUI - they aren't for you...". Netgear should provide some real data on the performance impacts of checksums, AntiVirus, and RAID-6 on all platforms. Then users can figure out if the benefits outweigh the costs for them, and make better choices on their platform investments.

Do you mean the repair is resource intensive, or the protection itself (checksums + CoW)?

I'm not seeing much impact on my RN102, which leads me to think that the issue you are talking about is CoW fragmentation. That of course is not easily quantified - and fragmentation kills performance on all platforms.

mdgm-ntgr · ‎2015-02-02

Since repair doesn't run constantly, I would think it is the CoW fragmentation, as you mentioned.

StephenB · ‎2015-02-02

Fragmentation doesn't happen more on slower platforms, but defrag time of course is longer. I think all OS6 users need to be thoughtful about CoW. If there's a way to report share-by-share fragmentation (and maybe even have a fragmentation alert) in OS6, that would be a good thing.

Generally speaking, on the lower end NAS you do need to trade off performance against features. That's the "cost" of getting the less expensive platform.

But there are lots of scenarios where this is acceptable for home users. If you are using WiFi, powerline, or fast ethernet then you are usually network bound anyway, so the performance hit isn't something you'll experience. And speed isn't the only consideration. Based on posts here, I'd personally avoid RAID-6 and iSCSI LUNs on the RN100 series. But I see no reason to avoid the other features, as long as people understand that performance will drop if you turn on too much stuff.

Netgear already recommends RN300 or better for business users, which makes perfect sense to me. Some businesses can probably get by with less, but slower performance does translate into $$$ for most businesses.

BaJohn · ‎2015-02-02

mdgm wrote:
You need to be using a md raid level that provides redundancy.

Does that include RAID10, like what I have on my RN516 with 6 x 4TB?
Or are you talking about RAID5 and RAID6 only?

StephenB · ‎2015-02-02

BaJohn wrote:
mdgm wrote:
You need to be using a md raid level that provides redundancy.

Does that include RAID10, like what I have on my RN516 with 6 x 4TB?
Or are you talking about RAID5 and RAID6 only?

Just to clarify your question - I think you are actually wanting to know if the bitrot protection is supported for RAID-10?

BaJohn · ‎2015-02-02

StephenB wrote:
...I do have defrag, balance, scrubs, and disk checks scheduled in the maintenance schedule (each is run once every three months on each volume).

Could you advise the optimum order for the
1. Disk Defragmentation
2. Disk Balance (Not certain what this means - comment please?)
3. Data Scrubbing (Incorrectly described as Disk Scrubbing in the forums, which is something else entirely)
4. Disk Checks (CHKDSK ?)
if you do these end to end (or a day apart, say).
OR
do you spread them randomly/evenly through your 3 month period?

BaJohn · ‎2015-02-02

StephenB wrote:
BaJohn wrote:
mdgm wrote:
You need to be using a md raid level that provides redundancy.

Does that include RAID10, like what I have on my RN516 with 6 x 4TB?
Or are you talking about RAID5 and RAID6 only?
Just to clarify your question - I think you are actually wanting to know if the bitrot protection is supported for RAID-10?

By default the bitrot protection (BRP) on my RN516 is OFF.
I assumed that as the button was there, it was supported regardless of RAID although it MAY not make sense for certain RAID levels.
I was trying to clarify whether RAID-10 has redundancy in the way you mean it, and hence it is a good thing to have BRP ON.
Perhaps that is the question to be answered 🙂

StephenB · ‎2015-02-02

RAID-10 certainly has redundancy. Netgear will need to confirm that their bitrot feature supports RAID-10 (seems likely though).

mdgm-ntgr · ‎2015-02-02

RAID-10 would work.

BaJohn · ‎2015-02-16

I have read and re-read this post many times and it is full of useful info.
I am just concerned about the following quote.

Nhellie wrote:
As per release notes, "Support bitrot data protection. Automatically detect and correct corruption due to media degradation.", for me this means that as soon as you enable BitRot protection, the NAS will scan-detect-fix corruptions on the data.

To me this reads as proactively scanning the data, and I am not convinced this is how it works.
i.e. If I put data on my ReadyNAS (with snapshots) and never updated it for 5 years, would bitrot be detected?
The above suggests yes, I think it might be no, BUT I really don't know :?