Forum Discussion
InteXX
Jan 21, 2015 (Luminary)
BitRot Protection Failure - Anything To Be Done?
12:37:25 AM 01/21/2015
Bit rot protection has detected a silent error within /data/FileHistory/Home/C52893-G/Data/$OF/18143/18144 (2015_01_15 11_11_06 UTC).jpg on /dev/sdc3 and cannot correct the error.
My new RN104 has been online only three weeks; I'd hoped to go a long time before seeing something like this. In fact, I wouldn't have expected to see it at all. Clearly BitRot Protection isn't the failsafe I'd thought it was.
Not that I'm too awfully concerned in this particular instance, mind you. The file itself is inconsequential. Upon visual inspection in a graphics editor there doesn't seem to be anything wrong with it at all. In this case at least, it certainly isn't worth going to the trouble to find it and restore it from backup.
I'm just wondering how serious this is in general. Is it a sign of a deeper problem? Might it occur often? Can anything be done to prevent it? Can anything be done to fix it after the fact (beyond restoring from backup)?
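For what it's worth, one way to check after the fact whether a flagged file really differs from its backup copy is a plain checksum comparison. A minimal Python sketch (the two paths below are placeholders, not the actual share layout):

import hashlib

def sha256(path, chunk_size=1 << 20):
    # Hash the file in 1 MiB chunks so large files don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

nas_copy = "/data/FileHistory/example.jpg"   # placeholder: the flagged file on the NAS
backup_copy = "/mnt/backup/example.jpg"      # placeholder: the same file in the backup set

print("files match" if sha256(nas_copy) == sha256(backup_copy) else "files differ")

If the two hashes match, the copy on the NAS is byte-for-byte identical to the backup and there is nothing to restore.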
Note that it happened about twenty minutes following a resync completion after replacing a 2TB drive whose sectors were failing.
Thanks,
Jeff Bowman
Fairbanks, Alaska
18 Replies
- StephenB (Guru - Experienced User)
There's one other user who reported seeing this error on one file too.
At this point I think only Netgear can figure out what is happening. Perhaps download your logs, and PM skywalker (asking if they want to take a look). I'd also leave the file as it is for now.
- mdgm-ntgr (NETGEAR Employee Retired)
If you download your logs what does your initrd.log look like?
When did you enable bit-rot protection on this share?
When did you add this file to the share?
This sounds like it could be a false positive.
- InteXX (Luminary)
mdgm wrote: If you download your logs what does your initrd.log look like?
Not much...
[2014/12/30 08:15:02] Factory default initiated by button!
[2014/12/30 08:15:13] Defaulting to X-RAID2 mode, RAID level 5
[2014/12/30 08:15:41] Factory default initiated on ReadyNASOS 6.1.8 (1398980083).
[2014/12/30 01:31:12] Updated from ReadyNASOS 6.1.8 () to 6.2.2 (ReadyNASOS).
mdgm wrote: When did you enable bit-rot protection on this share?
As soon as I created it, on 12/30/2014.
mdgm wrote: When did you add this file to the share?
About a week ago; Tuesday last.
mdgm wrote: This sounds like it could be a false positive.
Interesting. How to know for sure?
Thanks,
Jeff Bowman
Fairbanks, Alaska
- mdgm-ntgr (NETGEAR Employee Retired)
Do you have the kernel log from the time you got the error?
Can you send in your logs (see the Sending Logs link in my sig)?
- InteXX (Luminary)
mdgm wrote: Do you have the kernel log from the time you got the error?
Hm... there's not much there either:
Logs begin at Tue, 20 Jan 2015 23:33:17 -0900, end at Wed, 21 Jan 2015 03:42:24 -0900.
Jan 21 13:24:45 c52893-n kernel: LeafNets: no IPv6 routers present
mdgm wrote: Can you send in your logs (see the Sending Logs link in my sig)?
Sent.
Thanks,
Jeff Bowman
Fairbanks, Alaska
- Skywalker (NETGEAR Expert)
Are you able to open the file? If so, does it look normal? Fortunately, it should be easy to spot corruption since it's a JPEG image.
- mdgm-ntgr (NETGEAR Employee Retired)
Looking at your smart_history.log I see this:
model realloc_sect realloc_evnt spin_retry_cnt ioedc cmd_timeouts pending_sect uncorrectable_err ata_errors timestamp time
-------------------- ------------ ------------ -------------- ---------- ------------ ------------ ----------------- ---------- ---------- -------------------
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 0 0 0 1419952644 2014-12-30 15:17:24
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 2 0 0 1421596183 2015-01-18 15:49:43
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 8 0 0 1421596309 2015-01-18 15:51:49
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 20 0 0 1421681764 2015-01-19 15:36:04
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 31 0 0 1421681891 2015-01-19 15:38:11
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 41 0 0 1421682393 2015-01-19 15:46:33
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 49 0 0 1421682519 2015-01-19 15:48:39
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 60 0 0 1421682645 2015-01-19 15:50:45
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 67 0 0 1421682770 2015-01-19 15:52:50
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 70 0 0 1421682896 2015-01-19 15:54:56
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 82 0 0 1421683022 2015-01-19 15:57:02
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 94 0 0 1421683148 2015-01-19 15:59:08
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 108 0 0 1421683274 2015-01-19 16:01:14
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 112 0 0 1421683400 2015-01-19 16:03:20
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 125 0 0 1421684154 2015-01-19 16:15:54
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 128 0 0 1421684401 2015-01-19 16:20:01
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 128 128 0 1421731811 2015-01-20 05:30:11
WDC WD40EFRX-68WT0N0 0 0 0 -1 -1 0 0 0 1421805835 2015-01-21 02:03:55
The current pending sector count increases when a sector can't be read. This is a sure sign of disk failure. I see you have now replaced the disk.
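For anyone who wants to keep an eye on those counters themselves, here is a rough sketch that shells out to smartctl (it assumes smartmontools is installed; /dev/sda stands in for the real device):

import subprocess

# smartctl -A prints the SMART attribute table; the raw value is the last column of each row.
out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                     capture_output=True, text=True).stdout

for attr in ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"):
    for line in out.splitlines():
        if attr in line:
            print(attr, "=", line.split()[-1])

A pending sector count that keeps climbing, as in the table above, is the pattern to watch for.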
The WD20EARX is not on the compatibility list (http://kb.netgear.com/app/answers/detail/a_id/20641). I can see that one of your remaining WD20EARX drives has a huge load cycle count, and the other is fairly new but looks to be heading the same way. You may wish to alter the WDIDLE3 timer interval for these disks.
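Besides the DOS WDIDLE3 utility, the idle3-tools package (idle3ctl) is commonly used on Linux to read or change that idle timer, though I can't vouch for it on every firmware revision. To get a feel for how quickly the load cycle counter is climbing, something like this rough sketch (the device name is again a placeholder):

import subprocess

out = subprocess.run(["smartctl", "-A", "/dev/sdb"],   # placeholder device
                     capture_output=True, text=True).stdout

def raw(attr):
    # Return the raw value (last column) for a named SMART attribute, if present.
    for line in out.splitlines():
        if attr in line:
            return int(line.split()[-1])
    return None

hours = raw("Power_On_Hours")
cycles = raw("Load_Cycle_Count")
if hours and cycles:
    print(f"{cycles} load cycles in {hours} power-on hours (~{cycles / hours:.1f} per hour)")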
I think it's quite possible the drive selection played a part in the failure here.
It would be good if you could replace your remaining WD20EARX disks with 4TB WD RED disks when you get the chance (one at a time, wait for resync to complete before you replace the next disk).
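If it helps, you can confirm that a resync has actually finished (rather than relying on the UI alone) by reading /proc/mdstat from a shell on the NAS. A quick sketch, assuming SSH access is enabled:

# /proc/mdstat is standard for Linux md RAID; a rebuilding array shows a
# "recovery" or "resync" progress line.
with open("/proc/mdstat") as f:
    status = f.read()

print(status)
if "recovery" in status or "resync" in status:
    print("md is still rebuilding; wait before swapping the next disk.")
else:
    print("no resync/recovery in progress.")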
RAID, Bitrot protection, unlimited snapshots and anti-virus are great features that help to protect your data but they can never replace the need for backups. Important data should never be trusted to just the one device. There are things like multiple disk failures, fire, flood and theft that can happen.
With bitrot protection we make use of md raid and btrfs checksums to be able to fix some degradation that couldn't be prevented simply using RAID.
- StephenB (Guru - Experienced User)
On other threads Jeff indicated that he is only using the EARX drives temporarily - WDC Reds or Pros are in his future.
I'd replace any drive with pending sector counts > 50 (generally I replace them when the reallocated sector + pending sector counts reach the 20s).
fwiw, I'd still say this is worth looking at. RAID and bitrot protection are all about repairing damage when things go wrong. There's no need for RAID if you have healthy disks. Here we have some bad disk events that should have resulted in a normal RAID repair, and we get what appears to be a false positive detection of bitrot. If it were my code, the first thing I'd suspect is that I had a bug - that somehow the RAID repair and the bitrot detection collided.
- Skywalker (NETGEAR Expert)
It's certainly not an established fact that we have a false positive. That's what I'm trying to find out. I'm not exactly sure what you mean by "RAID repair and the bitrot detection collided", but if you mean that the RAID array re-writing blocks to force reallocation corrupts filesystem checksums, well, that doesn't appear to be possible.
- StephenB (Guru - Experienced User)
I'm glad you are still looking at it. This is new territory for home NAS. It'll be interesting to see how often bitrot is detected and how often it is repaired. I'm hoping posters will continue to report the detection events (and the outcome).
Skywalker wrote: It's certainly not an established fact that we have a false positive. That's what I'm trying to find out. I'm not exactly sure what you mean by "RAID repair and the bitrot detection collided", but if you mean that the RAID array re-writing blocks to force reallocation corrupts filesystem checksums, well, that doesn't appear to be possible.
- Jeff said he was able to view the file in a photo viewer, and it looked OK. So it seemed likely the detection was a false positive, but of course it might have been real. And "false positives" are a bit concerning in their own right; something triggered the detection, after all.
-On "collided" I was thinking that when the processes overlapped there might be race conditions that might create a false positive. I didn't have an exact scenario in mind. But something along the lines of (a) a read error occurs on the checksum test (data or checksum itself), (b) normal RAID repair starts in parallel with bitrot repair.
It seems to me that there are potential issues in scenarios like that: data that one process may already hold in memory is being changed on disk by the competing process (the checksum block being the likely candidate).
Or something simpler. For instance, maybe bitrot repair starts by confirming that the parity block is inconsistent. The main RAID repair perhaps just made it consistent. Then bitrot repair might think there is nothing it can do (though the problem was just fixed).
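For what it's worth, one way to see whether btrfs has actually logged checksum errors on the volume, rather than inferring it from the alert, is to run a scrub and then read the per-device error counters. A rough sketch, assuming shell access and that the data volume is mounted at /data (as in the alert at the top of the thread):

import subprocess

# Foreground scrub: -B waits for completion, which can take hours on a multi-TB volume.
scrub = subprocess.run(["btrfs", "scrub", "start", "-B", "/data"])
print("scrub exit code:", scrub.returncode)   # non-zero can indicate errors were found

# Cumulative per-device counters (corruption_errs, read_io_errs, and so on).
stats = subprocess.run(["btrfs", "device", "stats", "/data"],
                       capture_output=True, text=True)
print(stats.stdout)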