Forum Discussion
InteXX
Jan 21, 2015 (Luminary)
BitRot Protection Failure - Anything To Be Done?
12:37:25 AM 01/21/2015
Bit rot protection has detected a silent error within /data/FileHistory/Home/C52893-G/Data/$OF/18143/18144 (2015_01_15 11_11_06 UTC).jpg on /dev/sdc3 and cannot correct the error.
My new RN104 has been online only three weeks; I'd hoped to go a long time before seeing something like this. In fact, I wouldn't have expected to see it at all. Clearly BitRot Protection isn't the failsafe I'd thought it was.
Not that I'm too awfully concerned in this particular instance, mind you. The file itself is inconsequential. Upon visual inspection in a graphics editor there doesn't seem to be anything wrong with it at all. In this case at least, it certainly isn't worth going to the trouble to find it and restore it from backup.
I'm just wondering how serious this is in general. Is it a sign of a deeper problem? Might it occur often? Can anything be done to prevent it? Can anything be done to fix it after the fact (beyond restoring from backup)?
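For what it's worth, one way to check after the fact whether a flagged file really differs from its backup copy is a plain checksum comparison. A minimal Python sketch (the two paths below are placeholders, not the actual share layout):

import hashlib

def sha256(path, chunk_size=1 << 20):
    # Hash the file in 1 MiB chunks so large files don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

nas_copy = "/data/FileHistory/example.jpg"   # placeholder: the flagged file on the NAS
backup_copy = "/mnt/backup/example.jpg"      # placeholder: the same file in the backup set

print("files match" if sha256(nas_copy) == sha256(backup_copy) else "files differ")

If the two hashes match, the copy on the NAS is byte-for-byte identical to the backup and there is nothing to restore.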
Note that it happened about twenty minutes following a resync completion after replacing a 2TB drive whose sectors were failing.
Thanks,
Jeff Bowman
Fairbanks, Alaska
18 Replies
- StephenB (Guru - Experienced User)
There's one other user who reported seeing this error on one file too.
At this point I think only Netgear can figure out what is happening. Perhaps download your logs, and PM skywalker (asking if they want to take a look). I'd also leave the file as it is for now.
- mdgm-ntgr (NETGEAR Employee Retired)
If you download your logs what does your initrd.log look like?
When did you enable bit-rot protection on this share?
When did you add this file to the share?
This sounds like it could be a false positive.
- InteXX (Luminary)
mdgm wrote: If you download your logs what does your initrd.log look like?
Not much...
[2014/12/30 08:15:02] Factory default initiated by button!
[2014/12/30 08:15:13] Defaulting to X-RAID2 mode, RAID level 5
[2014/12/30 08:15:41] Factory default initiated on ReadyNASOS 6.1.8 (1398980083).
[2014/12/30 01:31:12] Updated from ReadyNASOS 6.1.8 () to 6.2.2 (ReadyNASOS).
mdgm wrote: When did you enable bit-rot protection on this share?
As soon as I created it, on 12/30/2014.
mdgm wrote: When did you add this file to the share?
About a week ago; Tuesday last.
mdgm wrote: This sounds like it could be a false positive.
Interesting. How to know for sure?
Thanks,
Jeff Bowman
Fairbanks, Alaska
- mdgm-ntgr (NETGEAR Employee Retired)
Do you have the kernel log from the time you got the error?
Can you send in your logs (see the Sending Logs link in my sig)?
- InteXX (Luminary)
mdgm wrote: Do you have the kernel log from the time you got the error?
Hm... there's not much there either:
Logs begin at Tue, 20 Jan 2015 23:33:17 -0900, end at Wed, 21 Jan 2015 03:42:24 -0900.
Jan 21 13:24:45 c52893-n kernel: LeafNets: no IPv6 routers present
mdgm wrote: Can you send in your logs (see the Sending Logs link in my sig)?
Sent.
Thanks,
Jeff Bowman
Fairbanks, Alaska
- Skywalker (NETGEAR Expert)
Are you able to open the file? If so, does it look normal? Fortunately, it should be easy to spot corruption since it's a JPEG image.
- mdgm-ntgr (NETGEAR Employee Retired)
Looking at your smart_history.log I see this:
model realloc_sect realloc_evnt spin_retry_cnt ioedc cmd_timeouts pending_sect uncorrectable_err ata_errors timestamp time
-------------------- ------------ ------------ -------------- ---------- ------------ ------------ ----------------- ---------- ---------- -------------------
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 0 0 0 1419952644 2014-12-30 15:17:24
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 2 0 0 1421596183 2015-01-18 15:49:43
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 8 0 0 1421596309 2015-01-18 15:51:49
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 20 0 0 1421681764 2015-01-19 15:36:04
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 31 0 0 1421681891 2015-01-19 15:38:11
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 41 0 0 1421682393 2015-01-19 15:46:33
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 49 0 0 1421682519 2015-01-19 15:48:39
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 60 0 0 1421682645 2015-01-19 15:50:45
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 67 0 0 1421682770 2015-01-19 15:52:50
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 70 0 0 1421682896 2015-01-19 15:54:56
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 82 0 0 1421683022 2015-01-19 15:57:02
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 94 0 0 1421683148 2015-01-19 15:59:08
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 108 0 0 1421683274 2015-01-19 16:01:14
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 112 0 0 1421683400 2015-01-19 16:03:20
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 125 0 0 1421684154 2015-01-19 16:15:54
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 128 0 0 1421684401 2015-01-19 16:20:01
WDC WD20EARX-00ZUDB0 0 0 0 -1 -1 128 128 0 1421731811 2015-01-20 05:30:11
WDC WD40EFRX-68WT0N0 0 0 0 -1 -1 0 0 0 1421805835 2015-01-21 02:03:55
The current pending sector count increases when a sector can't be read. This is a sure sign of disk failure. I see you have now replaced the disk.
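For anyone who wants to keep an eye on those counters themselves, here is a rough sketch that shells out to smartctl (it assumes smartmontools is installed; /dev/sda stands in for the real device):

import subprocess

# smartctl -A prints the SMART attribute table; the raw value is the last column of each row.
out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                     capture_output=True, text=True).stdout

for attr in ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"):
    for line in out.splitlines():
        if attr in line:
            print(attr, "=", line.split()[-1])

A pending sector count that keeps climbing, as in the table above, is the pattern to watch for.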
The WD20EARX is not on the compatibility list (http://kb.netgear.com/app/answers/detail/a_id/20641). I can see that one of your remaining WD20EARX drives has a huge load cycle count, and the other is fairly new but looks to be heading the same way. You may wish to alter the WDIDLE3 timer interval for these disks.
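Besides the DOS WDIDLE3 utility, the idle3-tools package (idle3ctl) is commonly used on Linux to read or change that idle timer, though I can't vouch for it on every firmware revision. To get a feel for how quickly the load cycle counter is climbing, something like this rough sketch (the device name is again a placeholder):

import subprocess

out = subprocess.run(["smartctl", "-A", "/dev/sdb"],   # placeholder device
                     capture_output=True, text=True).stdout

def raw(attr):
    # Return the raw value (last column) for a named SMART attribute, if present.
    for line in out.splitlines():
        if attr in line:
            return int(line.split()[-1])
    return None

hours = raw("Power_On_Hours")
cycles = raw("Load_Cycle_Count")
if hours and cycles:
    print(f"{cycles} load cycles in {hours} power-on hours (~{cycles / hours:.1f} per hour)")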
I think it's quite possible the drive selection played a part in the failure here.
It would be good if you could replace your remaining WD20EARX disks with 4TB WD RED disks when you get the chance (one at a time, wait for resync to complete before you replace the next disk).
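If it helps, you can confirm that a resync has actually finished (rather than relying on the UI alone) by reading /proc/mdstat from a shell on the NAS. A quick sketch, assuming SSH access is enabled:

# /proc/mdstat is standard for Linux md RAID; a rebuilding array shows a
# "recovery" or "resync" progress line.
with open("/proc/mdstat") as f:
    status = f.read()

print(status)
if "recovery" in status or "resync" in status:
    print("md is still rebuilding; wait before swapping the next disk.")
else:
    print("no resync/recovery in progress.")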
RAID, Bitrot protection, unlimited snapshots and anti-virus are great features that help to protect your data but they can never replace the need for backups. Important data should never be trusted to just the one device. There are things like multiple disk failures, fire, flood and theft that can happen.
With bitrot protection we make use of md raid and btrfs checksums to be able to fix some degradation that couldn't be prevented simply using RAID.
- StephenB (Guru - Experienced User)
On other threads Jeff indicated that he is only using the EARX drives temporarily - WDC Reds or Pros are in his future.
I'd replace any drive with pending sector counts > 50 (generally I replace them when the reallocated sector + pending sector counts reach the 20s).
fwiw, I'd still say this is worth looking at. RAID and bitrot protection are all about repairing damage when things go wrong. There's no need for RAID if you have healthy disks. Here we have some bad disk events that should have resulted in a normal RAID repair, and we get what appears to be a false positive detection of bitrot. If it were my code, the first thing I'd suspect is that I had a bug - that somehow the RAID repair and the bitrot detection collided.
- Skywalker (NETGEAR Expert)
It's certainly not an established fact that we have a false positive. That's what I'm trying to find out. I'm not exactly sure what you mean by "RAID repair and the bitrot detection collided", but if you mean that the RAID array re-writing blocks to force reallocation corrupts filesystem checksums, well, that doesn't appear to be possible.
- StephenB (Guru - Experienced User)
I'm glad you are still looking at it. This is new territory for home NAS. It'll be interesting to see how often bitrot is detected and how often it is repaired. I'm hoping posters will continue to report the detection events (and the outcome).
Skywalker wrote: It's certainly not an established fact that we have a false positive. That's what I'm trying to find out. I'm not exactly sure what you mean by "RAID repair and the bitrot detection collided", but if you mean that the RAID array re-writing blocks to force reallocation corrupts filesystem checksums, well, that doesn't appear to be possible.
- Jeff said he was able to view the file in a photo viewer, and it looked OK. So it seemed likely the detection was a false positive, but of course it might have been real. And "false positives" are a bit concerning in their own right; something triggered the detection, after all.
-On "collided" I was thinking that when the processes overlapped there might be race conditions that might create a false positive. I didn't have an exact scenario in mind. But something along the lines of (a) a read error occurs on the checksum test (data or checksum itself), (b) normal RAID repair starts in parallel with bitrot repair.
It seems to me that there are potential issues in scenarios like that: data that one process may already hold in memory is being changed on disk by the competing process (the checksum block being the likely candidate).
Or something simpler. For instance, maybe bitrot repair starts by confirming that the parity block is inconsistent. The main RAID repair perhaps just made it consistent. Then bitrot repair might think there is nothing it can do (though the problem was just fixed).
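For what it's worth, one way to see whether btrfs has actually logged checksum errors on the volume, rather than inferring it from the alert, is to run a scrub and then read the per-device error counters. A rough sketch, assuming shell access and that the data volume is mounted at /data (as in the alert at the top of the thread):

import subprocess

# Foreground scrub: -B waits for completion, which can take hours on a multi-TB volume.
scrub = subprocess.run(["btrfs", "scrub", "start", "-B", "/data"])
print("scrub exit code:", scrub.returncode)   # non-zero can indicate errors were found

# Cumulative per-device counters (corruption_errs, read_io_errs, and so on).
stats = subprocess.run(["btrfs", "device", "stats", "/data"],
                       capture_output=True, text=True)
print(stats.stdout)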