
Forum Discussion

Dewdman42
Virtuoso
Dec 10, 2024

Lots of bitrot errors, advice needed

My ReadyNAS is functioning for the most part, but I recently found out that IDrive had not been backing it up for quite a while, months in fact. I fixed that and finally ended up re-backing up the whole thing, and this time the ReadyNAS emailed me a number of bit rot errors while the backup was running. The files reported had not been accessed in years. The errors say that bit rot detection could not correct them.

 

So, a couple of questions. First, what causes bit rot in files that haven't been accessed in a long time, and is there anything I can do periodically to force the ReadyNAS to refresh checksums (or whatever it needs to refresh) so that this won't happen again? Luckily the files I am losing now aren't that important, but some other data on this ReadyNAS is quite important, and uncorrectable bit rot there would be a disaster.

 

Previously I was down to 5% or less of free space. Could that be a reason the automatic refreshing of data didn't happen earlier in some way? I have since freed up a lot of space, but I'm just asking.

 

I didn't see any errors reported in the ReadyNAS logs other than these bit rot detections now suddenly appearing out of the blue as I do this large backup of my volume to IDrive. Is there some deeper level of SMART info I should look at to make sure my drives are not failing?

 

I don't know what to make of this other than I am losing some 100 GB of data that, according to the ReadyNAS, is bit rotten and probably not usable. But I don't know why it didn't find these problems much sooner and move the data to good sectors, or whatever the filesystem is supposed to do with bit rot detection turned on. It has been turned on since the start, and yes, I have it run all the maintenance tasks regularly as well.

21 Replies


  • Dewdman42 wrote:

    I have it regularly do all the maintenance tasks as well.

    Generally, the theory on bit rot is that data on a disk is somehow silently changing.  The BTRFS checksums and RAID parity blocks then pick up the errors.  Netgear hasn't published details on how they attempt the repair, but I think they try all possible combinations of RAID reconstruction (first assuming that disk 1 is in error, then assuming disk 2, etc.), looking for a combination where the BTRFS checksums become correct.

     

    The "silently changing" bit could be induced by memory errors in cached writes, or it could simply be a failing disk.  Both would be rare. The disk has to fail in a way that results in errored read that is not detected by the CRC code on the drive.  Memory errors do happen sometimes (cosmic rays have been known to induce them).  I don't think you can rule out the full volume as a cause, since BTRFS can behave badly when the volume gets full.  Lost writes due to an unclean shutdown might also induce this problem.

     

    If these files are only occasionally read, then it is not surprising that it took a long time to detect the errors.  But a scrub should have found them.  How often are you running the scrubs?
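    For reference, a scrub can also be started and monitored from an ssh session rather than waiting for the scheduled one; again just a sketch, assuming the volume is mounted at /data:

    btrfs scrub start /data      # kick off a background scrub of the whole volume
    btrfs scrub status /data     # check progress and any checksum errors found so far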

     

    Also, is this happening on your RN524?  That ships with ECC memory, so the odds of an undetected memory error would be lower on the RN524 than they are on the Ultra 2.

    I'd run a scrub now, and see if any more of these errors turn up.  You could alternatively try to read every file (perhaps using ssh to copy each one to /dev/null).
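    If you go the read-everything route, a loop along these lines (run as root over ssh) will force a read of every file and log any that fail; a rough sketch, assuming your shares live under /data:

    # read every file and discard the data; a BTRFS checksum failure surfaces
    # as a read error on that file, which dd reports with a non-zero exit code
    find /data -type f -print0 |
    while IFS= read -r -d '' f; do
        dd if="$f" of=/dev/null bs=1M status=none 2>/dev/null \
            || printf 'READ FAILED: %s\n' "$f" >> /tmp/read-errors.log
    done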

    • Dewdman42
      Virtuoso

      Everything you're saying makes sense, as usual.  I have had scrubbing scheduled quarterly for years.  So I don't know if that means the scrub didn't catch and fix some problems, or if it means the problem is new.

       

      I am still finishing up the IDrive backup, which will take another day or so, but after that I will try copying everything to /dev/null to see if that catches the same bit rot errors, before I actually remove the questionable files.

       

      Are there any particular SMART logs I should have a look at before presuming the drives aren't failing?  It does seem like the most likely culprit may be that I kept the volume nearly full for a while, with less than 5% free space; it's now at 30% free.  If that somehow contributed to the RAID getting out of sync in a way that can no longer be corrected, that is rather concerning in and of itself.  Thank goodness I got IDrive fixed and it's nearly backed up, so I will lose these bit rotted files, I guess.  None of them appear to be anything important that I can't afford to lose, so that's fine, but I still have low confidence in the system now.  And I am wondering if there is anything I should routinely do to find and correct bit rot before it becomes uncorrectable.
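      For my own reference, this is roughly the SMART check I plan to run over ssh; a sketch only, and the /dev/sd[a-d] device names are an assumption that may differ on other units:

      # dump the key SMART attributes for each internal drive; non-zero or growing
      # Reallocated_Sector_Ct, Current_Pending_Sector or Offline_Uncorrectable
      # counts are the usual signs of a failing disk
      for d in /dev/sd[a-d]; do
          echo "=== $d ==="
          smartctl -A "$d" | egrep 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC'
      done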

       

      Scrubbing, I'm presuming, goes through and attempts to detect these kinds of issues and theoretically correct them, but unless the volume got messed up suddenly and recently across dozens of files, I think the system should have found and corrected this a long time ago.

       

      Yes, it's my RN524.  I did upgrade the RAM years ago, but as I recall I also used ECC RAM.  Improper shutdowns may have occurred from power outages somewhere along the line, but not often.  Anyway, it's still a mystery why the bit rot wasn't detected and corrected a long time ago.

       

      • Dewdman42
        Virtuoso

        Well, looking back further in the logs, I do find something from 18 months ago that must be the reason for this; see below.  I am not actually sure what happened there: a drive was degraded and then resynced and came back online again.  I'm not sure why it was degraded, and in the 18 months since it came back online there have been no other errors, but something about that occurrence must have messed up some files or checksums in the process.

         

        from status.log:

         

        [23/07/05 03:00:02 MDT] notice:volume:LOGMSG_SCRUBSTARTED_VOLUME Scrub started for volume data.
        [23/07/07 08:36:18 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [224] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 11 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 09:32:08 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [244] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 12 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 12:29:10 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [283] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 13 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 13:37:54 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [379] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 14 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 14:50:29 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [402] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 15 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 16:36:42 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [459] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 16 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 17:41:09 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [564] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 17 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 18:03:42 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [621] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 18 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 19:40:25 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [639] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 19 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 20:51:54 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [650] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 20 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 21:01:26 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [684] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 21 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 22:15:18 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [810] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 22 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/07 22:31:09 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [762] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 23 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/08 00:09:13 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [763] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 24 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/08 00:41:15 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [764] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 25 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/08 01:36:56 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [765] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 26 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/08 03:56:12 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [777] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 27 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/08 04:16:26 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [786] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 28 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/08 04:41:14 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [783] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 29 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/08 15:06:30 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [782] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 30 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/09 03:06:33 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [789] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 31 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/09 03:22:22 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [797] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 32 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/09 03:52:40 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [878] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 33 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/09 03:52:54 MDT] warning:volume:LOGMSG_HEALTH_VOLUME Volume data health changed from Redundant to Degraded.
        [23/07/09 03:53:02 MDT] err:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 2 (Internal) changed state from ONLINE to FAILED.
        [23/07/09 04:24:41 MDT] notice:volume:LOGMSG_SCRUBCOMPLETE_VOLUME Scrub completed for volume data'.
        [23/07/09 08:36:25 MDT] notice:system:LOGMSG_SYSTEM_HALT The system is shutting down.
        [23/07/13 16:25:12 MDT] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 10% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
        [23/07/13 16:25:12 MDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
        [23/07/13 16:25:13 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [933] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 34 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
        [23/07/13 16:25:17 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
        [23/07/13 16:25:44 MDT] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
        [23/07/13 16:32:25 MDT] notice:system:LOGMSG_SYSTEM_HALT The system is shutting down.
        [23/07/13 16:35:21 MDT] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 10% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
        [23/07/13 16:35:22 MDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
        [23/07/13 16:35:26 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
        [23/07/13 16:36:27 MDT] notice:system:LOGMSG_SYSTEM_HALT The system is shutting down.
        [23/07/13 16:46:05 MDT] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 10% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
        [23/07/13 16:46:06 MDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
        [23/07/13 16:46:09 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
        [23/07/13 16:46:56 MDT] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
        [23/07/14 01:00:17 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
        [23/07/14 17:15:29 MDT] notice:volume:LOGMSG_RESILVERCOMPLETE_VOLUME Volume data is resynced.
        [23/07/14 17:15:35 MDT] notice:volume:LOGMSG_HEALTH_VOLUME Volume data health changed from Degraded to Redundant.
        [23/07/14 17:15:40 MDT] notice:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 2 (Internal) changed state from RESYNC to ONLINE.

         

        That was 18 months ago, and now I vaguely remember it.  I think I may have shut down, removed the drive, restarted, shut down again, and re-inserted it, or something along those lines, to try to force a resync.  That basically seemed to fix it, since there were no other disk errors for the 18 months after that.  But it does look like all is not happy in paradise; whatever happened 18 months ago must have messed up some files or checksums.  I don't know why scrub didn't catch it since then; it has run every quarter since, without errors of any kind.

         

        Anyway, I am scrubbing it again now and will try the copy to /dev/null to see if I can find any other bit rotted files and just get rid of them.  The open question is what I should do at this point to ensure my volume is healthy.  I'm not sure why disk 2 failed like that and then performed fine after the resync right up until now; that is question number one, and it brings up the question of whether I should replace the drive now.  But even if I do, if my data is compromised it makes me wonder whether I should completely nuke the whole thing and start over.  Like I said, though, other than these recent bit rot errors I haven't seen any disk errors or anything else to indicate a problem.  Nuking it and starting over would be a very painful experience for me; I don't have an easy way to do it and would definitely lose a bunch of data that I am not backing up to IDrive.  It's not mission critical data, but I still don't really want to lose it.  I don't have any other disks big enough to back it up locally either.  All of this is solvable with cash, of course, but it gets painful if that is the route I have to go.

        At some point I am going to move on past this ReadyNAS anyway to something else; I'm not sure what yet, maybe Unraid or Synology, or maybe an older Intel Mac Mini turned into my NAS.  I haven't decided, and I was hoping to kick that can down the road a few more years with this ReadyNAS, so I hate to spend money on it at the moment.
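        For the drive replacement question, I'll also look at the underlying RAID state over ssh before deciding; a sketch, where the md device name for the data volume is an assumption (it is often md127 on OS6, but /proc/mdstat shows the real one):

        cat /proc/mdstat                # lists the md arrays and whether all members show as [UU...]
        mdadm --detail /dev/md127       # per-member state, rebuild status, and event counts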

         

        One thing is certain: I need to make sure my mission critical data is backed up locally, and hopefully it's OK.  So far there are no bit rot errors there.

         

        Are there any other log files I should check to see what is going on?  Nothing else shows up in the frontview log.
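        One other thing I can do over ssh in the meantime is grep the kernel log for BTRFS checksum complaints, since those name the exact files affected; a rough sketch (log file path assumed):

        # checksum failures are logged by the kernel as the bad blocks are read;
        # each line identifies the inode/path and the device the bad copy came from
        dmesg -T | grep -i 'csum failed'
        grep -i 'csum failed' /var/log/kern.log 2>/dev/null | tail -n 50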
