Forum Discussion
Dewdman42
Dec 10, 2024 · Virtuoso
Lots of bitrot errors, advice needed
My ReadyNAS is functioning for the most part, but recently I found out that idrive had not been backing it up for quite a while, months...so I fixed that and finally ended up re-backing up the whole ...
StephenB
Dec 11, 2024 · Guru - Experienced User
Dewdman42 wrote:
I have it regularly do all the maintenance tasks as well.
Generally the theory on bit rot is that data on a disk is somehow silently changing. The BTRFS checksums and RAID parity blocks then pick up the errors. Netgear hasn't published details on how they are attempting the repair. But I think they are trying all possible combinations of RAID (first assuming that disk 1 is errored, then assuming disk 2, etc), and are looking for a combination where the BTRFS checksums become correct.
The "silently changing" bit could be induced by memory errors in cached writes, or it could simply be a failing disk. Both would be rare. The disk has to fail in a way that results in an errored read that is not detected by the CRC code on the drive. Memory errors do happen sometimes (cosmic rays have been known to induce them). I don't think you can rule out the full volume as a cause, since BTRFS can behave badly when the volume gets full. Lost writes due to an unclean shutdown might also induce this problem.
If these files are only occasionally read, then it is not surprising that it took a long time to detect the errors. But a scrub should have found them. How often are you running the scrubs?
Also, is this happening on your RN524? That ships with ECC memory, so odds of an undetected memory error would be lower on the RN524 than they are on the Ultra 2.
I'd run a scrub now, and see if any more of these errors turn up. You could alternatively try to read every file (perhaps using ssh to copy each to /dev/zero).
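For example, from an ssh session something along these lines would do it. This is just a sketch: it assumes the data volume is mounted at /data (the usual location on OS 6), and the exact paths on your unit may differ:

# Kick off a manual scrub of the data volume (the Volumes page in the admin UI does the same thing)
btrfs scrub start -Bd /data

# Read every file and discard the data; any file that can't be read cleanly
# gets reported here, and the csum errors will also show up in dmesg
find /data -type f -exec sh -c 'cat "$1" >/dev/null || echo "READ ERROR: $1"' _ {} \;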
Dewdman42
Dec 11, 2024 · Virtuoso
All you're saying makes sense, as usual. I have scrubbing scheduled quarterly, and it has been that way for years. So I don't know whether that means scrub didn't catch and fix some problems, or whether the problem is new.
I am still finishing up the idrive backup; it will take another day or so. After that I will try copying everything to /dev/null to see if that catches the same bit rot errors, before I actually remove the questionable files.
Are there any particular SMART logs I should have a look at before presuming the drives aren't failing? It does seem like the most likely culprit may be that I kept the volume rather full for a while, with less than 5% free space; it's now at 30% free. But if that somehow contributed to the RAID getting out of sync in a way it cannot correct, that is rather concerning in and of itself, I have to say. Thank god I got idrive fixed and it's nearly backed up, so I will lose these bit-rotted files I guess. None of them appear to be anything important that I can't afford to lose, so that's fine, but I still have low confidence in the system now. And I am wondering if I should routinely do anything to try to find and correct bit rot before it can no longer be corrected.
Scrubbing, I presume, goes through and attempts to detect these kinds of issues and theoretically correct them, so unless the volume got messed up suddenly and recently on dozens of files, I think the system should have found and corrected this a long time ago.
Yes, it's my RN524. I did upgrade the RAM years ago, but as I recall I also used ECC RAM. Improper shutdowns may have occurred from power outages somewhere along the line, but not often. Either way, it's still a mystery why the bit rot wasn't detected and corrected a long time ago.
- Dewdman42 · Dec 11, 2024 · Virtuoso
Well, so looking back further in the logs I do find something that must be the reason for this, 18 months ago; see below. I am not actually sure what happened there: a drive was degraded and then resynced and came back online again. I'm not sure why it was degraded, and in the 18 months since it came back online there have been no other errors, but something about that occurrence must have messed up checksums or data in the process.
from status.log:
[23/07/05 03:00:02 MDT] notice:volume:LOGMSG_SCRUBSTARTED_VOLUME Scrub started for volume data.
[23/07/07 08:36:18 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [224] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 11 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 09:32:08 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [244] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 12 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 12:29:10 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [283] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 13 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 13:37:54 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [379] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 14 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 14:50:29 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [402] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 15 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 16:36:42 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [459] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 16 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 17:41:09 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [564] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 17 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 18:03:42 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [621] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 18 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 19:40:25 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [639] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 19 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 20:51:54 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [650] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 20 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 21:01:26 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [684] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 21 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 22:15:18 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [810] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 22 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 22:31:09 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [762] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 23 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 00:09:13 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [763] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 24 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 00:41:15 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [764] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 25 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 01:36:56 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [765] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 26 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 03:56:12 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [777] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 27 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 04:16:26 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [786] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 28 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 04:41:14 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [783] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 29 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 15:06:30 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [782] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 30 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/09 03:06:33 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [789] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 31 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/09 03:22:22 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [797] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 32 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/09 03:52:40 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [878] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 33 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/09 03:52:54 MDT] warning:volume:LOGMSG_HEALTH_VOLUME Volume data health changed from Redundant to Degraded.
[23/07/09 03:53:02 MDT] err:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 2 (Internal) changed state from ONLINE to FAILED.
[23/07/09 04:24:41 MDT] notice:volume:LOGMSG_SCRUBCOMPLETE_VOLUME Scrub completed for volume data.
[23/07/09 08:36:25 MDT] notice:system:LOGMSG_SYSTEM_HALT The system is shutting down.
[23/07/13 16:25:12 MDT] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 10% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
[23/07/13 16:25:12 MDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
[23/07/13 16:25:13 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [933] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 34 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/13 16:25:17 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[23/07/13 16:25:44 MDT] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
[23/07/13 16:32:25 MDT] notice:system:LOGMSG_SYSTEM_HALT The system is shutting down.
[23/07/13 16:35:21 MDT] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 10% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
[23/07/13 16:35:22 MDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
[23/07/13 16:35:26 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[23/07/13 16:36:27 MDT] notice:system:LOGMSG_SYSTEM_HALT The system is shutting down.
[23/07/13 16:46:05 MDT] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 10% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
[23/07/13 16:46:06 MDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
[23/07/13 16:46:09 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[23/07/13 16:46:56 MDT] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
[23/07/14 01:00:17 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[23/07/14 17:15:29 MDT] notice:volume:LOGMSG_RESILVERCOMPLETE_VOLUME Volume data is resynced.
[23/07/14 17:15:35 MDT] notice:volume:LOGMSG_HEALTH_VOLUME Volume data health changed from Degraded to Redundant.
[23/07/14 17:15:40 MDT] notice:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 2 (Internal) changed state from RESYNC to ONLINE.
That was 18 months ago, and now I vaguely remember this. I think I may have shut down, removed the drive, restarted, shut down again, and re-inserted it, or something along those lines, to try to force a resync. That basically seemed to fix it, since there were no other disk errors in the 18 months after that. But it does look like all is not happy in paradise; whatever happened 18 months ago must have messed up some files or checksums. I don't know why scrub hasn't caught it since then, as it has run every quarter without errors of any kind.
Anyway, I am scrubbing it again now and will try the copy-to-/dev/null approach to see if I can find any other bit-rotted files and just get rid of them. The open question is what I should do at this point to ensure my volume is healthy. I'm not sure why disk 2 failed like that and then performed fine after the resync up until now; that is question #1, and it raises the question of whether I should replace the drive now. But even if I do, if my data is compromised it makes me wonder whether I should completely nuke the whole thing and start over. Like I said, though, other than these recent bit rot errors I haven't seen any other disk errors or anything to indicate a problem. Nuking it and starting over would be a very painful experience for me: I don't have an easy way to do it, and I would definitely lose a bunch of data that I am not backing up to idrive. It's not mission-critical data, but I still don't really want to lose it. I don't have any other disks big enough to back it up locally either. All solvable with cash, of course, but it gets painful if that is the route I have to go. At some point I am going to move past this ReadyNAS anyway to something else, not sure what yet: Unraid or Synology, or maybe an older Intel Mac Mini turned into my NAS. I haven't decided; I was hoping to kick that can down the road a few more years with this ReadyNAS, so I hate to spend money on it at the moment.
One thing is certain: I need to make sure my mission-critical data is backed up locally, and hopefully it's OK. So far there are no bit rot errors there.
Are there any other log files I should check to see what is going on? Nothing else shows up in the Frontview log.
- StephenB · Dec 11, 2024 · Guru - Experienced User
Dewdman42 wrote:
Well, so looking back further in the logs I do find something that must be the reason for this, 18 months ago; see below.
I agree it is at least related.
800+ pending sectors does mean the disk needs to be replaced.
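If you want to double-check the counts yourself, smartctl over ssh will show the raw SMART attributes. Just a sketch: disk 2 is often /dev/sdb, but confirm the device name against the admin UI before reading anything into it.

# Full SMART report for the suspect drive, filtered to the sector-health attributes
smartctl -a /dev/sdb | grep -Ei 'Current_Pending_Sector|Reallocated_Sector|Offline_Uncorrectable'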
- Dewdman42 · Dec 14, 2024 · Virtuoso
What are you referring to about the 800 pending sectors? Please forgive me if this is obvious.
I haven't found any errors other than those from 18 months ago, when the drive was resynced for some reason; the errors were gone after that, except for these bit rot errors now while doing the big re-backup.
- StephenB · Dec 11, 2024 · Guru - Experienced User
Dewdman42 wrote:
Are there any particular SMART logs I should have a look at before presuming the drives aren't failing?
smart_history.log is worth a look. If the issue appears related to startup, then dmesg.log is the best place to start.
I also look in system.log, kernel.log, and systemd-journal.log.
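If you would rather grep from an ssh session than page through the downloaded log zip, something like this pulls the relevant lines. It assumes the kernel ring buffer and the journal still hold them, which isn't guaranteed after a reboot:

# BTRFS checksum complaints and ATA errors currently in the kernel ring buffer
dmesg | grep -iE 'btrfs|csum|ata[0-9]'

# The same from the systemd journal, kernel messages only (if the journal is persistent)
journalctl -k --no-pager | grep -i 'csum failed'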
- Dewdman42 · Dec 14, 2024 · Virtuoso
StephenB wrote:
Dewdman42 wrote: Are there any particular SMART logs I should have a look at before presuming the drives aren't failing?
smart_history.log is worth a look. If the issue appears related to startup, then dmesg.log is the best place to start.
I also look in system.log, kernel.log, and systemd-journal.log.
So smart_history.log, no errors.
dmesg.log has a lot of file system warnings that look similar to this, and these were happening over the past few days while doing the large re-backup:
[Thu Dec 12 18:43:28 2024] BTRFS warning (device md127): csum failed ino 1665367 off 518217728 csum 1087457351 expected csum 3010218967
[Thu Dec 12 18:43:28 2024] BTRFS warning (device md127): csum failed ino 1665367 off 518217728 csum 1087457351 expected csum 3010218967
system.log: I don't know what to look for in there, but I don't see anything disk-related at first look.
kernel.log has the same warnings as dmesg.log, about BTRFS csum failed.
systemd-journal.log appears to have the bit rot errors from while I was doing the re-backup. I don't know what else to look for in there; I don't see anything else initially.
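I'm also planning to map the inode numbers in those csum warnings back to actual file paths, so I know exactly which files are affected. From what I've read, something like this should do it, assuming the volume is mounted at /data (I may need to point it at the specific share subvolume, e.g. /data/sharename, instead):

# Resolve the inode from the dmesg warning (device md127, inode 1665367) to a file path
btrfs inspect-internal inode-resolve 1665367 /data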
So the above messages could all be related to the bit rot that was detected. I'm still not sure how to determine whether the drive actually should be replaced or not. I feel like it somehow got out of sync 18 months ago, and then after I removed and reseated it, no more disk problems were reported, other than this bit rot, which for all I know may have happened 18 months ago while it was out of sync. I don't really know. I had not done a complete re-backup of this volume in quite a long time, and the files in question had not been accessed, so the bit rot detection was never triggered.
Plus, I'm not even sure that replacing the disk that got out of sync 18 months ago will actually fix whatever checksum weirdness I have now. If the entire volume is compromised in some way, it seems like I might have to transfer the data to some other device and rebuild the entire RAID array from scratch. Just thinking out loud here, but again, if there is some way to know the drive is bad and that replacing it will cure all concerns, I am not opposed to doing that; I'm just not sure how to determine that the drive is actually the problem.
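In the meantime I figure I can at least sanity-check the md array itself from ssh. As far as I understand it, something like this should show whether all members are present and in sync (md127 is the device named in the BTRFS warnings, though I'll confirm which md device actually backs the data volume):

# Overview of all md arrays and their sync state
cat /proc/mdstat

# Detailed member and state info for the array the BTRFS warnings point at
mdadm --detail /dev/md127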