Forum Discussion
Dewdman42
Dec 10, 2024 · Virtuoso
Lots of bitrot errors, advice needed
My ReadyNAS is functioning for the most part, but recently I found out that IDrive had not been backing it up for quite a while (months), so I fixed that and finally ended up re-backing up the whole ...
Dewdman42
Dec 11, 2024 · Virtuoso
Everything you're saying makes sense, as usual. I have scrubbing scheduled quarterly, and it has been that way for years. So I don't know whether that means scrub didn't catch and fix some problems, or whether the problem is new.
I am still finishing up the IDrive backup, which will take another day or so, but after that I will try copying everything to /dev/null to see if that catches the same bit-rot errors before I actually remove the questionable files.
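In case it's useful, this is roughly the loop I have in mind, assuming the shares are mounted under /data (the path is a guess; adjust it to the actual volume):

    # Read every file and throw the bytes away; BTRFS verifies checksums on
    # read, so a bit-rotted file fails with an I/O error and gets recorded.
    find /data -type f -exec sh -c \
      'cat "$1" > /dev/null || echo "BAD: $1" >> /tmp/badfiles.txt' _ {} \;

Anything that ends up in /tmp/badfiles.txt is a file the filesystem could not read back cleanly.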
Are there any particular SMART logs I should look at before presuming the drives aren't failing? The most likely culprit seems to be that the volume was quite full for a while, with less than 5% free space; it is now at 30% free. But if that somehow contributed to the RAID getting out of sync in a way it cannot correct, that is rather concerning in and of itself, I have to say. Thank god I got IDrive fixed and it's nearly backed up, so I will lose these bit-rotted files, I guess. None of them appear to be anything important that I can't afford to lose, so that's fine, but I still have low confidence in the system now. And I am wondering whether I should routinely do anything to find and correct bit rot before it can no longer be corrected.
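In the meantime I can pull the raw SMART attributes myself over SSH; something like this, assuming the four disks show up as /dev/sda through /dev/sdd (the device names are a guess):

    # Nonzero raw values on Reallocated_Sector_Ct (5), Current_Pending_Sector
    # (197), or Offline_Uncorrectable (198) point at a failing disk.
    for d in /dev/sd[a-d]; do
      echo "== $d =="
      smartctl -A "$d" | grep -E 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'
    done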
Scrubbing, I presume, goes through and attempts to detect these kinds of issues and, in theory, correct them. But unless the volume suddenly and recently got messed up across dozens of files, the system should have found and corrected this a long time ago.
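If it helps to check by hand, I believe the manual equivalent over SSH is something like this (the /data mount point is an assumption):

    btrfs scrub start /data     # kicks off a checksum pass in the background
    btrfs scrub status /data    # progress, plus counts of csum/read errors found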
Yes, it's my RN524. I did upgrade the RAM years ago, but as I recall I used ECC RAM for that too. Improper shutdowns may have occurred from power outages somewhere along the line, but not often. Either way, it is still a mystery why the bit rot wasn't detected and corrected a long time ago.
Dewdman42
Dec 11, 2024 · Virtuoso
Looking back further in the logs, I found something from 18 months ago that must be the reason for this; see below. I am not actually sure what happened there: a drive was downgraded, then resynced, and came back online again. I'm not sure why it was downgraded, and in the 18 months since it came back online there have been no other errors, but something about that occurrence must have corrupted checksums or data in the process.
From status.log:
[23/07/05 03:00:02 MDT] notice:volume:LOGMSG_SCRUBSTARTED_VOLUME Scrub started for volume data.
[23/07/07 08:36:18 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [224] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 11 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 09:32:08 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [244] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 12 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 12:29:10 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [283] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 13 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 13:37:54 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [379] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 14 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 14:50:29 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [402] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 15 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 16:36:42 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [459] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 16 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 17:41:09 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [564] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 17 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 18:03:42 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [621] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 18 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 19:40:25 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [639] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 19 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 20:51:54 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [650] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 20 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 21:01:26 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [684] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 21 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 22:15:18 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [810] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 22 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/07 22:31:09 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [762] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 23 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 00:09:13 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [763] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 24 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 00:41:15 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [764] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 25 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 01:36:56 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [765] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 26 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 03:56:12 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [777] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 27 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 04:16:26 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [786] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 28 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 04:41:14 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [783] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 29 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/08 15:06:30 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [782] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 30 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/09 03:06:33 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [789] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 31 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/09 03:22:22 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [797] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 32 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/09 03:52:40 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [878] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 33 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/09 03:52:54 MDT] warning:volume:LOGMSG_HEALTH_VOLUME Volume data health changed from Redundant to Degraded.
[23/07/09 03:53:02 MDT] err:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 2 (Internal) changed state from ONLINE to FAILED.
[23/07/09 04:24:41 MDT] notice:volume:LOGMSG_SCRUBCOMPLETE_VOLUME Scrub completed for volume data.
[23/07/09 08:36:25 MDT] notice:system:LOGMSG_SYSTEM_HALT The system is shutting down.
[23/07/13 16:25:12 MDT] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 10% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
[23/07/13 16:25:12 MDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
[23/07/13 16:25:13 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [933] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 34 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
[23/07/13 16:25:17 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[23/07/13 16:25:44 MDT] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
[23/07/13 16:32:25 MDT] notice:system:LOGMSG_SYSTEM_HALT The system is shutting down.
[23/07/13 16:35:21 MDT] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 10% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
[23/07/13 16:35:22 MDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
[23/07/13 16:35:26 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[23/07/13 16:36:27 MDT] notice:system:LOGMSG_SYSTEM_HALT The system is shutting down.
[23/07/13 16:46:05 MDT] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 10% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
[23/07/13 16:46:06 MDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
[23/07/13 16:46:09 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[23/07/13 16:46:56 MDT] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
[23/07/14 01:00:17 MDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[23/07/14 17:15:29 MDT] notice:volume:LOGMSG_RESILVERCOMPLETE_VOLUME Volume data is resynced.
[23/07/14 17:15:35 MDT] notice:volume:LOGMSG_HEALTH_VOLUME Volume data health changed from Degraded to Redundant.
[23/07/14 17:15:40 MDT] notice:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 2 (Internal) changed state from RESYNC to ONLINE.
That was 18 months ago, and now I vaguely remember it. I think I shut down, removed the drive, restarted and shut down again, and re-inserted it, or something along those lines, to force a resync. That basically seemed to fix it, since there were no other disk errors in the 18 months after that. But it does look like all is not happy in paradise; whatever happened 18 months ago must have messed up some files or checksums. I don't know why scrub didn't catch it since then, as it has run every quarter since without errors of any kind.
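To double-check that the underlying RAID is actually in sync now, my plan is to look at the md layer over SSH, since OS6 runs BTRFS on top of md RAID (the md device name varies per system, so md127 here is a guess):

    cat /proc/mdstat            # each array should list all members, e.g. [UUUU]
    mdadm --detail /dev/md127   # want "State : clean" with no failed/removed devices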
Anyway, I am scrubbing it again now and will try the copy-to-/dev/null test to see if I can find any other bit-rotted files and just get rid of them. The open question is what I should do at this point to ensure my volume is healthy. I'm not sure why disk 2 failed like that and then performed fine from the resync up until now; that is question #1, and it raises the question of whether I should replace the drive now. Even if I do, if my data is compromised it makes me wonder whether I should completely nuke the whole thing and start over. But, like I said, other than these recent bit-rot errors, I haven't seen any other disk errors or anything to indicate a problem.
Nuking it and starting over would be a very painful experience for me. I don't have an easy way to do it, and I would definitely lose a bunch of data that I am not backing up to IDrive; not mission-critical data, but I still don't really want to lose it. I don't have any other disks big enough to back it up locally, either. All solvable with cash, of course, but it gets painful if that is the route I have to go. At some point I am going to move on past this ReadyNAS anyway, to something else; I'm not sure what yet. Unraid or Synology, or maybe an older Intel Mac mini turned into my NAS. I haven't decided yet, and I was hoping to kick that can down the road a few more years with this ReadyNAS, so I hate to spend money on it at the moment.
One thing is certain: I need to make sure my mission-critical data is backed up locally, and hopefully it's OK. So far, no bit-rot errors there.
Are there any other log files I should check to see what is going on? Nothing else shows up in the Frontview log.
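So far these are the only other places I know to look, over SSH (the syslog path and /data mount point are guesses on my part):

    btrfs device stats /data                       # per-disk read/write/corruption error counters
    dmesg | grep -iE 'btrfs|ata error|i/o error'   # kernel-level I/O and checksum complaints
    grep -i csum /var/log/syslog                   # BTRFS checksum failures often land here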
- StephenB · Dec 11, 2024 · Guru - Experienced User
Dewdman42 wrote:
Looking back further in the logs, I found something from 18 months ago that must be the reason for this; see below.
I agree it is at least related.
800+ pending sectors does mean the disk needs to be replaced.
- Dewdman42 · Dec 14, 2024 · Virtuoso
What are you referring to about the 800 pending sectors? Please forgive me if this is obvious.
I haven't found any other errors besides those from 18 months ago, when the drive was resynced for some reason; the errors were gone after that, except for these bit-rot errors now while doing the big re-backup.
- StephenB · Dec 14, 2024 · Guru - Experienced User
Dewdman42 wrote:
What are you referring to about the 800 pending sectors? Please forgive me if this is obvious.
See status.log above:
[23/07/09 03:52:40 MDT] crit:disk:LOGMSG_SMART_PENDING_SECT_30DAYS_WARN Detected increasing pending sector: count [878] on disk 2 (Internal) [WDC WD60EFRX-68L0BN1, WD-WX11DA7R3J4V] 33 times in the past 30 days. This condition often indicates an impending failure. Be prepared to replace this disk to maintain data redundancy.
This problem goes back to July 2023. "Pending Sectors" are sectors that could not be read (generating an error, but not reallocated). The drive was marked as failed by the ReadyNAS (also on 9 July), but the reboot cleared that status. Although the system did eventually resync that disk after a couple of reboots, I suspect the issues on the disk caused the problem.
The volume was also very full around that time:
[23/07/13 16:46:05 MDT] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 10% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
BTRFS doesn't behave well when it runs out of usable free space, so I always recommend keeping at least 15% free.
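If you want to see the real headroom, the plain df number can be misleading on BTRFS; over SSH (assuming the volume is mounted at /data):

    btrfs filesystem usage /data   # allocated vs. unallocated; low unallocated space is the danger sign
    btrfs filesystem df /data      # data/metadata/system breakdown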