
Forum Discussion

milind2021
Aspirant
Oct 07, 2023

btrfs errors for no apparent reason on RN212

I have been using this ReadyNAS for the past 1.5 years, but all of a sudden on Oct 4 I found I couldn't write to the share. Digging revealed a btrfs problem in the dmesg log:

BTRFS critical (device md127): corrupt leaf, slot offset bad: block=1064206336, root=1, slot=287
BTRFS critical (device md127): corrupt leaf, slot offset bad: block=1064206336, root=1, slot=287
BTRFS: error (device md127) in btrfs_run_delayed_refs:2995: errno=-5 IO failure
BTRFS info (device md127): forced readonly

along with the following advice in the logs on the Web UI: "Volume: The volume test1 encountered an error and was made read-only. It is recommended to backup your data."

This was good advice, since I hadn't properly backed up the contents of the NAS (I had a couple of other HDDs which held the data before they went onto this NAS). I went ahead, bought a 4TB WD Blue, slotted it into an external HDD case, attached it via eSATA, and ran the backup jobs which have completed. I will test the backup soon and then proceed to try fixing the btrfs issue.
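
For testing the backup, my plan is a checksum-based dry-run comparison along these lines; the paths are purely illustrative, since the actual share path and the eSATA mount point on the RN212 will differ:

# compare the share against the backup by checksum; -n (dry run) means nothing is changed,
# so any file it lists is one that differs between the source and the backup (or is missing on one side)
rsync -rcn --delete /data/test1/ /media/esata-backup/test1/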

Around the time I was getting used to the backup UI, I encountered thousands of repeated errors that continue to this moment:

BTRFS error (device md127): parent transid verify failed on 211156992 wanted 823860 found 823859

Any ideas what this means? Someone on StackOverflow thinks this has something to do with NFS — this seems sensible to me. Will bitrot protection help to prevent these situations?

Of course, I can redo the volume, but this error has come with no warning, no SMART errors, nothing - I would like to root-cause this if possible, especially given that this entire line of hardware is now unsupported and end-of-life. I'm not ready to give up and resort to backups every time this happens, if this is not an isolated occurrence.

15 Replies

  • StephenB
    Guru - Experienced User

    milind2021 wrote:

     

    BTRFS error (device md127): parent transid verify failed on 211156992 wanted 823860 found 823859

    Any ideas what this means? Someone on StackOverflow thinks this has something to do with NFS — this seems sensible to me.


    Nothing to do with NFS. Note the update on that post: a couple of days later the problem came back. These errors are linked to the earlier corrupt leaf problem with BTRFS. The on-disk metadata structures of your file system are damaged.

     

    Disk errors (including bit rot) can of course corrupt these structures. There are other possibilities: power loss, crashes, or improper shutdowns can result in cached writes never making it to the disk. If the file system becomes completely full, metadata updates might not happen properly. The damage might not show up right away, particularly if you don't reboot very often.
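
    If you want to see whether btrfs has already logged device-level problems, you can check its per-device error counters over SSH. This is just the generic btrfs-progs form, assuming the data volume is mounted at /data:

    # per-device counters kept by btrfs: read/write/flush I/O errors,
    # checksum corruption, and generation (transid) mismatches
    btrfs device stats /data

    # the same events usually show up in the kernel log as well
    dmesg | grep -i btrfs | tail -n 50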

     


    milind2021 wrote:

    "Volume: The volume test1 encountered an error and was made read-only. It is recommended to backup your data."

    This was good advice, since I hadn't properly backed up the contents of the NAS (I had a couple of other HDDs which held the data before they went onto this NAS). I went ahead, bought a 4TB WD Blue, slotted it into an external HDD case, attached it via eSATA, and ran the backup jobs which have completed. I will test the backup soon and then proceed to try fixing the btrfs issue.

     


    You did exactly the right thing.  Rebooting the NAS likely would have resulted in a lost volume at that point, and while fixing a btrfs error is sometimes possible, it still generally will result in some data loss.

     

     


    milind2021 wrote:

    Will bitrot protection help to prevent these situations?

     


    Maybe it will help some, but no guarantees.  It has occasionally kicked in over the years on my own NAS, but I haven't seen it actually succeed in repairing something.

     

    If you aren't using a UPS with the NAS, then I recommend getting one.  That eliminates the risk of an unexpected power loss corrupting the file system.

     

    I think keeping the volume below 90% full is one important measure - BTRFS metadata/on-disk structures use the same storage pool as the files themselves, so a full disk can lead to serious problems. 
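
    You can check how the pool is being used from SSH; again a generic example, assuming the volume is mounted at /data:

    # overall allocation, including how much space is reserved for metadata
    btrfs filesystem usage /data

    # shorter per-chunk-type summary (data / metadata / system)
    btrfs filesystem df /data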

     

    If you use snapshots, then I recommend avoiding the so-called "smart" snapshot setting.  Monthly snapshots are maintained forever, which eventually fills the volume.  Instead use the custom snapshots, and explicitly set the retention you want.  Set the "only make snapshots when there are changes" setting to reduce the clutter.  Then watch the snapshot space, and balance the retention against the acceptable overhead.

     

    I also recommend scheduling the maintenance functions. Both the disk test and the scrub exercise the disks (reading every sector), so the scrub can also double as a diagnostic. I cycle through one test per month, so each test is run 3 times a year. My schedule puts something between the scrub and the disk test (for example: test, balance, scrub, defrag), so every sector of the disks is read every other month. The idea here is to get early warning of disk problems - particularly important if you archive a lot of files that are only rarely read.
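
    If you ever want to run the btrfs parts of that by hand rather than from the schedule, the underlying commands look roughly like this (again assuming the volume is mounted at /data):

    # read back and checksum every allocated block; errors show up in dmesg and in the scrub status
    btrfs scrub start -B /data
    btrfs scrub status /data

    # rewrite partially-filled chunks to return allocated-but-unused space to the pool
    btrfs balance start -dusage=50 /data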

     

    Note you need to keep the NAS powered during these tests - the NAS won't resume them on power up.

     

     


    milind2021 wrote:

     

    Of course, I can redo the volume, but this error has come with no warning, no SMART errors, nothing - I would like to root-cause this if possible


    While it might be educational to troubleshoot this, I suggest redoing the volume anyway when you are done.

     

    These errors are rare (I have five OS-6 NAS running for a long time now, and have never seen this on my own systems). And BTRFS repair is considered dangerous/risky by the BTRFS developers themselves. Even if the repair appears to work, there could be some residual damage that would be hard to spot (and which could cause issues later on). So after I've helped folks remount failed volumes, I always recommend that they make a backup and then start over with a clean file system.
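
    If you do want to look at the extent of the damage before you rebuild, a read-only check is non-destructive. It has to be run against the unmounted file system, and this is just the generic btrfs-progs form using the md device from your logs:

    # inspect the metadata for consistency without writing anything to the disk
    # (only run this while the volume is not mounted)
    btrfs check --readonly /dev/md127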

     

    As far as SMART goes, I have had disk errors that never show up in the SMART stats (in fact a lot of disks fail with no SMART errors).  Most disks now log the errors internally, and you can see those errors with smartctl -x.  I have seen UNCs in those logs that never showed up anywhere else.

     

    You can also run the full disk test manually with smartctl, though that only confirms that the disk can be read, since it doesn't try to write anything.
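
    From SSH that looks something like this; the device name is just an example and will differ on your unit:

    # kick off the drive's built-in long self-test (it runs on the drive itself)
    smartctl -t long /dev/sda

    # read the result once the estimated runtime has passed
    smartctl -l selftest /dev/sda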

     


    milind2021 wrote:

    I'm not ready to give up and resort to backups every time this happens,


    RAID is not enough to keep your data safe - the best strategy for that is to keep copies on other devices. You were fortunate this time, as you saw the error message and followed the advice it gave. That might not happen next time, so it is important to put a backup plan in place.

     

    Note physical loss is one scenario (fire, flood, theft, power surges, ...), so cloud backup or off-site storage should be part of your backup plan.  Malware encryption is another scenario to consider.

     

    Personally I back up my primary NAS to other NAS daily, and augment that with cloud backup. The backup NAS are powered off at other times, in order to slow any malware spread to the backups. As my primary storage grows, I increase the backup storage to accommodate it. Although this costs money, data recovery is more expensive than backup (and often fails to recover what you need).

    • milind2021
      Aspirant

      Thanks for that detailed reply. A few points that you raised:

      1. The NAS is powered by an APC UPS connected via USB, which triggers a NAS shutdown when the battery runs low. I don't think there have been more than a few abrupt power-loss events in total, and none in roughly the past year.
      2. The volume is 70-75% full as of now.
      3. I am not yet using snapshots, but I know I'll have to use them once I begin regularly backing up.
      4. I have not yet set up the maintenance tasks; good point, I will make sure to schedule them going forward.

      Thanks for the "smartctl -x" command, but I don't see anything being flagged, but I found a couple of lines on the Seagate Ironwolf (the other drive is a WD Red CMR) that might be problematic:

      SMART Attributes Data Structure revision number: 10
      Vendor Specific SMART Attributes with Thresholds:
      ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
      1 Raw_Read_Error_Rate POSR-- 117 099 006 - 166373216
      3 Spin_Up_Time PO---- 096 096 000 - 0
      4 Start_Stop_Count -O--CK 100 100 020 - 164
      5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
      7 Seek_Error_Rate POSR-- 083 060 030 - 201372729
      9 Power_On_Hours -O--CK 074 074 000 - 22879
      10 Spin_Retry_Count PO--C- 100 100 097 - 0
      12 Power_Cycle_Count -O--CK 100 100 020 - 164
      184 End-to-End_Error -O--CK 100 100 099 - 0
      187 Reported_Uncorrect -O--CK 100 100 000 - 0
      188 Command_Timeout -O--CK 100 100 000 - 0
      189 High_Fly_Writes -O-RCK 054 054 000 - 46
      190 Airflow_Temperature_Cel -O---K 063 052 045 - 37 (Min/Max 24/37)
      191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
      192 Power-Off_Retract_Count -O--CK 100 100 000 - 91
      193 Load_Cycle_Count -O--CK 100 100 000 - 193
      194 Temperature_Celsius -O---K 037 048 000 - 37 (0 20 0 0 0)
      197 Current_Pending_Sector -O--C- 100 100 000 - 0
      198 Offline_Uncorrectable ----C- 100 100 000 - 0
      199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
      ||||||_ K auto-keep
      |||||__ C event count
      ||||___ R error rate
      |||____ S speed/performance
      ||_____ O updated online
      |______ P prefailure warning

      The raw read and seek error rates seem high, but the normalized "VALUE" column suggests that's just the internal way Seagate records them and that they're not an issue. The WD drive did not have any such entries.

      Do you recommend a memory test? A number of other btrfs posts on the internet suggest that RAM corruption is a more likely cause of these issues than disk corruption.

      • StephenB
        Guru - Experienced User

        I don't see anything concerning in the Seagate SMART stats.

         


        milind2021 wrote:

         

        Do you recommend a memory test? A number of other btrfs posts on the internet suggest that RAM corruption is a more likely cause of these issues than disk corruption.


        I'm not sure memory corruption is more likely, but there is no harm in running the memory test.
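
        If you would rather test from the running system instead of the NAS's built-in memory test, memtester is one option. It is not part of the stock firmware, so treat this as a generic Linux sketch:

        # allocate, lock, and pattern-test 512 MB of RAM for one pass (needs root to lock memory)
        memtester 512M 1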
