
Forum Discussion

milind2021
Aspirant
Oct 07, 2023

btrfs errors for no apparent reason on RN212

I have been using this ReadyNAS for the past 1.5 years, but all of a sudden on Oct 4 I found I couldn't write to the share. Digging revealed a btrfs problem in the dmesg log:

BTRFS critical (device md127): corrupt leaf, slot offset bad: block=1064206336, root=1, slot=287
BTRFS critical (device md127): corrupt leaf, slot offset bad: block=1064206336, root=1, slot=287
BTRFS: error (device md127) in btrfs_run_delayed_refs:2995: errno=-5 IO failure
BTRFS info (device md127): forced readonly

along with the following advice in the logs on the Web UI: "Volume: The volume test1 encountered an error and was made read-only. It is recommended to backup your data."

This was good advice, since I hadn't properly backed up the contents of the NAS (I had a couple of other HDDs which held the data before they went onto this NAS). I went ahead, bought a 4TB WD Blue, slotted it into an external HDD case, attached it via eSATA, and ran the backup jobs which have completed. I will test the backup soon and then proceed to try fixing the btrfs issue.
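
For testing the backup, my plan is a checksum-based dry-run comparison along these lines; the paths are purely illustrative, since the actual share path and the eSATA mount point on the RN212 will differ:

# compare the share against the backup by checksum; -n (dry run) means nothing is changed,
# so any file it lists is one that differs between the source and the backup (or is missing on one side)
rsync -rcn --delete /data/test1/ /media/esata-backup/test1/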

Around the time I was getting used to the backup UI, I encountered thousands of repeated errors that continue to this moment:

BTRFS error (device md127): parent transid verify failed on 211156992 wanted 823860 found 823859

Any ideas what this means? Someone on StackOverflow thinks this has something to do with NFS — this seems sensible to me. Will bitrot protection help to prevent these situations?

Of course, I can redo the volume, but this error has come with no warning, no SMART errors, nothing - I would like to root-cause this if possible, especially given that this entire line of hardware is now unsupported and end-of-life. I'm not ready to give up and resort to backups every time this happens, if this is not an isolated occurrence.

15 Replies

  • StephenB
    Guru - Experienced User

    milind2021 wrote:

     

    BTRFS error (device md127): parent transid verify failed on 211156992 wanted 823860 found 823859

    Any ideas what this means? Someone on StackOverflow thinks this has something to do with NFS — this seems sensible to me.


    Nothing to do with NFS. Note the update on that post: a couple of days later the problem came back. These errors are linked to the earlier corrupt leaf problem with BTRFS. The on-disk metadata structures of your file system are damaged.

     

    Disk errors (including bit rot) can of course corrupt these structures. There are other possibilities: power loss, crashes, or improper shutdowns can result in cached writes never making it to the disk. If the file system becomes completely full, metadata updates might not happen properly. The damage might not show up right away, particularly if you don't reboot very often.
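
    If you want to see whether btrfs has already logged device-level problems, you can check its per-device error counters over SSH. This is just the generic btrfs-progs form, assuming the data volume is mounted at /data:

    # per-device counters kept by btrfs: read/write/flush I/O errors,
    # checksum corruption, and generation (transid) mismatches
    btrfs device stats /data

    # the same events usually show up in the kernel log as well
    dmesg | grep -i btrfs | tail -n 50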

     


    milind2021 wrote:

    "Volume: The volume test1 encountered an error and was made read-only. It is recommended to backup your data."

    This was good advice, since I hadn't properly backed up the contents of the NAS (I had a couple of other HDDs which held the data before they went onto this NAS). I went ahead, bought a 4TB WD Blue, slotted it into an external HDD case, attached it via eSATA, and ran the backup jobs which have completed. I will test the backup soon and then proceed to try fixing the btrfs issue.

     


    You did exactly the right thing.  Rebooting the NAS likely would have resulted in a lost volume at that point, and while fixing a btrfs error is sometimes possible, it still generally will result in some data loss.

     

     


    milind2021 wrote:

    Will bitrot protection help to prevent these situations?

     


    Maybe it will help some, but no guarantees.  It has occasionally kicked in over the years on my own NAS, but I haven't seen it actually succeed in repairing something.

     

    If you aren't using a UPS with the NAS, then I recommend getting one.  That eliminates the risk of an unexpected power loss corrupting the file system.

     

    I think keeping the volume below 90% full is one important measure - BTRFS metadata/on-disk structures use the same storage pool as the files themselves, so a full disk can lead to serious problems. 
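
    You can check how the pool is being used from SSH; again a generic example, assuming the volume is mounted at /data:

    # overall allocation, including how much space is reserved for metadata
    btrfs filesystem usage /data

    # shorter per-chunk-type summary (data / metadata / system)
    btrfs filesystem df /data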

     

    If you use snapshots, then I recommend avoiding the so-called "smart" snapshot setting.  Monthly snapshots are maintained forever, which eventually fills the volume.  Instead use the custom snapshots, and explicitly set the retention you want.  Set the "only make snapshots when there are changes" setting to reduce the clutter.  Then watch the snapshot space, and balance the retention against the acceptable overhead.

     

    I also recommend scheduling the maintenance functions. Both the disk test and the scrub exercise the disks (reading every sector), so the scrub can also double as a diagnostic. I cycle through one test per month, so each test is run 3 times a year. My schedule puts something between the scrub and the disk test (for example: test, balance, scrub, defrag), so every sector of the disks is read every other month. The idea here is to get early warning of disk problems - particularly important if you archive a lot of files that are only rarely read.
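
    If you ever want to run the btrfs parts of that by hand rather than from the schedule, the underlying commands look roughly like this (again assuming the volume is mounted at /data):

    # read back and checksum every allocated block; errors show up in dmesg and in the scrub status
    btrfs scrub start -B /data
    btrfs scrub status /data

    # rewrite partially-filled chunks to return allocated-but-unused space to the pool
    btrfs balance start -dusage=50 /data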

     

    Note you need to keep the NAS powered during these tests - the NAS won't resume them on power up.

     

     


    milind2021 wrote:

     

    Of course, I can redo the volume, but this error has come with no warning, no SMART errors, nothing - I would like to root-cause this if possible


    While it might be educational to troubleshoot this, I suggest redoing the volume anyway when you are done.

     

    These errors are rare (I have five OS-6 NAS running for a long time now, and have never seen this on my own systems). And BTRFS repair is considered dangerous/risky by the BTRFS developers themselves. Even if the repair appears to work, there could be some residual damage that would be hard to spot (and which could cause issues later on). So after I've helped folks remount failed volumes, I always recommend that they make a backup and then start over with a clean file system.
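
    If you do want to look at the extent of the damage before you rebuild, a read-only check is non-destructive. It has to be run against the unmounted file system, and this is just the generic btrfs-progs form using the md device from your logs:

    # inspect the metadata for consistency without writing anything to the disk
    # (only run this while the volume is not mounted)
    btrfs check --readonly /dev/md127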

     

    As far as SMART goes, I have had disk errors that never show up in the SMART stats (in fact a lot of disks fail with no SMART errors).  Most disks now log the errors internally, and you can see those errors with smartctl -x.  I have seen UNCs in those logs that never showed up anywhere else.

     

    You can also run the full disk test manually with smartctl, though that only confirms that the disk can be read, since it doesn't try to write anything.
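
    From SSH that looks something like this; the device name is just an example and will differ on your unit:

    # kick off the drive's built-in long self-test (it runs on the drive itself)
    smartctl -t long /dev/sda

    # read the result once the estimated runtime has passed
    smartctl -l selftest /dev/sda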

     


    milind2021 wrote:

    I'm not ready to give up and resort to backups every time this happens,


    RAID is not enough to keep your data safe - the best strategy for that is to keep copies on other devices. You were fortunate this time, as you saw the error message and followed the advice it gave. That might not happen next time, so it is important to put a backup plan in place.

     

    Note physical loss is one scenario (fire, flood, theft, power surges, ...), so cloud backup or off-site storage should be part of your backup plan.  Malware encryption is another scenario to consider.

     

    Personally I back up my primary NAS to other NAS daily, and augment that with cloud backup. The backup NAS are powered off at other times, in order to slow any malware spread to the backups. As my primary storage grows, I increase the backup storage to accommodate it. Although this costs money, data recovery is more expensive than backup (and often fails to recover what you need).

    • milind2021
      Aspirant

      Thanks for that detailed reply. A few points that you raised:

      1. The NAS is powered by an APC UPS connected via USB, which triggers a NAS shutdown when the battery runs low. I don't think there have been more than a few abrupt power-loss events in total, and none in roughly the past year.
      2. The volume is 70-75% full as of now.
      3. I am not yet using snapshots, but I know I'll have to use them once I begin regularly backing up.
      4. I have not yet set up the maintenance tasks; good point, I will make sure to schedule them going forward.

      Thanks for the "smartctl -x" command, but I don't see anything being flagged, but I found a couple of lines on the Seagate Ironwolf (the other drive is a WD Red CMR) that might be problematic:

      SMART Attributes Data Structure revision number: 10
      Vendor Specific SMART Attributes with Thresholds:
      ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
      1 Raw_Read_Error_Rate POSR-- 117 099 006 - 166373216
      3 Spin_Up_Time PO---- 096 096 000 - 0
      4 Start_Stop_Count -O--CK 100 100 020 - 164
      5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
      7 Seek_Error_Rate POSR-- 083 060 030 - 201372729
      9 Power_On_Hours -O--CK 074 074 000 - 22879
      10 Spin_Retry_Count PO--C- 100 100 097 - 0
      12 Power_Cycle_Count -O--CK 100 100 020 - 164
      184 End-to-End_Error -O--CK 100 100 099 - 0
      187 Reported_Uncorrect -O--CK 100 100 000 - 0
      188 Command_Timeout -O--CK 100 100 000 - 0
      189 High_Fly_Writes -O-RCK 054 054 000 - 46
      190 Airflow_Temperature_Cel -O---K 063 052 045 - 37 (Min/Max 24/37)
      191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
      192 Power-Off_Retract_Count -O--CK 100 100 000 - 91
      193 Load_Cycle_Count -O--CK 100 100 000 - 193
      194 Temperature_Celsius -O---K 037 048 000 - 37 (0 20 0 0 0)
      197 Current_Pending_Sector -O--C- 100 100 000 - 0
      198 Offline_Uncorrectable ----C- 100 100 000 - 0
      199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
      ||||||_ K auto-keep
      |||||__ C event count
      ||||___ R error rate
      |||____ S speed/performance
      ||_____ O updated online
      |______ P prefailure warning

      The raw read and seek error rates seem high, but the normalized "VALUE" column suggests that's just the internal way Seagate records them and that they're not an issue. The WD drive did not have any such entries.

      Do you recommend a memory test? A number of other btrfs posts on the internet suggest that RAM corruption is a more likely cause of these issues than disk corruption.

      • StephenB
        Guru - Experienced User

        I don't see anything concerning in the Seagate SMART stats.

         


        milind2021 wrote:

         

        Do you recommend a memory test? A number of other btrfs posts on the internet suggest that RAM corruption is a more likely cause of these issues than disk corruption.


        I'm not sure memory corruption is more likely, but there is no harm in running the memory test.
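
        If you would rather test from the running system instead of the NAS's built-in memory test, memtester is one option. It is not part of the stock firmware, so treat this as a generic Linux sketch:

        # allocate, lock, and pattern-test 512 MB of RAM for one pass (needs root to lock memory)
        memtester 512M 1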
