Forum Discussion
milind2021
Oct 07, 2023, Aspirant
btrfs errors for no apparent reason on RN212
I have been using this Readynas for the past 1.5 years, but all of a sudden on Oct 4, I found I couldn't write to the share. Digging revealed a btrfs problem in dmesg log: BTRFS critical (device md1...
milind2021
Oct 07, 2023, Aspirant
Thanks for that detailed reply. A few points that you raised:
- The NAS is powered through an APC UPS connected via USB, which triggers a NAS shutdown when the battery runs low. I don't expect there to have been more than a few abrupt power-loss events in total, and none in roughly the past year.
- The volume is 70-75% full as of now.
- I am not yet using snapshots, but I know I'll have to use them once I begin regularly backing up.
- I have not yet set up the maintenance tasks. Good point; I will make sure to use them going forward.
Thanks for the "smartctl -x" command. I don't see anything being flagged, but I found a couple of lines on the Seagate Ironwolf (the other drive is a WD Red CMR) that might be problematic:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   117   099   006    -    166373216
  3 Spin_Up_Time            PO----   096   096   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    164
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   083   060   030    -    201372729
  9 Power_On_Hours          -O--CK   074   074   000    -    22879
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    164
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
189 High_Fly_Writes         -O-RCK   054   054   000    -    46
190 Airflow_Temperature_Cel -O---K   063   052   045    -    37 (Min/Max 24/37)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    91
193 Load_Cycle_Count        -O--CK   100   100   000    -    193
194 Temperature_Celsius     -O---K   037   048   000    -    37 (0 20 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
The raw read and seek error rates look high, but the normalized VALUE for each is well above its THRESH, which suggests the raw numbers are just the drive's internal encoding rather than a real problem. The WD drive did not show anything similar.
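To make that comparison concrete, here is a minimal Python sketch that parses a few rows copied from the table above and checks VALUE and WORST against THRESH, which is how smartctl decides whether an attribute is failing. The function names are made up for illustration. The `split_seagate_raw` helper rests on an assumption that is not confirmed anywhere in this thread: a commonly cited community reading of Seagate raw values, in which the low 32 bits count operations and the bits above them count errors.

```python
# Sketch only: helper names are hypothetical, rows are copied from the post.
SAMPLE_ROWS = [
    "1 Raw_Read_Error_Rate POSR-- 117 099 006 - 166373216",
    "5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0",
    "7 Seek_Error_Rate POSR-- 083 060 030 - 201372729",
]


def parse_smart_row(row):
    """Split one attribute row into its whitespace-separated columns."""
    parts = row.split()
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "flags": parts[2],
        "value": int(parts[3]),   # normalized value, higher is better
        "worst": int(parts[4]),   # lowest normalized value seen so far
        "thresh": int(parts[5]),  # failure threshold (0 means "never fails")
        "fail": parts[6],
        "raw": int(parts[7]),     # first token of the vendor-specific raw value
    }


def failing_now(attr):
    """An attribute is failing when VALUE has dropped to or below THRESH."""
    return attr["thresh"] > 0 and attr["value"] <= attr["thresh"]


def failed_in_past(attr):
    """An attribute failed earlier when WORST has touched THRESH."""
    return attr["thresh"] > 0 and attr["worst"] <= attr["thresh"]


def split_seagate_raw(raw):
    """ASSUMPTION (community interpretation, not vendor documentation):
    low 32 bits = operation count, higher bits = error count."""
    return raw >> 32, raw & 0xFFFFFFFF


if __name__ == "__main__":
    for row in SAMPLE_ROWS:
        attr = parse_smart_row(row)
        status = "FAILING" if failing_now(attr) else "ok"
        errors, operations = split_seagate_raw(attr["raw"])
        print(attr["name"], status, "assumed errors:", errors,
              "assumed operations:", operations)
```

Under that assumed encoding, both large raw values decode to zero errors, which would be consistent with the healthy normalized values. Treat that as a plausibility check, not a diagnosis.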
Do you recommend a memory test? Because a number of other posts on the internet on btrfs point out that disk corruption is unlikely to cause these issues, as against RAM corruption.
StephenB
Oct 07, 2023, Guru - Experienced User
I don't see anything concerning in the Seagate SMART stats.
milind2021 wrote:
Do you recommend a memory test? Because a number of other posts on the internet on btrfs point out that disk corruption is unlikely to cause these issues, as against RAM corruption.
I'm not sure memory corruption is more likely, but there is no harm in running the memory test.
- milind2021, Oct 08, 2023, Aspirant
Here are a bunch of updates after I went on the btrfs IRC channel and got some guidance:
- All errors after the forced read-only state are spurious, because the in-kernel state of the FS differs from the on-disk state. Their advice was to ignore the parent transid error that came after the forced read-only state.
- The error seems to be in metadata rather than in data, which they said is worse than if it happened to data.
- The corrupt leaf issue seems to be one item (#287, as indicated by the kernel log) in the dump-tree of that block (again indicated in that kernel message), with its ref and gen both left-shifted by 4 bytes. Zygo on the IRC channel says that can happen with a RAM clock slip (off-by-one on a 32-bit machine).
- I was unable to unmount the volume in the forced read-only state, so I couldn't run scrub or check.
- I rebooted and it came up fine with RW mounting, but the corrupt leaf error still showed up in the kernel log.
- Scrub still fails when it encounters corruption, because this is host-level corruption (RAM etc), rather than device-level (so all FS copies of metadata are bad and hence scrub can't do anything).
- Check (readonly) reports a large number (100ish) of instances of about 3 different errors.
- I ran btrfs-check once with FS mounted RW, then rebooted into "volume read-only" mode where FS was mounted RO and then ran it, and then rebooted into "tech-support mode" where FS was unmounted and then ran it again. All three times it found errors, but the logs were mildly different.
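As a rough illustration of the "left-shifted by 4 bytes" symptom from the corrupt-leaf bullet above, the sketch below shows what a 64-bit field looks like when its contents end up 32 bits higher than they should be. It takes the description literally as a numeric shift; the sample value and the function names are hypothetical and are not taken from the actual dump-tree output.

```python
# Sketch only: illustrates a 4-byte (32-bit) left shift of a 64-bit field.
# HYPOTHETICAL_GENERATION is an invented value, not one from the real logs.
HYPOTHETICAL_GENERATION = 0x1234


def shift_left_bytes(value, nbytes=4, width_bits=64):
    """Shift value left by nbytes bytes, truncated to the field width."""
    mask = (1 << width_bits) - 1
    return (value << (8 * nbytes)) & mask


def looks_shifted(expected, observed, nbytes=4):
    """True if observed is exactly expected moved up by nbytes bytes."""
    return observed == shift_left_bytes(expected, nbytes)


if __name__ == "__main__":
    corrupted = shift_left_bytes(HYPOTHETICAL_GENERATION)
    print(hex(HYPOTHETICAL_GENERATION), "->", hex(corrupted))
    print("matches the 4-byte shift pattern:",
          looks_shifted(HYPOTHETICAL_GENERATION, corrupted))
```

A value that is far larger than the filesystem's current generation, yet becomes plausible again once shifted back by 32 bits, fits this pattern. That points at a word-sized slip somewhere in the memory path rather than at random bit rot on disk.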
I will upload the exact logs later in the day. Note that I'm not attempting to recover data; I only want to diagnose the root cause, if any. The next steps I'm taking are:
- Check the filesystem from a later version of btrfs-progs since v4.4 is quite old and meanwhile it has gotten a lot better at checking.
- Run a memory test.
I rebooted into memory test mode. It started with the "Backup" light blinking as it should, but about 2-5 minutes later the light stopped blinking and the Power, Disk1 and Disk2 LEDs lit up and stayed on solid. RAIDar didn't detect anything during this entire phase.
Does this look like a warranty issue? I really need to run a memory check...
- milind2021, Oct 08, 2023, Aspirant
I'm currently running OS v6.10.5 hotfix 1. Will updating to 6.10.8 help in fixing the memory error? Or even mitigate future btrfs errors?
- StephenB, Oct 08, 2023, Guru - Experienced User
milind2021 wrote:
Will updating to 6.10.8 help in fixing the memory error?
Not sure what the premature halting of the memory test means.
If all the LEDs are ON, then it is documented as a pass, but the test is supposed to run 8 hours, not just a couple of minutes. If the backup LED is off, then the LED status isn't documented in the hardware manual.
You could try to install a memory test program, and then use that to confirm the result.
But assuming it is a failure, unfortunately there's nothing you can do about it. It's a hardware failure, not software, and the memory is soldered onto the system board.
milind2021 wrote:
Or even mitigate future btrfs errors?
The release notes for OS 6.10.6 through 6.10.9 don't mention BTRFS.
Failing RAM can of course result in on-disk corruption, and upgrading the firmware can't prevent that.