

ronlaws86
Guide

btrfs corruption AGAIN

So I noticed today, after trying to write a file to the NAS, that it was read-only (access denied); I could not even edit files from the admin page.

 

Sure enough, I went in over SSH to peek at the dmesg output, and this message filled my scrollback buffer in seconds:

[3740002.200944] BTRFS error (device md127): bad tree block start 9800114264141725795 18105397411840
[3740002.201191] BTRFS error (device md127): bad tree block start 9800114264141725795 18105397411840
[3740122.654060] BTRFS error (device md127): bad tree block start 9800114264141725795 18105397411840
[3740122.654306] BTRFS error (device md127): bad tree block start 9800114264141725795 18105397411840
[3740243.107223] BTRFS error (device md127): bad tree block start 9800114264141725795 18105397411840
[3740243.107472] BTRFS error (device md127): bad tree block start 9800114264141725795 18105397411840
[3740363.560247] BTRFS error (device md127): bad tree block start 9800114264141725795 18105397411840

I've truncated it, but this message is repeated more times than I can scroll up to view the start.

I also noticed that the weekly scrubs have been failing, suggesting this issue started several weeks back.
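
For reference, the result of the most recent scrub can also be checked over SSH with something like the following (a rough sketch; it assumes the data volume is mounted at /data, so adjust the path to your setup):

# show when the last btrfs scrub ran and whether it found errors
btrfs scrub status /data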

 

For now I am leaving it powered on in this odd state of limbo; the last time this happened I rebooted it and lost my RAID5 array.

 

 

...Seriously, how can anybody consider this filesystem production ready? This is the 8th time in 2 years now.

Model: RN214|4 BAY Desktop ReadyNAS Storage
Message 1 of 26


All Replies
StephenB
Guru

Re: btrfs corruption AGAIN


@ronlaws86 wrote:

 

I also noticed that the weekly scrubs have been failing, suggesting this issue started several weeks back.

 


Did you get any email alerts on the scrubs failing?

 

Also I am wondering if you are seeing issues in the SMART stats for any of the disks.

Message 2 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

Hi Stephen.

 

Yes, on the 13th I got an e-mail saying the scrub had failed.

 

Here is a pastebin of the smartctl output

 

I see no reallocated sectors, though the read/write error counts seem a bit disconcerting (apparently normal for Seagate drives? wth). These were brand new Seagate Barracuda drives though, purchased at the same time as the NAS units.
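
For anyone checking the same counters, they can be pulled straight from the SMART attribute table with something like this (the device name is just an example; the NAS disks are typically /dev/sda through /dev/sdd):

# dump the SMART attributes and pick out the counters that matter here
smartctl -A /dev/sda | egrep 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|Raw_Read_Error'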

 

 

On a side note: I ran a manual scrub overnight via SSH, after shutting down all services on the NAS and unmounting the array, by issuing

echo repair > /sys/block/md127/md/sync_action

and according to dmesg, it completed without error.  

[3742947.701351] md: requested-resync of RAID array md127
[3742947.701361] md: minimum _guaranteed_  speed: 30000 KB/sec/disk.
[3742947.701368] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
[3742947.701377] md: using 128k window, over a total of 1948664832k.
[3764596.073548] md: md127: requested-resync done.
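
(For anyone repeating this: progress and results of the repair pass can be watched while it runs; a rough sketch, assuming the data array is md127 as above:)

# live progress of the resync/repair
cat /proc/mdstat
# number of mismatched blocks the pass found/corrected
cat /sys/block/md127/md/mismatch_cnt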

 

Message 3 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

Quick update: 

I rebooted the NAS after verifying the array was idle and no scans were going on; sure enough the array has come back up, however the filesystem is totally dead.

root@INT-NAS-1:~# mount /dev/md127 /data
mount: wrong fs type, bad option, bad superblock on /dev/md127,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
root@INT-NAS-1:~# btrfsck --repair /dev/md127
enabling repair mode
bytenr mismatch, want=18105397379072, have=16016835313664
ERROR: cannot read chunk root
ERROR: cannot open file system
root@INT-NAS-1:~# 

dmesg: 

[  154.994060] BTRFS info (device md127): has skinny extents
[  154.995265] BTRFS critical (device md127): unable to find logical 513211334656 len 4096
[  154.995278] BTRFS critical (device md127): unable to find logical 513211334656 len 4096
[  154.995321] BTRFS critical (device md127): unable to find logical 513211334656 len 4096
[  154.995330] BTRFS critical (device md127): unable to find logical 513211334656 len 4096
[  154.995358] BTRFS critical (device md127): unable to find logical 513211334656 len 4096
[  154.995367] BTRFS critical (device md127): unable to find logical 513211334656 len 4096
[  154.995384] BTRFS error (device md127): failed to read chunk root
[  155.057373] BTRFS error (device md127): open_ctree failed

 

Message 4 of 26
mdgm-ntgr
NETGEAR Employee Retired

Re: btrfs corruption AGAIN

Please send us your logs (see the Sending Logs link in my sig)

Message 5 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

Logs have been sent.

Message 6 of 26
mdgm-ntgr
NETGEAR Employee Retired

Re: btrfs corruption AGAIN

I can see that you've been using SSH, so I can't rule out that contributing to the problem.

Also, care needs to be taken when running destructive commands like btrfsck --repair. If it's a problem that a repair won't fix, running that command may bake in the problem, making any subsequent data recovery attempt much less likely to succeed.
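
For example, a read-only check is a safer first step than --repair, and btrfs restore can attempt to copy data off a volume that won't mount without writing to it. A rough sketch, assuming the data volume is /dev/md127 and an external disk is mounted at /mnt/usb:

# report problems without modifying the file system
btrfs check --readonly /dev/md127

# attempt to copy files out of the damaged file system to another disk
btrfs restore /dev/md127 /mnt/usb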

 

It's very odd that you've run into this problem so many times. Have you tried running the memory (RAM) test boot menu option?

Message 7 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

Thanks for the reply; I've not tried a RAM test yet, no, but I'm curious what exactly, if anything, I've done that would cause btrfs to completely die of its own accord. I generally only go in there to make non-invasive changes to services, run an iperf test between the units, tweak a share, or set up SSH keys between the 2 units. I don't go in to tweak anything related to btrfs or run any commands against it. These drives spend most of their time running Frontview backup jobs on an hourly basis to capture data for backups. Data loss isn't a huge concern, since they are backed up and are themselves actually a backup of another data source; it's just annoying that this seems to happen roughly every few months, and this time the drive wasn't even at more than 60% capacity when I looked.

 

Edit: I also want to point out that the file system modules in the kernel seem to have issues. On one of the units I rotate a USB HDD for offsite backups on a nightly basis (swap, take one home), and I found that if I use EXT4 as the file system, the NAS hangs with an mb_cache_entry_g error (see attached photo). After weeks of head scratching, this issue stopped when I formatted the drive to btrfs, so I'm more likely to suspect the kernel modules are to blame for random filesystem death than my benign use of SSH.

Message 8 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

The RAM test did not produce any errors. I am considering a full factory reset; however, it would be nice to figure out why this keeps happening.

Message 9 of 26
StephenB
Guru

Re: btrfs corruption AGAIN

Do you sometimes experience power issues that result in an unclean shutdown of the NAS? 

 

That could result in an out-of-sync RAID array. Normally I'd expect a different symptom (inactive volume), but it might be possible that lost writes could result in a corrupted file system.

Message 10 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

Now that you mention it, there had been a couple of occasions where the NAS locked up (in this instance unrelated to the above issue with the other unit and a USB drive), and in those situations a forced power off is the only option (power button unresponsive, no network access, etc.), though there were no immediate issues upon reboot. Besides that, no; the unit is connected to a UPS supply beside all the company servers.

Message 11 of 26
StephenB
Guru

Re: btrfs corruption AGAIN


@ronlaws86 wrote:

Now that you mention it, there had been a couple of occasions where the NAS locked up (in this instance unrelated to the above issue with the other unit and a USB drive), and in those situations a forced power off is the only option (power button unresponsive, no network access, etc.), though there were no immediate issues upon reboot.


That might be the cause.    Though I've sometimes had to do the same, with no ill effects.

Message 12 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

The filesystem is totally unrecoverable, so I've done a full factory reset; I also removed the drives and wiped the partition table to force it to start completely from scratch. Currently waiting for the first sync to complete before I go adding shares and setting up the backups.

 

One observation I do have is that this issue seems, in my experience, to be specific to this model. At home I have a 104 which has been running for years and never had issues like this, even with a mostly full file system. It has also never crashed reading/writing EXT4 USB drives, of which there are plenty plugged into it. So perhaps there's actually a bug in the kernel for the 214 that's causing these odd glitches and file system death; if I use a USB HDD formatted as EXT4 on this model with an overnight backup, at some point it will hang mid-transfer.

Message 13 of 26
Ikalou
Aspirant

Re: btrfs corruption AGAIN

The same thing happened to me today. All the shares now appear empty. I never used SSH and didn't do anything unusual. Using the latest firmware (6.9.3).

 

[Mon Jun  4 12:07:15 2018] BTRFS critical (device md127): unable to find logical 13599006130176 len 4096
[Mon Jun  4 12:07:15 2018] BTRFS critical (device md127): unable to find logical 13599006130176 len 4096
[Mon Jun  4 12:07:15 2018] BTRFS info (device md127): no csum found for inode 8749 start 406678265856
[Mon Jun  4 12:07:15 2018] BTRFS critical (device md127): unable to find logical 13599006130176 len 4096
[Mon Jun  4 12:07:15 2018] BTRFS critical (device md127): unable to find logical 13599006130176 len 4096
[Mon Jun  4 12:07:15 2018] BTRFS critical (device md127): unable to find logical 13599006130176 len 4096
[Mon Jun  4 12:07:15 2018] BTRFS critical (device md127): unable to find logical 13599006130176 len 4096
[Mon Jun  4 12:07:15 2018] BTRFS critical (device md127): unable to find logical 13599006130176 len 4096
[Mon Jun  4 12:07:15 2018] BTRFS critical (device md127): unable to find logical 13599006130176 len 4096
[Mon Jun  4 12:07:15 2018] BTRFS critical (device md127): unable to find logical 13599006130176 len 4096

I'm going to try to run a scrub, but I think I'll have to reinstall the NAS.

Model: RN21200|ReadyNAS 212 Series 2-Bay (Diskless)
Message 14 of 26
mdgm-ntgr
NETGEAR Employee Retired

Re: btrfs corruption AGAIN

The volume maintenance options in the GUI are not the way to deal with problems like this.

 

If you don't have an up-to-date backup and need a data recovery attempt, you could contact support.

Message 15 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

So, a quick update to this. I know it's been a year since my last post, but since this shows up in Google now, I may as well put a closing comment.

Since my last post, volume failures have continued, and I've pretty much just come to accept this as a quirk of bad implementation on Netgear's part: these devices are simply unreliable, given the well-known failings of BTRFS as a file system in general everywhere else in the Linux community. I really wish the devices used XFS, but alas, we're stuck with the poor design choices Netgear gave us, short of hacking and flashing them with something else.

In subsequent failures, I've not even bothered SSHing in to the devices and used only factory-provided tools (excluding SSH, which is still factory provided too, btw), relying instead on the regular backup options as well as ReadyDR. The file system still crashes, even at around 90% capacity, which makes the remaining space a huge waste of otherwise perfectly usable free space.

Regular volume housekeeping has always been in place: weekly scrubs, monthly defrags, etc.
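
(For context, that schedule corresponds roughly to the following manual operations; a sketch only, assuming the volume is mounted at /data:)

# scrub: verify checksums and repair from redundant copies where possible
btrfs scrub start -B /data

# defrag: recursively defragment files on the volume
btrfs filesystem defragment -r /data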

Currently in the process of switching the disks over from Seagate Barracudas to WD Reds with 2x the capacity, to hopefully mitigate the volume-almost-full self-destruct issue that shouldn't exist in the first place (on any sane file system).

But if even this fails to help and I end up once again with a busted volume later down the line, my advice to the general populace at the moment would be: "Don't use these devices for anything mission critical where you expect the free space to become limited. If you do, don't trust the RAID configurations and stick to single-disk shares, as bugs in BTRFS regarding RAID will likely leave you with a busted volume 6 months down the line."
Until Netgear either fix the bugs in BTRFS (unlikely) or switch to a more mature and reliable filesystem (like XFS), these NAS drives are volatile at best.


Message 16 of 26
JohnCM_S
NETGEAR Employee Retired

Re: btrfs corruption AGAIN

Hi ronlaws86,

 

We have released ReadyNASOS 6.10.2-T49 (Beta 1) firmware, which includes improved BTRFS stability on ARM units (102/104/2120/202/204/212/214). This may lessen the file system issues on those units (inactive volume or volume offline). You may try that firmware and see if there is an improvement.

 

You can download it here: https://community.netgear.com/t5/ReadyNAS-Beta/ReadyNASOS-6-10-2-T49-Beta-1/m-p/1792843#M10659

 

Regards,

Message 17 of 26
Sandshark
Sensei

Re: btrfs corruption AGAIN

The 104 does have limited memory.  Are you running any apps that might use that up and cause the lock-ups?  They may be related -- the lock-up directly causing the corruption or the hard reset doing so.  Read errors on the drive can also cause lock-ups.  I have seen that in some of my experiments, where I often use some old drives I retired due to too many errors.

 

As for your comment on BTRFS stability, I see a lot about BTRFS RAID instability, but the ReadyNAS uses MDADM RAID with BTRFS on top to avoid that. I have had only one volume go bad on me across multiple machines (excluding the experiments where a known bad drive or my shenanigans likely caused it), and that was on an EDA500 where the cable had come partly loose. But my only ARM machine is a 102 I have just for experiments. Maybe the latest improvements for ARM will make your experience more like mine, which is to say, trouble-free.
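
(That layering is easy to see over SSH; a rough sketch, with md127 being the usual name of the data array:)

# the RAID layer is plain mdadm...
mdadm --detail /dev/md127
# ...and btrfs sits on top, seeing only the single md device
btrfs filesystem show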

Message 18 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

I have a 104 at home and that's actually been solid, though it sees little to no traffic compared to the work ones (214s), which seem to have issues in cycles (different ARM chip also). I did also swap the RAM module for a larger one a while back (fun fact: it's the same SODIMM module as a laptop).

 

For the last year, neither NAS has run any apps or other software. Their sole purpose at this point is just as backup devices, so NFS/CIFS and ReadyDR are the only services active.

 

Unit 2, which sits in another building, doesn't even share data. Its only job is to use ReadyDR to pull data from unit 1 and store that for redundancy. I don't even use the regular backup function for this, just ReadyDR, so it's become effectively a snapshot store. That failed about 3 weeks ago when the drive filled.
This is a good read on where BTRFS went wrong

 

Edit/PS: yes, I'm aware it uses MDADM for the underlying RAID management. That is the only part that doesn't fail, and at no time, even to this day, were there any disk errors; it's been purely issues with the BTRFS volume. I will try the firmware linked above on one of the units.

Message 19 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

6 month mark and it's happened again:

[Mon Oct  7 08:15:56 2019] BTRFS critical (device dm-0): unable to find logical 6828490522624 len 4096
(repeated over and over)

This is clearly a BTRFS volume issue and not an MDADM array issue. All drives in this array are less than a month old and are brand new WD Reds.
Message 20 of 26
DEADDEADBEEF
Apprentice

Re: btrfs corruption AGAIN

What firmware are you on? What's your workflow? It seems weird that you get so much btrfs corruption... Hard shutdowns? Anything out of the ordinary there?

Message 21 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

It's the latest stable release. No shutdowns; the NAS is in a server rack plugged into a UPS, and it pretty much never gets turned off. It's been running for months since the last update. The corruption just occurs at some point around the 6 month mark.

 

Workflow-wise, it's just a backup unit: it takes in nightly backups from a standalone VMware server, and throughout the day it takes snapshots of staff shares every 2 hours for a period of 12 hours, then archives the snapshots via ReadyDR off to an identical unit in another building (which also has the same 6 month self-death problem).

Once I'm finished retrieving the data off it (it's currently in read-only mode), I'm going to wipe the volume and try once more, without quotas this time.
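
(If anyone wants to do the same from SSH rather than the GUI, quota tracking can be turned off on a mounted volume with something like this; the mount point is an example:)

# disable btrfs quota/qgroup accounting on the mounted volume
btrfs quota disable /data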

Message 22 of 26
StephenB
Guru

Re: btrfs corruption AGAIN


@ronlaws86 wrote:

It's the latest stable release. No shutdowns; the NAS is in a server rack plugged into a UPS, and it pretty much never gets turned off. It's been running for months since the last update. The corruption just occurs at some point around the 6 month mark.

 

Workflow-wise, it's just a backup unit: it takes in nightly backups from a standalone VMware server, and throughout the day it takes snapshots of staff shares every 2 hours for a period of 12 hours, then archives the snapshots via ReadyDR off to an identical unit in another building (which also has the same 6 month self-death problem).


It's weird that you have two of these misbehaving in the same way. FWIW, I've never seen this on my own ReadyNAS, and I've been running OS-6 since 2013.

 

The only activity on the remote NAS is ReadyDR (where it is only the destination)?  

Is ReadyDR used to back up LUNs, Shares, or both?

 

How much free space is on the two data volumes?

 

You might want to try 6.10.2 on one or both of these NAS units (since there are some btrfs fixes for ARM NAS listed in its release notes).

Message 23 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

Shares only. I've not used LUNs since the last corruption case, to rule that out as a cause.


In this particular/recent case, we're talking 10% +/- used (9.45TB array, only 1.45TB in use). Snapshots should've taken it to about 40%, more or less, but they have all vanished.

 

Edit: And yes, the remote NAS is a destination device only; no direct read/write operations outside of backups.

I too find it rather odd. The 104 I've had for years now at home has never had this problem; the array running on it has been the same volume for years and has on occasion come near to full capacity, and I use it for everything from web server storage and VMware LUNs to regular data storage and multimedia/DLNA, etc.

 

 

Message 24 of 26
ronlaws86
Guide

Re: btrfs corruption AGAIN

Just an update: I uploaded the above beta firmware to that device and rebooted it, and the array has come back online. I was preparing to have to start from scratch, but interestingly this time it came back online without doing what I'd have expected previous versions to do and showing a grey/dead/inactive volume.

Message 25 of 26