RN3312 BTRFS operations are completely hung
We have owned the RN3312 for a bit over 6 months, and all was seemingly fine. However, things went downhill recently, and now pretty much the entire BTRFS partition is completely unusable. Even leaving the NAS offline so it can finish whatever internal metadata cleanup it needs has not been enough for it to recover in a reasonable time.

What has happened is a combination of Bit Rot Protection / COW + Compression + Snapshots being turned on, on a partition used for file backups and image backups (Veeam) of a single, large fileserver. BTRFS is NOT production ready for such a setup. I firmly believe this option should be removed from the UI, or a huge warning displayed.

Everything was going great until the first snapshots needed to be deleted, at which point I ran into the problem of btrfs-cleaner taking up 100% CPU. Symptoms: the admin UI would lock up on any file operation in certain directories. Directory accesses would hang forever, even over SMB. Of course all the backups to the NAS were timing out. I was eventually able to delete the snapshots by hard rebooting the system and removing them before btrfs-cleaner got too bad.

But now I have the problem where btrfs-transacti is taking up 100% CPU. I have left the system sitting offline for a week just spinning at 100% CPU (!), and there is no visible improvement - EVERY BTRFS operation still hangs, no matter what I try. There is little disk activity and it is not thrashing, which makes me think there is something wrong in the internals of BTRFS, or that the CPU is too underpowered to handle the amount of storage metadata operations.

    top - 12:30:29 up 1:39, 2 users, load average: 115.21, 112.48, 99.52
    Tasks: 334 total, 2 running, 332 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.6 us, 23.0 sy, 0.0 ni, 72.1 id, 1.6 wa, 0.0 hi, 2.7 si, 0.0 st
    KiB Mem: 8113792 total, 2673896 used, 5439896 free, 4404 buffers
    KiB Swap: 2093052 total, 0 used, 2093052 free. 1980036 cached Mem

     PID USER  PR NI   VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
    3740 root  20  0      0    0    0 R 100.0  0.0 93:52.62 btrfs-tran+
       1 root  20  0 136632 6868 5144 S   0.0  0.1  0:02.45 systemd
       2 root  20  0      0    0    0 S   0.0  0.0  0:00.00 kthreadd

    admin@archive:/data$ iostat
    Linux 4.4.68.x86_64.1 (archive) 07/11/2017 _x86_64_ (4 CPU)

    avg-cpu:  %user  %nice %system %iowait %steal  %idle
               0.55   0.00   25.73    1.56   0.00  72.15

    Device:   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
    sda      3.26      55.82      57.19   337578   345828
    sdb      3.25      55.07      57.08   333044   345196
    sdc      3.45      74.32      57.02   449448   344808
    sdd      3.28      53.95      57.21   326224   345936
    sde      3.26      57.73      57.09   349104   345252
    sdf      3.28      54.89      56.87   331951   343908
    md0      1.68      27.57      39.28   166740   237520
    md1      0.02       0.19       0.00     1172        0
    md127    9.38     243.50      69.42  1472516   419788   < not a lot of activity...

I have tried starting a balance to fix fragmentation, but I believe there are operations blocking it inside the kernel. Even at -dusage=0, I gave up after giving it the weekend to do its thing. Trying to look for evidence of fragmented files is horrendously slow. But it is very bad now:

    admin@archive:/data$ ls
    *** hangs forever ***

My hope at this point is to try and mount the system read-only and recover data onto a USB drive. The share with data is around 8 TB, which might just fit after a couple of days/weeks? of copying... Then I need to figure out some way to drop the share and rebuild it without selecting the 'Bit Rot Protection' or 'Compression' options. Hopefully I don't have to resort to copying the NAS to something else and wiping it - there is about 14 TB of data on it currently, and I don't have that much capacity available anywhere else...

After going through this and after lots of research, I see lots of horror stories showing that BTRFS is extremely fragile and not ready for prime time. I believe it is reckless for Netgear to base a NAS on such an unproven FS. The features are not worth it if they explode in spectacular fashion after a couple of months.
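For anyone weighing the same read-only copy to a USB drive, the "days/weeks?" guess for the 8 TB share can be sanity-checked with rough arithmetic. The sketch below is only that: the throughput figures are hypothetical assumptions (nothing in this thread measured a real copy rate), the function name `copy_time_days` is made up for illustration, and it uses decimal units (1 TB = 10^12 bytes).

```python
def copy_time_days(size_tb: float, throughput_mb_s: float) -> float:
    """Days needed to copy size_tb terabytes at a sustained rate in MB/s.

    Uses decimal units: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.
    """
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (size_tb * 1e12) / (throughput_mb_s * 1e6)
    return seconds / 86400  # seconds per day


if __name__ == "__main__":
    # Hypothetical sustained rates; a struggling filesystem may be far slower.
    for rate_mb_s in (10, 30, 100):
        days = copy_time_days(8, rate_mb_s)
        print(f"{rate_mb_s:>4} MB/s: {days:.1f} days")
```

At a sustained 100 MB/s the 8 TB share would take a little under a day, while at 10 MB/s it would take a bit over nine days, so "days" versus "weeks" depends almost entirely on how fast the hung filesystem can actually serve reads.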
Symptoms include btrfs-transacti and btrfs-endio-wri taking up a lot of CPU time (in spikes, possibly triggered by syncs). You can use filefrag to locate heavily fragmented files (may not work correctly with compression). ...

"a balance on 2TB of data that was heavily snapshotted - it took 3 months"

"when I have to do balances ... I delete all the snapshots and allow a few months for the balance to finish"

https://btrfs.wiki.kernel.org/index.php/Gotchas

We are running version 6.7.4. We currently have 6 x 8 TB in X-RAID (certified drives). I struggle to think what would happen if we filled up all 12 slots...

Are there any other operations anyone from support wants to try before I start wiping it? Unfortunately our 90-day free support expired before any of this happened, so I am left venting in public...
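On the filefrag suggestion quoted above: filefrag prints one summary line per file in the form "path: N extents found", so saved output can be ranked to find the worst offenders. Below is a minimal sketch, assuming the output was captured to text beforehand. The helper names `parse_filefrag` and `most_fragmented`, the 1000-extent threshold, and the sample paths are all hypothetical, and as the quoted text warns, the counts may not be meaningful on compressed files.

```python
import re
from typing import List, Tuple

# Matches filefrag summary lines such as "/data/x.vbk: 12 extents found"
# (and the singular form "1 extent found").
_SUMMARY = re.compile(r"^(?P<path>.+): (?P<count>\d+) extents? found$")


def parse_filefrag(output: str) -> List[Tuple[str, int]]:
    """Return (path, extent_count) pairs, most fragmented first."""
    results = []
    for line in output.splitlines():
        match = _SUMMARY.match(line.strip())
        if match:  # silently skip anything that is not a summary line
            results.append((match.group("path"), int(match.group("count"))))
    return sorted(results, key=lambda item: item[1], reverse=True)


def most_fragmented(output: str, min_extents: int = 1000) -> List[Tuple[str, int]]:
    """Keep only files with at least min_extents extents (threshold is arbitrary)."""
    return [item for item in parse_filefrag(output) if item[1] >= min_extents]


if __name__ == "__main__":
    # Made-up sample output for illustration only.
    sample = (
        "/data/backup/full.vbk: 52000 extents found\n"
        "/data/docs/readme.txt: 1 extent found\n"
        "/data/backup/incr.vib: 1500 extents found\n"
    )
    for path, count in most_fragmented(sample):
        print(f"{count:>8}  {path}")
```

Parsing saved text rather than calling filefrag from the script is deliberate: on a filesystem where even ls hangs, it seems safer to collect the output separately and analyze it offline.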
kevinb_vr
Jul 12, 2017 Place Use your ReadyNAS