Forum Discussion
mgratton
Mar 05, 2019 · Follower
device is failing?
Hello, I have two RN21400s. One is working flawlessly; the other (two days before support ended) isn't working anymore. I got an out-of-memory +324 code and couldn't restart the device, so I power cycled it...
Hopchen
Mar 05, 2019 · Prodigy
I took a look at the logs. Thanks for sending them over. Here are my thoughts on the issue.
The RAIDs are OK. All running fine.
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md127 : active raid5 sda3[0] sdd3[3] sdc3[2] sdb3[1]
8776250496 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md1 : active raid6 sda2[0] sdd2[3] sdc2[2] sdb2[1]
1047424 blocks super 1.2 level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
4190208 blocks super 1.2 [4/4] [UUUU]
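On a live unit you would read /proc/mdstat directly (cat /proc/mdstat). As a minimal sketch of how to check those health markers mechanically - using a copy of the output above rather than a real /proc/mdstat - an array is degraded when its status string contains an underscore (e.g. [UU_U]) instead of all U's:

```shell
# Copy of the mdstat status lines quoted above (illustrative stand-in
# for `cat /proc/mdstat` on the NAS itself)
mdstat='md127 : active raid5 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      8776250496 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md1 : active raid6 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      1047424 blocks super 1.2 level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      4190208 blocks super 1.2 [4/4] [UUUU]'

# A degraded array shows an underscore in its status, e.g. [UU_U].
# Count lines with a degraded marker (0 here: all arrays healthy).
degraded=$(printf '%s\n' "$mdstat" | grep -c '\[U*_U*\]' || true)
echo "degraded arrays: $degraded"
```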
Disks are healthy. No errors on any of them and no disk I/O errors in the kernel log.
RAID volume mounts fine.
/dev/md127 on /data type btrfs (rw,noatime,nodiratime,nodatasum,nospace_cache,subvolid=5,subvol=/)
NAS is reading the volume info. All good. I don't see any obvious signs of data corruption.
Total devices 1 FS bytes used 3.60TiB
devid 1 size 8.17TiB used 4.17TiB path /dev/md127
=== filesystem /data ===
Data, single: total=4.17TiB, used=3.60TiB
System, DUP: total=8.00MiB, used=544.00KiB
Metadata, DUP: total=1.00GiB, used=652.53MiB
GlobalReserve, single: total=73.44MiB, used=32.44MiB
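That Data line (total vs. used) also tells you how much allocated-but-unused space BTRFS is holding. A small sketch of extracting it, again from a copy of the quoted line rather than a live `btrfs filesystem df /data`:

```shell
# Copy of the Data line from the `btrfs filesystem df`-style output above
data_line='Data, single: total=4.17TiB, used=3.60TiB'

# Pull out total and used, then compute the allocated-but-unused
# Data space (total - used) in TiB.
slack=$(printf '%s\n' "$data_line" |
  sed 's/.*total=\([0-9.]*\)TiB, used=\([0-9.]*\)TiB/\1 \2/' |
  awk '{ printf "%.2f", $1 - $2 }')
echo "unused allocated Data space: ${slack}TiB"
```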
Update history looks clean. I don't think the issue is anything to do with corrupt firmware on the NAS.
We have lots of stack traces (essentially crashes) from the kernel though. The unit is running out of memory while handling filesystem (BTRFS) operations.
The btrfs-cleaner is kicking in, trying to "clean up" - i.e. it attempts to remove the data you asked to be deleted when you removed the share. Quota accounting (qgroup) is also trying to recalculate at the same time.
Mar 04 09:23:20 kernel: btrfs-cleaner invoked oom-killer: gfp_mask=0x2400840, order=0, oom_score_adj=0
Mar 04 09:23:20 kernel: btrfs-cleaner cpuset=/ mems_allowed=0
Mar 04 09:23:20 kernel: CPU: 2 PID: 1998 Comm: btrfs-cleaner Tainted: PO 4.4.157.alpine.1 #1
Mar 04 09:23:20 kernel: Hardware name: Annapurna Labs Alpine
Mar 04 09:23:20 kernel: [<c0014690>] (unwind_backtrace) from [<c0011590>] (show_stack+0x10/0x14)
Mar 04 09:23:20 kernel: [<c0011590>] (show_stack) from [<c035d294>] (dump_stack+0x7c/0x9c)
Mar 04 09:23:20 kernel: [<c035d294>] (dump_stack) from [<c00cb410>] (dump_header.constprop.5+0x44/0x174)
Mar 04 09:23:20 kernel: [<c00cb410>] (dump_header.constprop.5) from [<c0095c7c>] (oom_kill_process+0xe8/0x484)
Mar 04 09:23:20 kernel: [<c0095c7c>] (oom_kill_process) from [<c0096334>] (out_of_memory+0x2b8/0x2e8)
Mar 04 09:23:20 kernel: [<c0096334>] (out_of_memory) from [<c0099d54>] (__alloc_pages_nodemask+0x720/0x7b8)
Mar 04 09:23:20 kernel: [<c0099d54>] (__alloc_pages_nodemask) from [<c009326c>] (pagecache_get_page.part.6+0x148/0x1d0)
Mar 04 09:23:20 kernel: [<c009326c>] (pagecache_get_page.part.6) from [<c02a6aac>] (alloc_extent_buffer+0x1a4/0x3bc)
Mar 04 09:23:20 kernel: [<c02a6aac>] (alloc_extent_buffer) from [<c0279170>] (read_tree_block+0xc/0x44)
Mar 04 09:23:20 kernel: [<c0279170>] (read_tree_block) from [<c0259c9c>] (read_block_for_search.constprop.11+0x168/0x318)
Mar 04 09:23:20 kernel: [<c0259c9c>] (read_block_for_search.constprop.11) from [<c025b9d8>] (btrfs_search_slot+0x644/0x844)
Mar 04 09:23:20 kernel: [<c025b9d8>] (btrfs_search_slot) from [<c02e8120>] (find_parent_nodes+0xf4/0x69c)
Mar 04 09:23:20 kernel: [<c02e8120>] (find_parent_nodes) from [<c02e876c>] (btrfs_find_all_roots_safe+0xa4/0xe0)
Mar 04 09:23:20 kernel: [<c02e876c>] (btrfs_find_all_roots_safe) from [<c02e87f4>] (btrfs_find_all_roots+0x4c/0x70)
Mar 04 09:23:20 kernel: [<c02e87f4>] (btrfs_find_all_roots) from [<c02ec058>] (btrfs_qgroup_prepare_account_extents+0x5c/0x90)
Mar 04 09:23:20 kernel: [<c02ec058>] (btrfs_qgroup_prepare_account_extents) from [<c0280ffc>] (btrfs_commit_transaction+0x5a4/0xbf4)
Mar 04 09:23:20 kernel: [<c0280ffc>] (btrfs_commit_transaction) from [<c026a024>] (btrfs_drop_snapshot+0x4a8/0x6c0)
Mar 04 09:23:20 kernel: [<c026a024>] (btrfs_drop_snapshot) from [<c02809ec>] (btrfs_clean_one_deleted_snapshot+0xac/0xb8)
Mar 04 09:23:20 kernel: [<c02809ec>] (btrfs_clean_one_deleted_snapshot) from [<c0278428>] (cleaner_kthread+0xfc/0x1b8)
Mar 04 09:23:20 kernel: [<c0278428>] (cleaner_kthread) from [<c0036c44>] (kthread+0xf4/0x104)
Mar 04 09:23:20 kernel: [<c0036c44>] (kthread) from [<c000e920>] (ret_from_fork+0x14/0x34)
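If you want to check for this yourself, on a live system you would scan the kernel log (e.g. dmesg | grep -i 'invoked oom-killer'). A minimal sketch of pulling out which process triggered the OOM killer, using a copy of the first log line above:

```shell
# Copy of one kernel log line quoted above (illustrative stand-in for
# `dmesg` output on the NAS)
log='Mar 04 09:23:20 kernel: btrfs-cleaner invoked oom-killer: gfp_mask=0x2400840, order=0, oom_score_adj=0'

# Extract the name of the process that invoked the OOM killer.
victim=$(printf '%s\n' "$log" | sed 's/.*kernel: \([^ ]*\) invoked oom-killer.*/\1/')
echo "process that triggered OOM: $victim"
```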
You deleted a share (probably a rather significant amount of data). The filesystem started the process of removing the data and the NAS crashed.
This happens over and over. BTRFS keeps track of unfinished operations, so every time you boot the NAS the filesystem realises it never finished the "cleaning up" action, starts the process again, and crashes again.
You have BTRFS quotas enabled. Quotas show you handy things like share sizes and snapshot sizes. When deleting or changing a large amount of data (especially if the share contained snapshots), the recalculation done by the quota feature can be really taxing on the system.
The BTRFS cleaner must be able to run to finish the deletion of that share. Quotas will also be trying to recalculate the space usage on top of this, and that probably overloads the unit. Quotas have been notorious for this type of behaviour and I think that is what is happening here.
It becomes an endless loop: every time you boot the NAS there is a race between the btrfs-cleaner process and the quota calculations, which quickly eats up the memory. There is therefore no point in repeatedly rebooting it - it won't work and could even induce some filesystem corruption.
=== Options for you ===
If you boot the NAS into Read-Only mode you will be able to access the data and do a backup if needed. Read-Only mode should work because the filesystem cannot make any changes in this mode and thus the NAS will be stable.
You can boot into Read-Only mode via the boot menu: https://kb.netgear.com/22891/How-do-I-access-the-boot-menu-on-my-ReadyNAS-104-204-214-or-314
I reckon that a manual read/write mount of the data volume, followed immediately by disabling quotas, will likely be enough to let the btrfs-cleaner finish, and the unit should recover. However, unless you are very comfortable with Linux you would probably need NETGEAR to do this, which means a data recovery contract costing a couple of hundred dollars. It is not particularly difficult to do but it does require some expertise. This case would have fallen outside normal support anyway, so it doesn't matter that your general support contract expired.
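For readers comfortable with Linux, that procedure roughly corresponds to the commands below. This is a dry-run sketch, not an exact NETGEAR procedure: the device path /dev/md127 and mount point /data are taken from the logs above, and with DRY_RUN=1 (the default here) the script only prints what it would do. Do not run it for real on a unit whose data you are not prepared to lose.

```shell
#!/bin/sh
# Dry-run sketch of the recovery described above: mount the data volume
# read/write, then immediately disable BTRFS quotas so btrfs-cleaner can
# finish the deletion without the qgroup rescan competing for memory.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"   # dry-run: print instead of executing
  else
    "$@"
  fi
}

run mount -t btrfs /dev/md127 /data  # device/mount point from the logs above
run btrfs quota disable /data        # stop qgroup accounting straight away
run btrfs subvolume sync /data       # wait for deleted subvolumes to be cleaned
```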
If you have an up-to-date backup you can factory reset the NAS and start over. You can also boot into Read-Only mode and take a backup if needed.
Quotas are fine to run with, but they are a bit notorious amongst BTRFS users. In the future, if you are doing a big deletion - a whole share, or cleaning up a large batch of snapshots - turn off quotas first -> do the deletion -> wait a bit, then re-enable quotas. That prevents quotas and btrfs-cleaner running on top of each other.
The above assessment is, of course, based on a log set only, but seeing those stacks I strongly suspect quotas are the real culprit here.
Any questions - let me know. Cheers