
Forum Discussion

Matthias1111
Aspirant
Jan 10, 2021
Solved

Volume dead or inactive after balancing - works with read-only

Hello,

my ReadyNAS 104 reports that the volume is inactive or dead after I performed "balancing disks". The data volume is accessible in read-only mode (via the boot menu), but I am really concerned about the root cause, since all disks (3 HDDs in RAID 5) are reported healthy.

 

Is there anybody who can take a look at the logs (which I would send via PM)? I would really appreciate that.

Thanks in advance 

Matthias

  • Thanks for the logs Matthias1111 


    The NAS experienced several out-of-memory conditions, which likely caused the crash in the end.
    The OOM kills also seem to be induced by quota calculation. Example below. This happened over and over, by the way.

    Jan 10 00:12:18 NAS kernel: Hardware name: Marvell Armada 370/XP (Device Tree)
    Jan 10 00:12:18 NAS kernel: [<c0015270>] (unwind_backtrace) from [<c001173c>] (show_stack+0x10/0x18)
    Jan 10 00:12:18 NAS kernel: [<c001173c>] (show_stack) from [<c03849d0>] (dump_stack+0x78/0x9c)
    Jan 10 00:12:18 NAS kernel: [<c03849d0>] (dump_stack) from [<c00d5e20>] (dump_header+0x4c/0x1b4)
    Jan 10 00:12:18 NAS kernel: [<c00d5e20>] (dump_header) from [<c00a09a0>] (oom_kill_process+0xd0/0x45c)
    Jan 10 00:12:18 NAS kernel: [<c00a09a0>] (oom_kill_process) from [<c00a10b0>] (out_of_memory+0x310/0x374)
    Jan 10 00:12:18 NAS kernel: [<c00a10b0>] (out_of_memory) from [<c00a49d4>] (__alloc_pages_nodemask+0x6e0/0x7dc)
    Jan 10 00:12:18 NAS kernel: [<c00a49d4>] (__alloc_pages_nodemask) from [<c00cb4c0>] (__read_swap_cache_async+0x70/0x1a0)
    Jan 10 00:12:18 NAS kernel: [<c00cb4c0>] (__read_swap_cache_async) from [<c00cb600>] (read_swap_cache_async+0x10/0x34)
    Jan 10 00:12:18 NAS kernel: [<c00cb600>] (read_swap_cache_async) from [<c00cb788>] (swapin_readahead+0x164/0x17c)
    Jan 10 00:12:18 NAS kernel: [<c00cb788>] (swapin_readahead) from [<c00bd4fc>] (handle_mm_fault+0x83c/0xc04)
    Jan 10 00:12:18 NAS kernel: [<c00bd4fc>] (handle_mm_fault) from [<c0017cb8>] (do_page_fault+0x134/0x2b0)
    Jan 10 00:12:18 NAS kernel: [<c0017cb8>] (do_page_fault) from [<c00092b0>] (do_DataAbort+0x34/0xb8)
    Jan 10 00:12:18 NAS kernel: [<c00092b0>] (do_DataAbort) from [<c00123fc>] (__dabt_usr+0x3c/0x40)
    Jan 10 00:12:18 NAS kernel: Out of memory: Kill process 1113 (mount) score 1 or sacrifice child
    Jan 10 00:12:18 NAS kernel: Killed process 1113 (mount) total-vm:5400kB, anon-rss:0kB, file-rss:1764kB
    Jan 10 00:12:18 NAS kernel: mount: page allocation failure: order:0, mode:0x2600040
    Jan 10 00:12:18 NAS kernel: CPU: 0 PID: 1113 Comm: mount Tainted: P O 4.4.190.armada.1 #1
    Jan 10 00:12:18 NAS kernel: Hardware name: Marvell Armada 370/XP (Device Tree)
    Jan 10 00:12:18 NAS kernel: [<c0015270>] (unwind_backtrace) from [<c001173c>] (show_stack+0x10/0x18)
    Jan 10 00:12:18 NAS kernel: [<c001173c>] (show_stack) from [<c03849d0>] (dump_stack+0x78/0x9c)
    Jan 10 00:12:18 NAS kernel: [<c03849d0>] (dump_stack) from [<c00a2570>] (warn_alloc_failed+0xec/0x118)
    Jan 10 00:12:18 NAS kernel: [<c00a2570>] (warn_alloc_failed) from [<c00a4a44>] (__alloc_pages_nodemask+0x750/0x7dc)
    Jan 10 00:12:18 NAS kernel: [<c00a4a44>] (__alloc_pages_nodemask) from [<c00d0d58>] (allocate_slab+0x88/0x280)
    Jan 10 00:12:18 NAS kernel: [<c00d0d58>] (allocate_slab) from [<c00d253c>] (___slab_alloc.constprop.13+0x250/0x35c)
    Jan 10 00:12:18 NAS kernel: [<c00d253c>] (___slab_alloc.constprop.13) from [<c00d2828>] (kmem_cache_alloc+0xac/0x168)
    Jan 10 00:12:18 NAS kernel: [<c00d2828>] (kmem_cache_alloc) from [<c0306114>] (ulist_alloc+0x1c/0x54)
    Jan 10 00:12:18 NAS kernel: [<c0306114>] (ulist_alloc) from [<c03040d0>] (resolve_indirect_refs+0x1c/0x6d4)
    Jan 10 00:12:18 NAS kernel: [<c03040d0>] (resolve_indirect_refs) from [<c0304b54>] (find_parent_nodes+0x3cc/0x6b0)
    Jan 10 00:12:18 NAS kernel: [<c0304b54>] (find_parent_nodes) from [<c0304eb8>] (btrfs_find_all_roots_safe+0x80/0xfc)
    Jan 10 00:12:18 NAS kernel: [<c0304eb8>] (btrfs_find_all_roots_safe) from [<c0304f7c>] (btrfs_find_all_roots+0x48/0x6c)
    Jan 10 00:12:18 NAS kernel: [<c0304f7c>] (btrfs_find_all_roots) from [<c03089ec>] (btrfs_qgroup_prepare_account_extents+0x58/0xa0)
    Jan 10 00:12:18 NAS kernel: [<c03089ec>] (btrfs_qgroup_prepare_account_extents) from [<c029b714>] (btrfs_commit_transaction+0x49c/0x9b4)
    Jan 10 00:12:18 NAS kernel: [<c029b714>] (btrfs_commit_transaction) from [<c0284ac4>] (btrfs_drop_snapshot+0x420/0x6bc)
    Jan 10 00:12:18 NAS kernel: [<c0284ac4>] (btrfs_drop_snapshot) from [<c02f6338>] (merge_reloc_roots+0x120/0x220)
    Jan 10 00:12:18 NAS kernel: [<c02f6338>] (merge_reloc_roots) from [<c02f7138>] (btrfs_recover_relocation+0x2c8/0x370)
    Jan 10 00:12:18 NAS kernel: [<c02f7138>] (btrfs_recover_relocation) from [<c0298f00>] (open_ctree+0x1df0/0x2168)
    Jan 10 00:12:18 NAS kernel: [<c0298f00>] (open_ctree) from [<c026f578>] (btrfs_mount+0x458/0x690)
    Jan 10 00:12:18 NAS kernel: [<c026f578>] (btrfs_mount) from [<c00dbdc0>] (mount_fs+0x6c/0x14c)
    Jan 10 00:12:18 NAS kernel: [<c00dbdc0>] (mount_fs) from [<c00f4490>] (vfs_kern_mount+0x4c/0xf0)
    Jan 10 00:12:18 NAS kernel: [<c00f4490>] (vfs_kern_mount) from [<c026ea68>] (mount_subvol+0xf4/0x7ac)
    Jan 10 00:12:18 NAS kernel: [<c026ea68>] (mount_subvol) from [<c026f2f4>] (btrfs_mount+0x1d4/0x690)
    Jan 10 00:12:18 NAS kernel: [<c026f2f4>] (btrfs_mount) from [<c00dbdc0>] (mount_fs+0x6c/0x14c)
    Jan 10 00:12:18 NAS kernel: [<c00dbdc0>] (mount_fs) from [<c00f4490>] (vfs_kern_mount+0x4c/0xf0)
    Jan 10 00:12:18 NAS kernel: [<c00f4490>] (vfs_kern_mount) from [<c00f71a4>] (do_mount+0xa30/0xb60)
    Jan 10 00:12:18 NAS kernel: [<c00f71a4>] (do_mount) from [<c00f74fc>] (SyS_mount+0x70/0xa0)
    Jan 10 00:12:18 NAS kernel: [<c00f74fc>] (SyS_mount) from [<c000ec40>] (ret_fast_syscall+0x0/0x40)


    It is a well-established fact that quotas carry a lot of calculation overhead when deleting snapshots. The RN104 has only 512 MB of RAM, so it is already resource-starved, and deleting many snapshots in a row (or large snapshots) can tip the unit over the edge.

    You can run into a race condition where the filesystem needs to update (commit btrfs transactions) but the quota module has hogged all the resources, and then you end up in limbo. The OS reinstall disabled quotas (which I didn't know it would), and that likely allowed the filesystem to actually finish the clean-up. That explains why the OS reinstall "fixed" it: it didn't really fix anything, it just disabled quotas, which made room for the other filesystem transactions to take place.

     

    I would advise keeping a lower number of snapshots on these units in general. Disabling quotas before you delete snapshots, or before running things like a balance, will help keep the unit afloat. Then just re-enable quotas afterwards.
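For anyone finding this later, the disable/balance/re-enable sequence above can be sketched as a small shell script. This is my own hedged sketch, not an official NETGEAR procedure: it assumes SSH root access and that the data volume is mounted at /data (check `btrfs filesystem show` for your actual mount point). It only prints the commands unless you explicitly set DRY_RUN=0 on the NAS itself.

```shell
#!/bin/sh
# Hedged sketch of the advice above, not an official procedure.
# Assumptions: SSH root access, data volume mounted at /data.
VOL=${VOL:-/data}
DRY_RUN=${DRY_RUN:-1}   # leave at 1 to only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run btrfs quota disable "$VOL"    # stop qgroup accounting before heavy work
run btrfs balance start "$VOL"    # or: delete your snapshots here instead
run btrfs quota enable "$VOL"     # re-enable once the operation finishes
```

Quota accounting will be rebuilt after re-enabling, so the GUI's snapshot usage figures may take a while to reappear.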

12 Replies

Replies have been turned off for this discussion
  • StephenB
    Guru - Experienced User

    If the volume is read-only, then you should immediately make sure you have an up-to-date backup.  The data is definitely at risk.

     

    You can send a download link to the logs via PM (private message) to one of the mods ( JohnCM_S or Marc_V ).  Others here might also offer to take a look.  You might start by looking in system.log and kernel.log for btrfs and disk i/o errors.  
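    A minimal sketch of that log triage, assuming you have extracted the downloaded log zip into a local directory (kernel.log and system.log are the usual file names in the ReadyNAS log bundle; the directory path is an assumption you'll need to adjust):

```shell
#!/bin/sh
# Grep the extracted ReadyNAS logs for the usual suspects.
# The patterns are illustrative; add your own as needed.
scan_logs() {
    dir=$1
    for pat in 'BTRFS' 'I/O error' 'Out of memory'; do
        echo "== $pat =="
        grep -ihE "$pat" "$dir"/kernel.log "$dir"/system.log 2>/dev/null
    done
    return 0
}

scan_logs "${1:-.}"   # e.g.: sh scan.sh ~/Downloads/nas-logs
```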

    • Matthias1111
      Aspirant

      Hello Stephen,

      thanks for the fast reply. I asked Marc for help. 

       

      Best Matthias

      • Matthias1111
        Aspirant

        One additional remark: since I have backups of my NAS data, I tried the boot menu option "reinstall OS". This worked fine and the volume is back online now. The only oddity is that the yellow segment for snapshot consumption is missing from the capacity bar (see screenshot). But the snapshots are still there and accessible, and new ones are taken automatically as configured. What do you think? Is my NAS healthy, or should I factory-reset it? (I would prefer to avoid that, because it takes a lot of time to reconfigure the NAS and transfer all the data back.)
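        If the yellow snapshot segment disappeared because quotas got disabled (as the accepted answer explains), `btrfs qgroup show` on the data volume would report that quotas are not enabled. A hedged, read-only way to check over SSH; /data is an assumption, and the script only prints the command unless it is actually running on the NAS:

```shell
#!/bin/sh
# Read-only check: does the volume still have qgroup (quota) accounting?
# /data is an assumption; adjust to your mount point. Prints the command
# instead of running it unless ON_NAS=1, since this only makes sense on
# the NAS itself.
VOL=${VOL:-/data}
cmd="btrfs qgroup show $VOL"
if [ "${ON_NAS:-0}" = "1" ]; then
    $cmd
else
    echo "run on the NAS: $cmd"
fi
```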

         
