RN214 file system read-only (again)

edkedk · ‎2019-06-17

It seems I'm extremely unlucky with this device because this is not the first time a file system error appears (cf. https://community.netgear.com/t5/Using-your-ReadyNAS-in-Business/RN204-repeated-file-system-error/m-... )

RN214 (firmware v6.9.5 Hotfix1) with 4x8TB WD Purple HDDs, used for backing up some servers in an AD environment. About half of the volume is filled. Balance, defrag, and scrub are each scheduled weekly.

Then the file system went read-only. The web interface shows this:

Jun 17, 2019 04:00:01 AM Volume: Scrub started for volume data.
Jun 17, 2019 03:00:01 AM Volume: Defragmentation complete for volume data.
Jun 17, 2019 03:00:01 AM Volume: Defragmentation started for volume data.
Jun 17, 2019 02:00:01 AM Volume: Balance complete for volume data.
Jun 17, 2019 02:00:01 AM Volume: Balance started for volume data.
Jun 16, 2019 10:13:58 PM Volume: The volume data encountered an error and was made read-only. It is recommended to backup your data.
Jun 11, 2019 10:22:24 AM Volume: Scrub completed for volume data'.
Jun 10, 2019 04:00:01 AM Volume: Scrub started for volume data.
Jun 10, 2019 03:53:36 AM Volume: Defragmentation complete for volume data.
Jun 10, 2019 03:00:01 AM Volume: Defragmentation started for volume data.

In systemd-journal.log I see a lot of entries like this:

Jun 16 22:12:20 hubu004 kernel: ------------[ cut here ]------------
Jun 16 22:12:20 hubu004 kernel: WARNING: CPU: 3 PID: 6743 at fs/btrfs/disk-io.c:541 btree_csum_one_bio+0x94/0xd8()
Jun 16 22:12:20 hubu004 kernel: Modules linked in: vpd(PO)
Jun 16 22:12:20 hubu004 kernel: CPU: 3 PID: 6743 Comm: kworker/u8:2 Tainted: P W O 4.4.157.alpine.1 #1
Jun 16 22:12:20 hubu004 kernel: Hardware name: Annapurna Labs Alpine
Jun 16 22:12:20 hubu004 kernel: Workqueue: btrfs-worker btrfs_worker_helper
Jun 16 22:12:20 hubu004 kernel: [<c0014690>] (unwind_backtrace) from [<c0011590>] (show_stack+0x10/0x14)
Jun 16 22:12:20 hubu004 kernel: [<c0011590>] (show_stack) from [<c035d294>] (dump_stack+0x7c/0x9c)
Jun 16 22:12:20 hubu004 kernel: [<c035d294>] (dump_stack) from [<c0090d98>] (warn_slowpath_common+0x80/0xac)
Jun 16 22:12:20 hubu004 kernel: [<c0090d98>] (warn_slowpath_common) from [<c001e700>] (warn_slowpath_null+0x18/0x20)
Jun 16 22:12:20 hubu004 kernel: [<c001e700>] (warn_slowpath_null) from [<c02787e0>] (btree_csum_one_bio+0x94/0xd8)
Jun 16 22:12:20 hubu004 kernel: [<c02787e0>] (btree_csum_one_bio) from [<c0277868>] (run_one_async_start+0x34/0x44)
Jun 16 22:12:20 hubu004 kernel: [<c0277868>] (run_one_async_start) from [<c02b55ec>] (btrfs_worker_helper+0xec/0x1ac)
Jun 16 22:12:20 hubu004 kernel: [<c02b55ec>] (btrfs_worker_helper) from [<c0031c44>] (process_one_work+0x1d4/0x30c)
Jun 16 22:12:20 hubu004 kernel: [<c0031c44>] (process_one_work) from [<c0032b30>] (worker_thread+0x2cc/0x440)
Jun 16 22:12:20 hubu004 kernel: [<c0032b30>] (worker_thread) from [<c0036c44>] (kthread+0xf4/0x104)
Jun 16 22:12:20 hubu004 kernel: [<c0036c44>] (kthread) from [<c000e920>] (ret_from_fork+0x14/0x34)
Jun 16 22:12:20 hubu004 kernel: ---[ end trace f005209bdac8c6a3 ]---

Then this:

Jun 16 22:12:37 hubu004 kernel: BTRFS: error (device md127) in btrfs_commit_transaction:2241: errno=-5 IO failure (Error while writing out transaction)
Jun 16 22:12:37 hubu004 kernel: BTRFS info (device md127): forced readonly
Jun 16 22:12:37 hubu004 kernel: BTRFS warning (device md127): Skipping commit of aborted transaction.
Jun 16 22:12:37 hubu004 kernel: BTRFS: error (device md127) in cleanup_transaction:1864: errno=-5 IO failure
Jun 16 22:12:37 hubu004 kernel: BTRFS info (device md127): delayed_refs has NO entry

Finally hundreds of lines of this:

Jun 17 08:52:36 hubu004 kernel: BTRFS critical (device md127): unable to find logical 764401909760 len 4096
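
To see whether those messages keep hitting one logical address or many different ones, the distinct addresses can be counted. This is only a rough sketch: it assumes the journal is available as a text file, and the function name count_missing_logical is made up for illustration.

```shell
# Count how often each btrfs logical address appears in
# "unable to find logical" messages.
# Reads log text on stdin; prints "<count> <logical address>", sorted by address.
count_missing_logical() {
    grep -o 'unable to find logical [0-9]*' \
        | awk '{ print $5 }' \
        | sort -n \
        | uniq -c \
        | awk '{ print $1, $2 }'
}

# Usage (file name assumed):
# count_missing_logical < systemd-journal.log
```

A single repeated address would suggest one damaged metadata block; many different addresses would suggest wider damage.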

This means a file system crash, right? Are there any options other than resetting the device and rebuilding the volume, losing all data? I do have a secondary backup, but to be honest, I am fed up with a file system crash every few months!

StephenB · ‎2019-06-17

Did you replace the disk that generated the errors the last time?

edkedk · ‎2019-06-17

Some weeks ago I replaced a disk that started to develop bad sectors.

StephenB · ‎2019-06-17

@edkedk wrote:
Some weeks ago I replaced a disk that started to develop bad sectors.

Ok. And that resynced ok?

Perhaps enable ssh, and enter

# smartctl -x /dev/sda
# smartctl -x /dev/sdb
# smartctl -x /dev/sdc
# smartctl -x /dev/sdd

and look for saved errors for the drives.

For example, something like this:

Error 12 [11] occurred at disk power-on lifetime: 36166 hours (1506 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 0c 27 df 40 40 00  Error: UNC at LBA = 0x0c27df40 = 203939648

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 80 00 c8 00 00 0c 27 df 40 40 08  1d+08:15:46.204  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 06 34 36 98 40 08  1d+08:15:46.163  READ FPDMA QUEUED
  60 00 80 00 b8 00 00 0c 27 e4 40 40 08  1d+08:15:46.146  READ FPDMA QUEUED
  60 00 80 00 b0 00 00 0c 27 e9 40 40 08  1d+08:15:46.123  READ FPDMA QUEUED
  60 00 80 00 a8 00 00 0c 27 ee 40 40 08  1d+08:15:46.094  READ FPDMA QUEUED
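
The full `smartctl -x` output is long, so a filter can help with a first look. The sketch below is an assumption-laden convenience, not a complete health check: the helper name smart_summary is made up, and it keeps only the reallocated, pending, and uncorrectable sector attributes, the interface CRC counter, and any logged error entries.

```shell
# Keep only the lines of `smartctl -x` output that matter for a first look:
# sector-health attributes, the interface CRC counter, and logged error entries.
# Reads smartctl output on stdin.
smart_summary() {
    grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|^Error [0-9]+ |Error: '
}

# Usage on the NAS over ssh (device names as in the commands above):
# for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
#     echo "== $d"
#     smartctl -x "$d" | smart_summary
# done
```

Anything the filter prints is worth reading in context in the unfiltered output.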

edkedk · ‎2019-06-17

Yes, resync completed without any error.

edkedk · ‎2019-06-17

Actually, the error I mentioned in the older thread I linked in the first post happened on the other RN214 device we have in the network...

The disk was replaced in the one this thread is about.

edkedk · ‎2019-06-18

I have run the smartctl commands. Two of the disks show reallocated sectors (1 on one, 3 on the other). On sdd there is this error log:

Error 1 [0] occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 41 00 00 00 00 00 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 78 00 00 00 00 01 b8 40 08     00:00:32.826  READ FPDMA QUEUED
  60 00 08 00 70 00 00 00 00 01 b0 40 08     00:00:32.826  READ FPDMA QUEUED
  60 00 08 00 68 00 00 00 00 01 a8 40 08     00:00:32.825  READ FPDMA QUEUED
  60 00 08 00 60 00 00 00 00 01 a0 40 08     00:00:32.825  READ FPDMA QUEUED
  60 00 08 00 58 00 00 00 00 01 98 40 08     00:00:32.825  READ FPDMA QUEUED

StephenB · ‎2019-06-18

@edkedk wrote:

I have run the smartctl commands. On two disks there are 1 and 3 reallocated sectors. On sdd there is this error log:
Error 1 [0] occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
Error: ICRC, ABRT at LBA = 0x00000000 = 0

It looks like that happened when the disk was initially powered up. Did you put it into the NAS right away, or did you connect it to a PC and test it first?

It's related to the communication between the NAS (or computer) and the drive, so it could simply be a poor connection when first plugged in. I don't think this error is concerning (and don't see how it would make your system read-only, since it happened before the drive was added to the array).

The other drives with the reallocated sectors might be part of the puzzle though, and probably should be tested.
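
One possible way to test them from the NAS itself is an extended SMART self-test. This is only a sketch of one option: the thread does not say which drives have the reallocated sectors, so the device names in the usage line are placeholders, and the helper name start_long_tests and the SMARTCTL override are made up for illustration.

```shell
# Start an extended (long) SMART self-test on each given device.
# SMARTCTL can be overridden (e.g. SMARTCTL=echo) to preview the
# commands without running them.
start_long_tests() {
    for dev in "$@"; do
        "${SMARTCTL:-smartctl}" -t long "$dev"
    done
}

# Usage (placeholder device names):
# start_long_tests /dev/sda /dev/sdb
# Later, read the results with:
# smartctl -l selftest /dev/sda
```

An extended test can take many hours on drives of this size, and the drive runs it in the background.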
