
Re: RN214 data degraded even after hard disk replacement

azees
Aspirant

RN214 data degraded even after hard disk replacement

Hello all. I have had an RN214 for several years now, comprising four HDDs in one volume. Trays 1 and 2 hold two 4TB HDDs in a RAID1 configuration, and trays 3 and 4 hold two 2TB HDDs, also in RAID1. All disks are WD Red at 5400rpm with 64MB cache. Two weeks ago, the 4TB HDD in tray 1 died and the system has shown "DATA DEGRADED" on its LCD ever since. I replaced the failed HDD with a similar one, a 4TB WD Red at 5400rpm with 256MB cache. At some point I got a notice that the system was resynchronising, and the process got up to nearly 35%. Since then nothing constructive has happened. The LCD keeps blinking "DATA DEGRADED". Am I doing something wrong? I thought that the resynchronisation process should be an automatic procedure after the installation of the new HDD. Is there anything more I ought to do in order for my system to obtain a healthy status once again without losing my data? Thank you in advance.

Message 1 of 7
StephenB
Guru

Re: RN214 data degraded even after hard disk replacement


@azees wrote:

 I replaced the failed HDD with a similar one, a 4TB WD RED at 5400rpm with 256MB cache.


Unfortunately the WD40EFAX is an SMR drive, not CMR.  All the current WD Reds are SMR - the CMR models are now all in the WD Red Plus line.  Several folks here have reported issues with the SMR models in OS-6 NAS (similar issues have been reported with competing NAS running ZFS or BTRFS file systems).

 

So the new drive might be part of the problem.  If it's not too late to exchange it with the seller for a Red Plus (WD40EFRX), I suggest doing that.
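If you want to double-check which model is actually in the tray from the NAS itself (and you are comfortable enabling SSH), a read-only SMART query will show the model string. smartctl is normally present on OS-6, but treat this as a sketch and adjust the device name to the tray in question:

smartctl -i /dev/sda   # identity info; the Device Model line shows EFAX (SMR) vs EFRX (CMR)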

 


@azees wrote:

Is there anything more I ought to do in order for my system to obtain a healthy status once again without losing my data? 


Your data is at risk.  RAID isn't enough to keep your data safe (many posters here have found that out the hard way).  And the degraded volume means you don't have RAID protection right now anyway.  If you don't have a backup plan in place for your data, then I recommend taking care of that.  Drives (and ReadyNAS) can fail at any time.

 


@azees wrote:

At some point I got the notice that the system is resynchronising and the process got up to nearly 35%.

 

I thought that the resynchronisation process should be an automatic procedure after the installation of the new HDD. 


It is, if you are running XRAID - and in your case the process did start automatically.  If you look on the volume page in the NAS web UI, is the process still shown as running?  If not, is there any indication on the log page that the process failed?

 

Please download the full log zip file, and post the contents of mdstat.log (copy/paste in your reply).
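For reference (and only if you are comfortable with SSH), mdstat.log contains roughly the same information you would get from these read-only commands:

cat /proc/mdstat            # overall array status, including any resync progress
mdadm --detail /dev/md127   # per-array detail; substitute the md device of your data volume as shown in /proc/mdstat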
 

Message 2 of 7
azees
Aspirant

Re: RN214 data degraded even after hard disk replacement

Thank you so much for your reply. The logs do show a resync failure. I'm pasting the content of the mdstat.log file as you asked:

 

Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md126 : active raid1 sdb3[1] sda3[2](S)
3902168832 blocks super 1.2 [2/1] [_U]

md127 : active raid1 sdc3[0] sdd3[1]
1948662784 blocks super 1.2 [2/2] [UU]
bitmap: 0/15 pages [0KB], 65536KB chunk

md1 : active raid10 sda2[0] sdd2[3] sdc2[2] sdb2[1]
1044480 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

md0 : active raid1 sdc1[0] sda1[5] sdb1[3] sdd1[4]
4190208 blocks super 1.2 [4/4] [UUUU]

unused devices: <none>
/dev/md/0:
Version : 1.2
Creation Time : Sat Jul 12 05:25:03 2014
Raid Level : raid1
Array Size : 4190208 (4.00 GiB 4.29 GB)
Used Dev Size : 4190208 (4.00 GiB 4.29 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent

Update Time : Fri Dec 10 21:09:35 2021
State : active
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0

Consistency Policy : unknown

Name : 09e0c79d:0 (local to host 09e0c79d)
UUID : a39f9642:f027c4cd:e6c78a23:9b5a85f6
Events : 555005

Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
4 8 49 1 active sync /dev/sdd1
3 8 17 2 active sync /dev/sdb1
5 8 1 3 active sync /dev/sda1
/dev/md/1:
Version : 1.2
Creation Time : Sat Dec 4 21:39:25 2021
Raid Level : raid10
Array Size : 1044480 (1020.00 MiB 1069.55 MB)
Used Dev Size : 522240 (510.00 MiB 534.77 MB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent

Update Time : Fri Dec 10 21:08:20 2021
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0

Layout : near=2
Chunk Size : 512K

Consistency Policy : unknown

Name : 09e0c79d:1 (local to host 09e0c79d)
UUID : 3d015c59:09fa392c:49a1f6cf:f46875ac
Events : 19

Number Major Minor RaidDevice State
0 8 2 0 active sync set-A /dev/sda2
1 8 18 1 active sync set-B /dev/sdb2
2 8 34 2 active sync set-A /dev/sdc2
3 8 50 3 active sync set-B /dev/sdd2
/dev/md/data-0:
Version : 1.2
Creation Time : Sat Jul 12 05:25:04 2014
Raid Level : raid1
Array Size : 1948662784 (1858.39 GiB 1995.43 GB)
Used Dev Size : 1948662784 (1858.39 GiB 1995.43 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Fri Dec 10 21:09:23 2021
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

Consistency Policy : unknown

Name : 09e0c79d:data-0 (local to host 09e0c79d)
UUID : 98d23c1b:880010b6:9d3e98e8:01e17cf8
Events : 27876

Number Major Minor RaidDevice State
0 8 35 0 active sync /dev/sdc3
1 8 51 1 active sync /dev/sdd3
/dev/md/data-1:
Version : 1.2
Creation Time : Tue Jul 28 10:56:21 2015
Raid Level : raid1
Array Size : 3902168832 (3721.40 GiB 3995.82 GB)
Used Dev Size : 3902168832 (3721.40 GiB 3995.82 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent

Update Time : Fri Dec 10 21:09:23 2021
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1

Consistency Policy : unknown

Name : 09e0c79d:data-1 (local to host 09e0c79d)
UUID : 3bec5d68:ea24fbde:1034161c:4cd3ed69
Events : 133659

Number Major Minor RaidDevice State
- 0 0 0 removed
1 8 19 1 active sync /dev/sdb3

2 8 3 - spare /dev/sda3

Model: RN21400|ReadyNAS 214 Series 4-Bay (Diskless)
Message 3 of 7
StephenB
Guru

Re: RN214 data degraded even after hard disk replacement


@azees wrote:

 

Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md126 : active raid1 sdb3[1] sda3[2](S)
3902168832 blocks super 1.2 [2/1] [_U]

 

Number Major Minor RaidDevice State
- 0 0 0 removed
1 8 19 1 active sync /dev/sdb3

2 8 3 - spare /dev/sda3


sda (normally disk 1) has been marked as a spare for some reason.

 

Can you give us more details on why the resync failed?  You might need to look for disk errors in system.log and kernel.log at around the time of the resync failure message.
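If it helps narrow things down, after extracting the log zip you could search for the relevant entries - a rough sketch, assuming you have grep on your PC (or run it on the NAS over SSH), with the date adjusted to when the resync stopped:

grep -E "I/O error|ata[0-9]" kernel.log   # low-level read/write errors from the disks
grep -i "md126" system.log                # messages about the degraded data array and its resync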

 

I do recommend backing up the data on this volume, as it is at risk.

 

 

Message 4 of 7
azees
Aspirant

Re: RN214 data degraded even after hard disk replacement

Hello and thank you very much for your time. I understand your advice and I realise that my data are not safe at the moment, but the reason I am using the RN214 is to keep a backup of my personal files at home. I cannot set up any procedure to take a backup of my backup without spending a whole lot of money on an insanely expensive external disk with a capacity of 6 TB or more. I believe that my NAS should do its job and restore my RAID1 configuration so that my data exist in one more copy. I am about to exchange my newly purchased SMR HDD for a CMR HDD in the hope that the resynchronisation process will complete successfully. The resynchronisation process took place from the 4th until the 5th of December at the latest. I am providing the files you asked for, system.log and kernel.log, since their content is too large to paste into this text editor. I am attaching the system.log file to this post, and the kernel.log (in PDF format) to the next one. Thank you so much.

 

 

Model: RN21400|ReadyNAS 214 Series 4-Bay (Diskless)
Message 5 of 7
azees
Aspirant

Re: RN214 data degraded even after hard disk replacement

The kernel.log file.

Model: RN21400|ReadyNAS 214 Series 4-Bay (Diskless)
Message 6 of 7
StephenB
Guru

Re: RN214 data degraded even after hard disk replacement


@azees wrote:

the reason I am using the RN214 is to take backup of my personal files at home. 

 

 


If it is a backup (that is, no files exist only on the NAS), then there should be no concern about data loss, as you can always back them up again.

 

There is a flood of errors for sda on Dec 4, but these are likely from the old disk.  The WD40EFAX is detected after the reboot at 21:29.  About 45 minutes later, we start seeing a flood of read errors from sdb.  These appear to have stopped the resync (and, not surprisingly, resulted in BTRFS errors).

 

So you appear to have developed two failed disks.  As an aside, there is a disk test function in the volume settings wheel, and it can be set up to run on a schedule.  It would be good to set up a schedule for all the maintenance functions.  I run one each month (on each volume).

 

Dec 04 15:58:18 AZ-NetGearRN214 kernel: Buffer I/O error on dev sda3, logical block 3902299841, async page read

Dec 04 16:02:39 AZ-NetGearRN214 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
-- Reboot --
Dec 04 21:29:53 AZ-NetGearRN214 kernel: Booting Linux on physical CPU 0x0

Dec 04 21:29:53 AZ-NetGearRN214 kernel: ata1.00: ATA-10: WDC WD40EFAX-68JH4N1, 83.00A83, max UDMA/133

Dec 04 22:15:39 AZ-NetGearRN214 kernel: md/raid1:md126: sdb: unrecoverable I/O read error for block 17811584

Dec 05 18:41:25 AZ-NetGearRN214 kernel: blk_update_request: I/O error, dev sdb, sector 89811392
Dec 05 18:41:25 AZ-NetGearRN214 kernel: md/raid1:md126: sdb: unrecoverable I/O read error for block 80112000

Dec 08 06:26:29 AZ-NetGearRN214 kernel: blk_update_request: I/O error, dev sdb, sector 89883504
Dec 08 06:26:13 AZ-NetGearRN214 kernel: BTRFS error (device md126): bdev /dev/md126 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
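Those sdb errors can be cross-checked with a SMART query over SSH if you want a quick read on how bad the disk is - a sketch only; smartctl is normally available on OS-6, and device letters can move around between boots, so confirm which disk is sdb first:

smartctl -H /dev/sdb                               # overall SMART pass/fail verdict
smartctl -A /dev/sdb | grep -iE "realloc|pending"  # reallocated and pending sector counters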

Also, you appear to have set up alert emails incorrectly.  There are quite a few alerts that are queued up to send. Fixing this might give you some more information on issues detected that have been rotated out of the logs.

Dec 10 21:01:46 AZ-NetGearRN214 msmtpq[22296]: 49 more mails from queue failed to send with rc = 77

As far as what to do:

  • Offload what data you need (as much as possible) from the existing volume (one way to do this is sketched below, after this list).
  • You need to purchase a replacement for sdb (in addition to exchanging the WD40EFAX).  Then destroy this volume, and create a new one with good disks.
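As a sketch of the offload step: if you connect a USB drive to the NAS and enable SSH, something like the command below copies a share across. The share name and USB mount point here are placeholders - check the actual paths under /data and /media on your unit. The backup jobs in the admin UI can do the same thing without the command line.

rsync -a --progress /data/Documents/ /media/USBdrive/Documents/   # repeat per share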

@azees wrote:

I cannot proccess any procedure in which I can take backup of my backup, without spending a whole lot of money on an insanely expensive external disk, with a capacity of 6 TB or more. I believe that my NAS should do its job and restore my RAID1 configuration so as for my data to be in one more copy.

 


Given the problems with sdb, you are facing at least some data loss on this volume.  A resync with the current sdb will probably fail, and even if it succeeds there will be some file system issues and data loss.  That is why I recommended creating a new volume above.

 

Not sure what you consider "insane", and of course the prices are quite different in different countries.  Ultimately, it's a question of what your data is worth to you. A suitable 6 TB external drive can be purchased for about 100 USD in the US.

Message 7 of 7