chyzm
Jan 23, 2019 · Aspirant
Volume degraded after firmware update to 6.9.5 on RN212
The NAS was upgraded to version 6.9.5 this evening, and after the reboot the drive in slot 1 is now showing as degraded. I rebooted and power-cycled the unit with no fix. Ran disk tests too, with no luck...
ccbnz
Apr 13, 2019 · Aspirant
Did you get this sorted?
I seem to have the same problem on my ReadyNAS Pro (upgraded to OS6).
I updated the firmware to 6.9.5 and later disk 1 showed as Failed. I rebooted the NAS and the disk now shows OK but the resync didn’t complete and the NAS display is flashing “Data Degraded”. Not sure what caused this.
Hopchen
Apr 15, 2019 · Prodigy
Hi ccbnz
I took a look at your logs. Analysis below.
=== Overview ===
Your disk configuration is: 3TB x 3TB x 2TB x 3TB x 3TB x 3TB.
This means the NAS creates two RAID layers because of the mixed disk sizes, as follows:
md127 = 6 x 2TB (raid5)
md126 = 5 x 1TB (raid5)
In other words, every disk contributes a 2TB partition to the first layer, and the five 3TB disks contribute their remaining ~1TB to the second layer. We can see from the raid config that this is indeed the case. However, notice how sda is only a spare in the md126 raid. I suspect this happens because the NAS cannot sync the disk back in properly. More on that further down.
md126 : active raid5 sdb4[1] sda4[5](S) sdf4[4] sde4[3] sdd4[2]
3906483712 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [_UUUU] <<<=== Raid degraded.
md127 : active raid5 sda3[0] sdf3[5] sde3[4] sdd3[3] sdc3[2] sdb3[1]
9743324160 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
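If you want to dig into the degraded layer yourself, the standard mdadm tooling should show the same picture. A minimal sketch, assuming you have SSH access to the NAS enabled and the device names above:
  cat /proc/mdstat                 # quick summary of both RAID layers
  mdadm --detail /dev/md126        # degraded layer; sda4 should be listed as a spare
  mdadm --detail /dev/md127        # healthy 6-disk layer, for comparison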
Your two raids are then "stuck" together by the filesystem in order to create one volume.
Label: '<masked>:data'  uuid: <masked>
        Total devices 2  FS bytes used 970.18GiB
        devid    1 size 9.07TiB used 972.02GiB path /dev/md127
        devid    2 size 3.64TiB used 0.00B path /dev/md126
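This listing is the BTRFS view of the volume. If you want to reproduce it yourself over SSH, something along these lines should work (the /data mount point is an assumption based on the usual ReadyNAS OS6 layout):
  btrfs filesystem show            # lists the md devices backing the volume
  btrfs filesystem df /data        # space usage per allocation type on the data volume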
=== Issue ===
Disk 1 fell out of the md126 raid and was never able to re-join the raid array. The below sequence happens every time you reboot.
[19/04/12 18:01:48] err:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 1 (Internal) changed state from ONLINE to FAILED.
[19/04/12 19:45:48] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
[19/04/12 19:45:52] notice:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 1 (Internal) changed state from ONLINE to RESYNC.
[19/04/12 23:29:08] notice:volume:LOGMSG_RESILVERCOMPLETE_DEGRADED_VOLUME The resync operation finished on volume data. However, the volume is still degraded.
[19/04/12 23:45:10] notice:system:LOGMSG_SYSTEM_REBOOT The system is rebooting.
[19/04/12 23:46:44] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[19/04/12 23:46:45] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
[19/04/13 00:24:08] notice:volume:LOGMSG_RESILVERCOMPLETE_DEGRADED_VOLUME The resync operation finished on volume data. However, the volume is still degraded.
A raid sync stopping like this is a safety mechanism: one or more drives are not responding properly during the sync, and the NAS stops to avoid a potential double disk failure. Your disks appear OK-ish, with two disks showing a small number of ATA errors.
Device: sdc  Channel: 2  <<<=== Disk 3
  ATA Error Count: 9
Device: sdd  Channel: 3  <<<=== Disk 4
  ATA Error Count: 1
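If you want to check those counters directly, smartctl reads the same SMART data that these log extracts are built from. A sketch, assuming SSH access to the NAS; adjust /dev/sdX to the disk in question:
  smartctl -a /dev/sdc | grep -i "ATA Error Count"   # quick look at the counter for disk 3
  smartctl -l error /dev/sdc                         # full ATA error log for that drive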
However, the kernel logs are completely flooded with disk errors from 4 of the disks: 2, 4, 5 and 6. Below is a mere extract.
So, this is why the raid sync fails. It is likely also why disk 1 eventually ends up marked as a mere spare in the md126 raid.
--- Disk 2 ---
Apr 13 20:40:24 kernel: ata2.00: exception Emask 0x50 SAct 0xb0000 SErr 0x280900 action 0x6 frozen
Apr 13 20:40:24 kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Apr 13 20:40:24 kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Apr 13 20:40:24 kernel: ata2.00: failed command: READ FPDMA QUEUED
Apr 13 20:40:24 kernel: ata2.00: cmd 60/40:80:40:19:95/00:00:00:00:00/40 tag 16 ncq 32768 in
res 40/00:a4:00:19:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:40:24 kernel: ata2.00: status: { DRDY }
Apr 13 20:40:24 kernel: ata2.00: failed command: READ FPDMA QUEUED
Apr 13 20:40:24 kernel: ata2.00: cmd 60/40:88:40:7f:9b/00:00:00:00:00/40 tag 17 ncq 32768 in
res 40/00:a4:00:19:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:40:24 kernel: ata2.00: status: { DRDY }
Apr 13 20:40:24 kernel: ata2.00: failed command: READ FPDMA QUEUED
Apr 13 20:40:24 kernel: ata2.00: cmd 60/80:98:c0:7f:9b/00:00:00:00:00/40 tag 19 ncq 65536 in
res 40/00:a4:00:19:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:40:24 kernel: ata2.00: status: { DRDY }
Apr 13 20:40:24 kernel: ata2: hard resetting link
--- Disk 4 ----
Apr 13 20:36:41 kernel: ata4.00: exception Emask 0x50 SAct 0xfe0000 SErr 0x280900 action 0x6 frozen
Apr 13 20:36:41 kernel: ata4.00: irq_stat 0x08000000, interface fatal error
Apr 13 20:36:41 kernel: ata4: SError: { UnrecovData HostInt 10B8B BadCRC }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:88:b0:3b:e7/05:00:ee:00:00/40 tag 17 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:90:f0:40:e7/05:00:ee:00:00/40 tag 18 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:98:30:46:e7/05:00:ee:00:00/40 tag 19 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:a0:70:4b:e7/05:00:ee:00:00/40 tag 20 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:a8:b0:50:e7/05:00:ee:00:00/40 tag 21 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:b0:f0:55:e7/05:00:ee:00:00/40 tag 22 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/00:b8:30:5b:e7/05:00:ee:00:00/40 tag 23 ncq 655360 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4: hard resetting link
--- Disk 5 ----
Apr 13 20:38:33 kernel: ata5.00: exception Emask 0x50 SAct 0x7 SErr 0x200900 action 0x6 frozen
Apr 13 20:38:34 kernel: ata5.00: irq_stat 0x08000000, interface fatal error
Apr 13 20:38:34 kernel: ata5: SError: { UnrecovData HostInt BadCRC }
Apr 13 20:38:34 kernel: ata5.00: failed command: READ FPDMA QUEUED
Apr 13 20:38:34 kernel: ata5.00: cmd 60/80:00:40:10:95/00:00:00:00:00/40 tag 0 ncq 65536 in
res 40/00:0c:00:10:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:38:34 kernel: ata5.00: status: { DRDY }
Apr 13 20:38:34 kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 13 20:38:34 kernel: ata5.00: cmd 61/40:08:00:10:95/00:00:00:00:00/40 tag 1 ncq 32768 out
res 40/00:0c:00:10:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:38:34 kernel: ata5.00: status: { DRDY }
Apr 13 20:38:34 kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 13 20:38:34 kernel: ata5.00: cmd 61/40:10:c0:0f:95/00:00:00:00:00/40 tag 2 ncq 32768 out
res 40/00:0c:00:10:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:38:34 kernel: ata5.00: status: { DRDY }
Apr 13 20:38:34 kernel: ata5: hard resetting link
Apr 13 20:38:34 kernel: do_marvell_9170_recover: ignoring PCI device (8086:2821) at PCI#0
Apr 13 20:38:34 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 13 20:38:34 kernel: ata5.00: configured for UDMA/33
Apr 13 20:38:34 kernel: ata5: EH complete
--- Disk 6 ----
Apr 13 20:36:41 kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 13 20:36:41 kernel: ata6.00: configured for UDMA/33
Apr 13 20:36:41 kernel: ata6: EH complete
Apr 13 20:36:41 kernel: ata6.00: exception Emask 0x50 SAct 0x807fffc SErr 0x280900 action 0x6 frozen
Apr 13 20:36:41 kernel: ata6.00: irq_stat 0x08000000, interface fatal error
Apr 13 20:36:41 kernel: ata6: SError: { UnrecovData HostInt 10B8B BadCRC }
Apr 13 20:36:41 kernel: ata6.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata6.00: cmd 60/18:10:98:a9:e6/02:00:ee:00:00/40 tag 2 ncq 274432 in
res 40/00:94:30:5e:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata6.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata6.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata6.00: cmd 60/40:18:58:a4:e6/05:00:ee:00:00/40 tag 3 ncq 688128 in
res 40/00:94:30:5e:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata6.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata6.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata6.00: cmd 60/40:20:18:9f:e6/05:00:ee:00:00/40 tag 4 ncq 688128 in
res 40/00:94:30:5e:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata6.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata6.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata6.00: cmd 60/40:28:d8:99:e6/05:00:ee:00:00/40 tag 5 ncq 688128 in
res 40/00:94:30:5e:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
It is of course unlikely that so many disks are bad, even though the disks aren't young anymore. Notice that these are ATA bus/CRC errors rather than media errors, so I would suspect the chassis here.
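If you want to gauge how widespread the errors are, you can count them per SATA port from the kernel ring buffer. A rough sketch (the same lines should also be in the kernel log inside the downloaded log bundle):
  for port in ata1 ata2 ata3 ata4 ata5 ata6; do
      printf '%s: ' "$port"
      dmesg | grep -c "$port.*failed command"
  done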
I suggest that you take a backup right now. With md126 already degraded, one more disk being kicked from that raid could leave you in serious trouble.
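As one option, if SSH and rsync are available on the NAS, the data can be pulled from another Linux machine roughly like this (the hostname and destination path are placeholders):
  rsync -avh --progress root@readynas:/data/ /mnt/backup/readynas/
Otherwise, copying the shares off to another machine over SMB works just as well.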
=== My recommendations ===
1. Take a backup of your data ASAP.
2. Turn off the NAS and test each disk in a PC with WD's disk test tool, which can be downloaded from their website (see the smartctl sketch after this list for a Linux alternative).
3. Replace any disks that come out bad.
4. Factory reset the NAS and start over with all healthy disks. Restore the data from backups.
5. If the issue re-occurs, replace the NAS. Keep backups at all times!
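For step 2, if the test machine runs Linux rather than Windows, an extended SMART self-test is a reasonable substitute for the vendor tool. A sketch; replace /dev/sdX with the disk under test:
  smartctl -t long /dev/sdX        # start the extended self-test (can take several hours)
  smartctl -l selftest /dev/sdX    # check the result once it has finished
  smartctl -a /dev/sdX             # also look at reallocated / pending sector counts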
Cheers