Forum Discussion
chyzm
Jan 23, 2019 · Aspirant
Volume degraded after firmware update to 6.9.5 on RN212
NAS device was upgraded to the 6.9.5 version this evening and after reboot my drive in slot 1 is now showing as degraded. I rebooted and have power reset with no fix. Ran disk tests also with no luck...
Hopchen
Apr 15, 2019 · Prodigy
Hi ccbnz
I took a look at your logs. Analysis below.
=== Overview ===
Your disk configuration is: 3TB x 3TB x 2TB x 3TB x 3TB x 3TB.
This means that the NAS creates two raids because of the differently sized disks, as follows:
md127 = 6 x 2TB (raid5)
md126 = 5 x 1TB (raid5)
We can see from the raid config that this is indeed the case. However, notice how sda is a spare in the md126 raid. I suspect this happens because it cannot sync the disk back in properly. More on that further down.
md126 : active raid5 sdb4[1] sda4[5](S) sdf4[4] sde4[3] sdd4[2]
3906483712 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [_UUUU] <<<=== Raid degraded.
md127 : active raid5 sda3[0] sdf3[5] sde3[4] sdd3[3] sdc3[2] sdb3[1]
9743324160 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
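For anyone wanting to verify the layout on their own unit, assuming SSH access is enabled, a read-only check looks roughly like this (it does not change the arrays):

cat /proc/mdstat              # quick view of both raids and their member state
mdadm --detail /dev/md126     # more detail, including which member is marked as a spare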
Your two raids are then "stuck" together by the filesystem in order to create one volume.
Label: '<masked>:data'  uuid: <masked>
        Total devices 2  FS bytes used 970.18GiB
        devid 1 size 9.07TiB used 972.02GiB path /dev/md127
        devid 2 size 3.64TiB used 0.00B path /dev/md126
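That output comes from btrfs itself; if you want to reproduce it over SSH, the command is roughly:

btrfs filesystem show

It lists each md device (devid) backing the data volume.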
=== Issue ===
Disk 1 fell out of the md126 raid and was never able to re-join the raid array. The below sequence happens every time you reboot.
[19/04/12 18:01:48] err:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 1 (Internal) changed state from ONLINE to FAILED.
[19/04/12 19:45:48] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
[19/04/12 19:45:52] notice:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 1 (Internal) changed state from ONLINE to RESYNC.
[19/04/12 23:29:08] notice:volume:LOGMSG_RESILVERCOMPLETE_DEGRADED_VOLUME The resync operation finished on volume data. However, the volume is still degraded.
[19/04/12 23:45:10] notice:system:LOGMSG_SYSTEM_REBOOT The system is rebooting.
[19/04/12 23:46:44] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[19/04/12 23:46:45] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
[19/04/13 00:24:08] notice:volume:LOGMSG_RESILVERCOMPLETE_DEGRADED_VOLUME The resync operation finished on volume data. However, the volume is still degraded.
A raid sync stopping like this is a safety mechanism: one or more drives are not responding properly during the sync, and the array stops to avoid a potential double disk failure. Your disks appear OK-ish, with two disks showing a small number of errors.
Device: sdc  Channel: 2   <<<=== Disk 3
ATA Error Count: 9
Device: sdd  Channel: 3   <<<=== Disk 4
ATA Error Count: 1
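Those counts come from SMART. Assuming SSH access and that smartctl is available on the unit, you can check each disk yourself with something like:

smartctl -l error /dev/sdc    # prints the ATA error count and the most recent logged errors
smartctl -l error /dev/sdd

Substitute /dev/sda through /dev/sdf to cover all six disks.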
However, the kernel logs are completely flooded with disk errors from 4 of the disks: 2, 4, 5 and 6. Below is a mere extract.
So, this is why the raid sync fails. It is likely also why disk 1 eventually ends up marked as a mere spare in the md126 raid.
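The extracts below are from the kernel log. For a rough tally of how often this has happened since the last boot, something like the following works over SSH:

dmesg | grep -c 'interface fatal error'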
--- Disk 2 ---
Apr 13 20:40:24 kernel: ata2.00: exception Emask 0x50 SAct 0xb0000 SErr 0x280900 action 0x6 frozen
Apr 13 20:40:24 kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Apr 13 20:40:24 kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Apr 13 20:40:24 kernel: ata2.00: failed command: READ FPDMA QUEUED
Apr 13 20:40:24 kernel: ata2.00: cmd 60/40:80:40:19:95/00:00:00:00:00/40 tag 16 ncq 32768 in
res 40/00:a4:00:19:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:40:24 kernel: ata2.00: status: { DRDY }
Apr 13 20:40:24 kernel: ata2.00: failed command: READ FPDMA QUEUED
Apr 13 20:40:24 kernel: ata2.00: cmd 60/40:88:40:7f:9b/00:00:00:00:00/40 tag 17 ncq 32768 in
res 40/00:a4:00:19:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:40:24 kernel: ata2.00: status: { DRDY }
Apr 13 20:40:24 kernel: ata2.00: failed command: READ FPDMA QUEUED
Apr 13 20:40:24 kernel: ata2.00: cmd 60/80:98:c0:7f:9b/00:00:00:00:00/40 tag 19 ncq 65536 in
res 40/00:a4:00:19:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:40:24 kernel: ata2.00: status: { DRDY }
Apr 13 20:40:24 kernel: ata2: hard resetting link
--- Disk 4 ---
Apr 13 20:36:41 kernel: ata4.00: exception Emask 0x50 SAct 0xfe0000 SErr 0x280900 action 0x6 frozen
Apr 13 20:36:41 kernel: ata4.00: irq_stat 0x08000000, interface fatal error
Apr 13 20:36:41 kernel: ata4: SError: { UnrecovData HostInt 10B8B BadCRC }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:88:b0:3b:e7/05:00:ee:00:00/40 tag 17 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:90:f0:40:e7/05:00:ee:00:00/40 tag 18 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:98:30:46:e7/05:00:ee:00:00/40 tag 19 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:a0:70:4b:e7/05:00:ee:00:00/40 tag 20 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:a8:b0:50:e7/05:00:ee:00:00/40 tag 21 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/40:b0:f0:55:e7/05:00:ee:00:00/40 tag 22 ncq 688128 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata4.00: cmd 60/00:b8:30:5b:e7/05:00:ee:00:00/40 tag 23 ncq 655360 in
res 40/00:bc:30:5b:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata4.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata4: hard resetting link
--- Disk 5 ---
Apr 13 20:38:33 kernel: ata5.00: exception Emask 0x50 SAct 0x7 SErr 0x200900 action 0x6 frozen
Apr 13 20:38:34 kernel: ata5.00: irq_stat 0x08000000, interface fatal error
Apr 13 20:38:34 kernel: ata5: SError: { UnrecovData HostInt BadCRC }
Apr 13 20:38:34 kernel: ata5.00: failed command: READ FPDMA QUEUED
Apr 13 20:38:34 kernel: ata5.00: cmd 60/80:00:40:10:95/00:00:00:00:00/40 tag 0 ncq 65536 in
res 40/00:0c:00:10:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:38:34 kernel: ata5.00: status: { DRDY }
Apr 13 20:38:34 kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 13 20:38:34 kernel: ata5.00: cmd 61/40:08:00:10:95/00:00:00:00:00/40 tag 1 ncq 32768 out
res 40/00:0c:00:10:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:38:34 kernel: ata5.00: status: { DRDY }
Apr 13 20:38:34 kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Apr 13 20:38:34 kernel: ata5.00: cmd 61/40:10:c0:0f:95/00:00:00:00:00/40 tag 2 ncq 32768 out
res 40/00:0c:00:10:95/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:38:34 kernel: ata5.00: status: { DRDY }
Apr 13 20:38:34 kernel: ata5: hard resetting link
Apr 13 20:38:34 kernel: do_marvell_9170_recover: ignoring PCI device (8086:2821) at PCI#0
Apr 13 20:38:34 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 13 20:38:34 kernel: ata5.00: configured for UDMA/33
Apr 13 20:38:34 kernel: ata5: EH complete
--- Disk 6 ---
Apr 13 20:36:41 kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 13 20:36:41 kernel: ata6.00: configured for UDMA/33
Apr 13 20:36:41 kernel: ata6: EH complete
Apr 13 20:36:41 kernel: ata6.00: exception Emask 0x50 SAct 0x807fffc SErr 0x280900 action 0x6 frozen
Apr 13 20:36:41 kernel: ata6.00: irq_stat 0x08000000, interface fatal error
Apr 13 20:36:41 kernel: ata6: SError: { UnrecovData HostInt 10B8B BadCRC }
Apr 13 20:36:41 kernel: ata6.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata6.00: cmd 60/18:10:98:a9:e6/02:00:ee:00:00/40 tag 2 ncq 274432 in
res 40/00:94:30:5e:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata6.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata6.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata6.00: cmd 60/40:18:58:a4:e6/05:00:ee:00:00/40 tag 3 ncq 688128 in
res 40/00:94:30:5e:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata6.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata6.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata6.00: cmd 60/40:20:18:9f:e6/05:00:ee:00:00/40 tag 4 ncq 688128 in
res 40/00:94:30:5e:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
Apr 13 20:36:41 kernel: ata6.00: status: { DRDY }
Apr 13 20:36:41 kernel: ata6.00: failed command: READ FPDMA QUEUED
Apr 13 20:36:41 kernel: ata6.00: cmd 60/40:28:d8:99:e6/05:00:ee:00:00/40 tag 5 ncq 688128 in
res 40/00:94:30:5e:e7/00:00:ee:00:00/40 Emask 0x50 (ATA bus error)
It is of course unlikely that so many disks are bad, even though the disks aren't young anymore. I would be suspicious of the chassis here.
I suggest that you take a backup right now. Because md126 is degraded, one more disk being kicked from the raid could leave you in serious trouble.
=== My recommendations ===
1. Take a backup of your data asap.
2. Turn off the NAS and test each disk in a PC with the WD disk test tool (it can be downloaded from their website); a command-line alternative is sketched after this list.
3. Replace any disks that come out bad.
4. Factory reset the NAS and start over with all healthy disks. Restore data from backups.
5. If the issue re-occurs --> replace the NAS. Keep backups at all times!
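As a rough command-line alternative to step 2 (assuming SSH access and that smartctl is available on the unit), you can run a long SMART self-test on each disk in place and check the result afterwards:

smartctl -t long /dev/sda    # starts the test; repeat for /dev/sdb through /dev/sdf
smartctl -a /dev/sda         # after the quoted run time, check the self-test log for the result

The vendor tool run in a PC is still the more thorough check, but this catches outright failing disks without dismantling the NAS.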
Cheers
ccbnz
Apr 15, 2019 · Aspirant
Hi Hopchen. Thanks so much for looking at my log files. Yikes!
I've backed up the NAS and done a factory reset. The NAS seems to be working fine now but looking at the latest kernel log file, there are still ata errors such as:
Apr 16 01:27:23 kernel: do_marvell_9170_recover: ignoring PCI device (8086:2821) at PCI#0
Apr 16 01:27:23 kernel: ata6.00: exception Emask 0x40 SAct 0x0 SErr 0x800800 action 0x6
Apr 16 01:27:23 kernel: ata6.00: irq_stat 0x40000001
Apr 16 01:27:23 kernel: ata6: SError: { HostInt LinkSeq }
Apr 16 01:27:23 kernel: ata6.00: failed command: WRITE DMA
Apr 16 01:27:23 kernel: ata6.00: cmd ca/00:08:40:d8:1a/00:00:00:00:00/e0 tag 10 dma 4096 out
res 51/10:08:40:d8:1a/00:00:00:00:00/e0 Emask 0xc1 (internal error)
Apr 16 01:27:23 kernel: ata6.00: status: { DRDY ERR }
Apr 16 01:27:23 kernel: ata6.00: error: { IDNF }
Apr 16 01:27:23 kernel: ata6: hard resetting link
Apr 16 01:27:23 kernel: do_marvell_9170_recover: ignoring PCI device (8086:2821) at PCI#0
Apr 16 01:27:23 kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 16 01:27:23 kernel: ata6.00: configured for UDMA/33
Apr 16 01:27:23 kernel: ata6: EH complete
This seems to be happening with pretty much all of the drives. These are different errors from those reported in the log you looked at.
Would you expect the above errors normally?
DMA errors would likely be related to the motherboard, right? I did upgrade the CPU and memory on the motherboard a year or so back, but it's been fine since.
chyzm, do you have similar errors in your kernel log file?
- Hopchen · Apr 16, 2019 · Prodigy
Hi ccbnz
No, these are not normal. Given your explanation of the NAS history and the changing of parts, I definitely think the issue is somewhere in the chassis. I am wondering if the PSU might be going bad...
The NAS might not have proper drivers for the new CPU, but it looks more like a power issue to me. Could be wrong though.
- ccbnz · Apr 16, 2019 · Aspirant
Thanks Hopchen
The NAS seems to be working fine now - albeit with the SATA DMA errors (which may have been happening for ages without my knowledge). I notice that the ATA error count is creeping up a bit on all of the drives - which suggests an interface error. I pulled the unit apart and cleaned the interconnect between the motherboard and the disk backplane board just in case there was a bad connection there.
I guess I'll just keep an eye on it. Do you know if you can buy a compatible power supply these days? The unit still works fine as a NAS and is built to a higher quality than some of the consumer NASes.
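For the "keep an eye on it" part, one rough way to track the error counts over time (assuming SSH access and smartctl on the unit) is:

for d in /dev/sd[a-f]; do echo "== $d =="; smartctl -l error "$d" | grep -i 'ata error count'; done

Drives with no logged errors simply print nothing for the grep line; counts that keep climbing would suggest the interface/backplane issue is still live.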