
Forum Discussion

tomatohead1
Aspirant
Mar 10, 2019
Solved

Volume problems NAS 314

Hi - I replaced a failed disk yesterday. The system resynced, and then I updated the firmware to 6.9.5. Access to the system is now down, and the system webpage tells me "Remove inactive volumes to use ...
    Hopchen
    Mar 10, 2019

    Hi tomatohead1 

     

    Unfortunately, I am not the bearer of good news. You get the "Remove inactive volumes" error because the data volume cannot mount, and in your case it cannot mount because your data raid is not running. As can be seen in the raid config, only the OS raid and the swap raid are running.

    Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md1 : active raid10 sda2[0] sdb2[3] sdd2[2] sdc2[1]
    1046528 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU] <<<=== Swap raid
    
    md0 : active raid1 sda1[0] sdd1[4] sdc1[2] sdb1[5]
    4190208 blocks super 1.2 [4/4] [UUUU] <<<=== OS raid
    
    <<<=== Missing data raid (md127)
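
    If you want to verify this yourself, and assuming SSH access is enabled on the NAS, the same information can be pulled with the standard mdadm tools. A minimal sketch only - the md127 and /dev/sda3 names are taken from the logs above and may differ on your unit:

    # List which md raids are currently assembled and running
    cat /proc/mdstat

    # Details of the data raid, if it exists at all
    mdadm --detail /dev/md127

    # Inspect the raid metadata on an individual member partition, e.g. disk 1's data partition
    mdadm --examine /dev/sda3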


    You recently replaced disk 2. That should normally be fine in a raid 5 (such a raid can tolerate one disk failure). However, disks 1 and 3 do have a few errors on them - 4 ATA errors on each.

    Device: sda
    Channel: 0 <<<=== Bay 1
    ATA Error Count: 4
    
    Device: sdc
    Channel: 2 <<<=== Bay 3
    ATA Error Count: 4
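
    Those counts come from the disks' SMART data. If you want to look at them yourself, and assuming smartctl (smartmontools) is available on the NAS or the disks are attached to a Linux PC, something along these lines would work - the device names follow the bays noted above:

    # Full SMART report for the disk in bay 1, including the ATA error log
    smartctl -a /dev/sda

    # Just the ATA error log for the disk in bay 3
    smartctl -l error /dev/sdc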


    That is not a large number of errors, but that is the thing with disk errors... sometimes one error is enough.

     

    You replaced disk 2 and the raid sync started as normal.

    [19/03/09 01:00:29 EST] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
    [19/03/09 13:26:40 EST] notice:disk:LOGMSG_ADD_DISK Disk Model:TOSHIBA HDWD130 Serial:xxxxxxxx was added to Channel 2 of the head unit.
    [19/03/09 13:26:48 EST] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.

     

    Five hours later, disk 3 dropped out and the data raid "died".

    [19/03/09 18:33:05 EST] notice:volume:LOGMSG_HEALTH_VOLUME Volume data health changed from Degraded to Dead.
    [19/03/09 18:34:54 EST] err:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 3 (Internal) changed state from ONLINE to FAILED.

    Just before disk 3 failed, the kernel logged these messages about it. This is definitely a dodgy disk.

    Mar 09 18:31:52 kernel: ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
    Mar 09 18:31:52 kernel: ata3.00: irq_stat 0x40000008
    Mar 09 18:31:52 kernel: ata3.00: failed command: READ FPDMA QUEUED
    Mar 09 18:31:52 kernel: ata3.00: cmd 60/40:b8:c0:e5:a5/05:00:d0:00:00/40 tag 23 ncq 688128 in
    res 41/40:40:98:e9:a5/00:05:d0:00:00/00 Emask 0x409 (media error) <F>
    Mar 09 18:31:52 kernel: ata3.00: status: { DRDY ERR }
    Mar 09 18:31:52 kernel: ata3.00: error: { UNC }
    Mar 09 18:31:52 kernel: ata3.00: configured for UDMA/133
    Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 Sense Key : Medium Error [current] [descriptor]
    Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
    Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 CDB: Read(16) 88 00 00 00 00 00 d0 a5 e5 c0 00 00 05 40 00 00
    Mar 09 18:31:52 kernel: blk_update_request: I/O error, dev sdc, sector 3500534168
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096920 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096928 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096936 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096944 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096952 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096960 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096968 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096976 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096984 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096992 on sdc3).
    Mar 09 18:31:52 kernel: ata3: EH complete
    Mar 09 18:31:56 kernel: do_marvell_9170_recover: ignoring PCI device (8086:3a22) at PCI#0
    Mar 09 18:31:56 kernel: ata3.00: exception Emask 0x0 SAct 0x7f60003f SErr 0x0 action 0x0
    Mar 09 18:31:56 kernel: ata3.00: irq_stat 0x40000008
    Mar 09 18:31:56 kernel: ata3.00: failed command: READ FPDMA QUEUED
    Mar 09 18:31:56 kernel: ata3.00: cmd 60/68:a8:98:e9:a5/01:00:d0:00:00/40 tag 21 ncq 184320 in
    res 41/40:68:98:e9:a5/00:01:d0:00:00/00 Emask 0x409 (media error) <F>
    Mar 09 18:31:56 kernel: ata3.00: status: { DRDY ERR }
    Mar 09 18:31:56 kernel: ata3.00: error: { UNC }
    Mar 09 18:31:56 kernel: ata3.00: configured for UDMA/133
    Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 Sense Key : Medium Error [current] [descriptor]
    Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
    Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 CDB: Read(16) 88 00 00 00 00 00 d0 a5 e9 98 00 00 01 68 00 00
    Mar 09 18:31:56 kernel: blk_update_request: I/O error, dev sdc, sector 3500534168
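
    The "Unrecovered read error - auto reallocate failed" lines mean the drive hit sectors it could neither read nor remap on the fly. If you want a feel for how widespread the damage is, the matching SMART attributes can be checked with something like this (again assuming smartctl is available; the device name follows the log above):

    # Pending, reallocated and offline-uncorrectable sector counters for disk 3
    smartctl -A /dev/sdc | grep -Ei 'Current_Pending|Reallocated|Offline_Uncorrectable'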

    As a result, the raid sync stopped, since a raid 5 cannot operate on only 2 devices, and the raid was declared "dead" at that point. You have suffered the classic case of replacing a disk only for another disk in the raid to fail during the re-sync (a double disk failure). I would not blame the ReadyNAS for this. A raid sync is a strenuous task for the disks, and a disk that showed only a handful of errors beforehand can "blow up" during it. It is highly advisable to always keep an up-to-date backup available, especially before anything that triggers a raid sync (i.e. replacing a disk).
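
    One way to reduce the odds of being caught out like this is to read-test the remaining members before pulling the failed disk. The data volume is a normal Linux md array underneath, so a consistency check can be triggered manually - a sketch only, and the md127 name is an assumption based on the logs above:

    # Kick off an md consistency check, which reads every member end-to-end
    echo check > /sys/block/md127/md/sync_action

    # Watch progress; a weak disk will throw read errors here or in dmesg during the check
    cat /proc/mdstat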

     

    I would estimate that the recovery possibilities here are decent. Even though disk 3 is not stable, it can still be read by the NAS. The new disk 2 is likely of no use to us, as the raid sync would not have finished before disk 3 dropped out.
    I reckon that, in order to look at recovery, you would need:
    - Disk 1
    - A clone of disk 3 on a new, healthy disk (cloning is needed because disk 3 has proven unstable at this point - see the sketch after this list)
    - Disk 4
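
    For the cloning step, a typical approach is GNU ddrescue with the disks attached to a separate Linux machine. Purely a sketch - the source/target device names and map file are illustrative, and the target must be at least as large as the source:

    # First pass: copy everything that reads cleanly, skipping the difficult areas
    ddrescue -f -n /dev/sdc /dev/sdX rescue.map

    # Second pass: retry the bad areas a few more times
    ddrescue -f -r3 /dev/sdc /dev/sdX rescue.map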

    With those three disks, one could force-assemble the data raid and hope for the best. You might also need to deal with some minor filesystem issues afterwards. Definitely not for the faint of heart.
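
    To give an idea of what that involves, a force-assemble would look roughly like the following. This is only a sketch under several assumptions (partition 3 as the data partition, md127 as the array name, disk 2 left out), and it should only be attempted on clones or with professional help, since forcing an out-of-sync array back together can make things worse:

    # Stop anything that may have partially assembled
    mdadm --stop /dev/md127

    # Force-assemble the data raid from the three usable members
    mdadm --assemble --force /dev/md127 /dev/sda3 /dev/sdc3 /dev/sdd3

    # If it comes up, check the (btrfs) filesystem read-only before mounting it writable
    btrfs check --readonly /dev/md127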


    If you have an up-to-date backup, or if the data is not important, then you can do a factory reset, but make sure you use 100% healthy disks. I would be very hesitant to re-use disk 3. It would be good to test all the disks with the manufacturer's disk-test tool first.
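
    As a complement to the vendor tools, a SMART extended self-test performs a similar full surface scan. A rough sketch, assuming smartctl is available and run once per disk:

    # Start the extended self-test; it runs in the background and can take several hours
    smartctl -t long /dev/sdc

    # Check the result once it has finished
    smartctl -l selftest /dev/sdc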


    If you do need the data, on the other hand, then I would advise making use of NETGEAR's data recovery service. It does of course carry a fee - I believe a couple of hundred bucks - but they should be able to help with the disk cloning and raid assembly.

     

    Cheers

     

     
