Forum Discussion
tomatohead1
Mar 10, 2019 · Aspirant
Volume problems NAS 314
Hi - I replaced a failed disk yesterday. The system resynced, and then I updated the firmware to 6.9.5.
Access to the system is now down, and the system webpage tells me "Remove inactive volumes to use ...
Hopchen
Mar 10, 2019 · Prodigy
Hi tomatohead1
Well, it looks like more than one disk might have issues. The data raid is likely not starting because more than one disk in your raid 5 configuration is in trouble.
If you want, download the logs, upload them somewhere like Google Drive (or similar), and PM me the link. I can take a look for you.
Cheers
tomatohead1
Mar 10, 2019 · Aspirant
Thanks! Link sent...
t
- Hopchen · Mar 10, 2019 · Prodigy
Hi tomatohead1
Unfortunately, I am not the bearer of good news. The reason you get the "Remove inactive volumes" error is that the data volume cannot mount. In your case, it cannot mount because your data raid is not running. As can be seen in the raid config, only the OS raid and the swap raid are running.
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid10 sda2[0] sdb2[3] sdd2[2] sdc2[1]
      1046528 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]   <<<=== Swap raid
md0 : active raid1 sda1[0] sdd1[4] sdc1[2] sdb1[5]
      4190208 blocks super 1.2 [4/4] [UUUU]                             <<<=== OS raid
                                                                        <<<=== Missing data raid (md127)
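For reference, that state comes straight from the kernel's md status. If you have SSH root access to the NAS you can check it yourself - a small sketch, assuming the partition and array names shown above:

cat /proc/mdstat            # lists every md array the kernel has assembled
mdadm --examine /dev/sda3   # reads the raid superblock on a data-raid member partition,
                            # which works even when the array itself is not running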
You recently replaced disk 2. That should normally be fine in a raid 5 (such a raid can tolerate one disk failure). Disks no. 1 and no. 3 do have a few errors on them - 4 ATA errors each.
Device: sda    Channel: 0    <<<=== Bay 1
ATA Error Count: 4
Device: sdc    Channel: 2    <<<=== Bay 3
ATA Error Count: 4
This is not a big number of errors, but that is the thing with disk errors... sometimes one error is enough. You replaced disk 2 and the raid sync started as per normal.
[19/03/09 01:00:29 EST] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[19/03/09 13:26:40 EST] notice:disk:LOGMSG_ADD_DISK Disk Model:TOSHIBA HDWD130 Serial:xxxxxxxx was added to Channel 2 of the head unit.
[19/03/09 13:26:48 EST] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
Five hours later, disk 3 dropped out and the data raid "died".
[19/03/09 18:33:05 EST] notice:volume:LOGMSG_HEALTH_VOLUME Volume data health changed from Degraded to Dead.
[19/03/09 18:34:54 EST] err:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 3 (Internal) changed state from ONLINE to FAILED.
Just before disk 3 failed, we see these kernel messages about it. This is definitely a dodgy disk.
Mar 09 18:31:52 kernel: ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Mar 09 18:31:52 kernel: ata3.00: irq_stat 0x40000008
Mar 09 18:31:52 kernel: ata3.00: failed command: READ FPDMA QUEUED
Mar 09 18:31:52 kernel: ata3.00: cmd 60/40:b8:c0:e5:a5/05:00:d0:00:00/40 tag 23 ncq 688128 in
                        res 41/40:40:98:e9:a5/00:05:d0:00:00/00 Emask 0x409 (media error) <F>
Mar 09 18:31:52 kernel: ata3.00: status: { DRDY ERR }
Mar 09 18:31:52 kernel: ata3.00: error: { UNC }
Mar 09 18:31:52 kernel: ata3.00: configured for UDMA/133
Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 Sense Key : Medium Error [current] [descriptor]
Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 CDB: Read(16) 88 00 00 00 00 00 d0 a5 e5 c0 00 00 05 40 00 00
Mar 09 18:31:52 kernel: blk_update_request: I/O error, dev sdc, sector 3500534168
Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096920 on sdc3).
Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096928 on sdc3).
Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096936 on sdc3).
Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096944 on sdc3).
Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096952 on sdc3).
Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096960 on sdc3).
Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096968 on sdc3).
Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096976 on sdc3).
Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096984 on sdc3).
Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096992 on sdc3).
Mar 09 18:31:52 kernel: ata3: EH complete
Mar 09 18:31:56 kernel: do_marvell_9170_recover: ignoring PCI device (8086:3a22) at PCI#0
Mar 09 18:31:56 kernel: ata3.00: exception Emask 0x0 SAct 0x7f60003f SErr 0x0 action 0x0
Mar 09 18:31:56 kernel: ata3.00: irq_stat 0x40000008
Mar 09 18:31:56 kernel: ata3.00: failed command: READ FPDMA QUEUED
Mar 09 18:31:56 kernel: ata3.00: cmd 60/68:a8:98:e9:a5/01:00:d0:00:00/40 tag 21 ncq 184320 in
                        res 41/40:68:98:e9:a5/00:01:d0:00:00/00 Emask 0x409 (media error) <F>
Mar 09 18:31:56 kernel: ata3.00: status: { DRDY ERR }
Mar 09 18:31:56 kernel: ata3.00: error: { UNC }
Mar 09 18:31:56 kernel: ata3.00: configured for UDMA/133
Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 Sense Key : Medium Error [current] [descriptor]
Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 CDB: Read(16) 88 00 00 00 00 00 d0 a5 e9 98 00 00 01 68 00 00
Mar 09 18:31:56 kernel: blk_update_request: I/O error, dev sdc, sector 3500534168
As a result, the raid sync stops, since this 4-disk raid 5 cannot operate with only 2 healthy members, and the raid is declared "dead" at this point. You have suffered the classic case of replacing a disk and having another disk in the raid fail during the re-sync (a double disk failure). I would not blame the ReadyNAS for this. A raid sync is a strenuous task for the disks, and a disk that showed only a handful of errors before can blow up during the sync. It is highly advisable to always keep an up-to-date backup available, especially before a raid sync (i.e. before replacing a disk).
I would estimate that the recovery possibilities here are decent. Even though disk 3 is not stable, it can still be read by the NAS. The new disk 2 is likely of no use to us, as the raid sync had not finished when disk 3 dropped out.
I reckon that, in order to look at recovery, you would need:
- Disk 1
- A clone of disk 3 onto a new, healthy disk (disk 3 has proven unstable at this point, so you do not want to rely on the original)
- Disk 4

With those 3 disks one could force-assemble the data raid and hope for the best - a rough sketch of what that might involve is below. You might also need to deal with some minor filesystem issues afterwards. Definitely not for the faint of heart.
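To give an idea, here is a sketch from an SSH root shell. The device names, the md127 array name and the /data mount point are assumptions based on the layout shown above - anyone attempting this should verify them against their own logs and work on clones rather than the original disks:

# 1) Clone the unstable disk 3 onto a fresh disk of equal or larger size
#    (assumed here to appear as /dev/sde, e.g. attached via USB) using GNU
#    ddrescue, which retries and maps bad sectors instead of aborting.
ddrescue -f -r3 /dev/sdc /dev/sde sdc_rescue.map

# 2) With the clone back in bay 3, stop any half-assembled array and try to
#    force-assemble the data raid from disk 1, the cloned disk 3 and disk 4.
mdadm --stop /dev/md127
mdadm --assemble --force /dev/md127 /dev/sda3 /dev/sdc3 /dev/sdd3

# 3) If the array comes up, inspect it and attempt a read-only mount before
#    writing anything.
cat /proc/mdstat
mdadm --detail /dev/md127
mount -o ro /dev/md127 /data

The read-only mount is deliberate: if the force-assemble succeeds, the priority is to copy the data off before putting any more strain on these disks.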
If you have an up-to-date backup, or if the data is not important, then you can factory reset, but make sure you use 100% healthy disks. I would be very hesitant to reuse disk 3. It might be good to test all disks with the manufacturer's disk-test tool.
If you do need the data, on the other hand, then I would advise making use of NETGEAR's data recovery service. This will of course carry a fee - I believe it is a couple hundred bucks. They should be able to help with disk cloning and raid assembly.
Cheers
- tomatohead1 · Mar 10, 2019 · Aspirant
Thanks, Hopchen! You do a great service to the community by making your time and knowledge available to us.
I'll contact Netgear data recovery. Is this something they can do remotely?
I do not have backups for much of the data. I had set up a Dropbox account for remote backup, but at some point, either with a firmware update or because I set it up incorrectly, the backup service stopped working.
Can you recommend a setup that is more reliable? Perhaps nothing works better than keeping this NAS and making sure a backup is always in place. But if multiple disks can crash without warning, I would think there might be a more reliable solution.
In any event, thanks again for your help. You're great!
Tom
- Hopchen · Mar 11, 2019 · Prodigy
Hi again
No problem. Happy to help.
Yes, the data recovery service is all done remotely. Only in very rare cases would NETGEAR ask for the unit to be sent in. There is no guarantee that they can fix it, but it is worth having a discussion with them, I think. Even if they cannot recover it for whatever reason, there are even more specialised services out there (a lot more $$) if the data is really, really important. However, I think the outlook is decent enough for you here, even though I have not seen the unit live. It might be advisable to power off the NAS until you get in contact with support, so as not to add any more strain on those disks right now.
I am glad that you are considering keeping better, up-to-date backups. I will tag Sandshark and StephenB as they are pretty good with advice on the various backup strategies that they use.
Your (indirect) question about raid 5 is a good one. There are many debates about this from all corners of the Internet. The weakness of a raid 5 is during the re-sync of the raid when a disk needs replacing. In that period, while the NAS is adding the new disk into your raid array, your NAS is vulnerable (and under the most strain as well) - as you found out. But then again, raid 5 is still a good overall solution in a 4-bay NAS because it balances redundancy against capacity well: for example, four 3 TB disks in raid 5 give roughly 9 TB of usable space while still tolerating a single disk failure.
HDDs are typically very reliable, and the issue you have hit is not a very common occurrence. A more common scenario is that people actually have multiple quite bad disks and then start replacing them without realising that they are putting their volume in serious jeopardy. Many also wait to replace disks until they are totally dead. I think that is a mistake, and in my opinion any disk errors should be taken seriously. I personally replace disks at any sign of errors, but some might say that is over the top.
I think a good option would be to check the health of all disks prior to replacing a bad disk. You can download the logs and look at disk_info.log.
Here you will see all your disks' current health. The stats to be concerned about are:
- ATA Error Count
- Reallocated Sectors
- Reallocation Events
- Spin Retry Count
- End-to-End Errors
- Command Timeouts
- Current Pending Sector Count
- Uncorrectable Sector Count
I'd say that if your disks exhibit any of these errors prior to accepting a new disk into the array - be sure to back up first!
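If you have SSH access, the same counters can also be read directly with smartmontools - a small sketch, assuming smartctl is installed on the NAS (sda here stands for the disk in bay 1; repeat for sdb, sdc, sdd):

# Print only the attributes called out above for one disk.
smartctl -a /dev/sda \
  | egrep -i 'ata error|reallocat|spin_retry|end-to-end|command_timeout|pending|uncorrect'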