
Forum Discussion

tomatohead1
Aspirant
Mar 10, 2019
Solved

Volume problems NAS 314

Hi - I replaced a failed disk yesterday. The system resynced, and then I updated the firmware to 6.9.5. Access to the system is now down, and the system webpage tells me "Remove inactive volumes to use ...
    Hopchen
    Mar 10, 2019

    Hi tomatohead1 

     

    Unfortunately, I am not the bearer of good news. You get the "Remove inactive volumes" error because the data volume cannot mount, and in your case it cannot mount because your data raid is not running. As can be seen in the raid config, only the OS raid and the swap raid are running.

    Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md1 : active raid10 sda2[0] sdb2[3] sdd2[2] sdc2[1]
    1046528 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU] <<<=== Swap raid
    
    md0 : active raid1 sda1[0] sdd1[4] sdc1[2] sdb1[5]
    4190208 blocks super 1.2 [4/4] [UUUU] <<<=== OS raid
    
    <<<=== Missing data raid (md127)
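
    If you want to verify this yourself, and assuming SSH access is enabled on the NAS, the same information can be pulled with the standard mdadm tools. A minimal sketch only - the md127 and /dev/sda3 names are taken from the logs above and may differ on your unit:

    # List which md raids are currently assembled and running
    cat /proc/mdstat

    # Details of the data raid, if it exists at all
    mdadm --detail /dev/md127

    # Inspect the raid metadata on an individual member partition, e.g. disk 1's data partition
    mdadm --examine /dev/sda3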


    You recently replaced disk 2. That should normally be fine in a raid 5 (such a raid can tolerate one disk failure). However, disks 1 and 3 do have a few errors on them - 4 ATA errors on each.

    Device: sda
    Channel: 0 <<<=== Bay 1
    ATA Error Count: 4
    
    Device: sdc
    Channel: 2 <<<=== Bay 3
    ATA Error Count: 4
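
    Those counts come from the disks' SMART data. If you want to look at them yourself, and assuming smartctl (smartmontools) is available on the NAS or the disks are attached to a Linux PC, something along these lines would work - the device names follow the bays noted above:

    # Full SMART report for the disk in bay 1, including the ATA error log
    smartctl -a /dev/sda

    # Just the ATA error log for the disk in bay 3
    smartctl -l error /dev/sdc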


    That is not a large number of errors, but that is the thing with disk errors... sometimes one error is enough.

     

    You replaced disk 2 and the raid sync started as normal.

    [19/03/09 01:00:29 EST] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
    [19/03/09 13:26:40 EST] notice:disk:LOGMSG_ADD_DISK Disk Model:TOSHIBA HDWD130 Serial:xxxxxxxx was added to Channel 2 of the head unit.
    [19/03/09 13:26:48 EST] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.

     

    Five hours later, disk 3 dropped out and the data raid "died".

    [19/03/09 18:33:05 EST] notice:volume:LOGMSG_HEALTH_VOLUME Volume data health changed from Degraded to Dead.
    [19/03/09 18:34:54 EST] err:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 3 (Internal) changed state from ONLINE to FAILED.

    Just before disk 3 failed, the kernel logged these messages about it. This is definitely a dodgy disk.

    Mar 09 18:31:52 kernel: ata3.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
    Mar 09 18:31:52 kernel: ata3.00: irq_stat 0x40000008
    Mar 09 18:31:52 kernel: ata3.00: failed command: READ FPDMA QUEUED
    Mar 09 18:31:52 kernel: ata3.00: cmd 60/40:b8:c0:e5:a5/05:00:d0:00:00/40 tag 23 ncq 688128 in
    res 41/40:40:98:e9:a5/00:05:d0:00:00/00 Emask 0x409 (media error) <F>
    Mar 09 18:31:52 kernel: ata3.00: status: { DRDY ERR }
    Mar 09 18:31:52 kernel: ata3.00: error: { UNC }
    Mar 09 18:31:52 kernel: ata3.00: configured for UDMA/133
    Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 Sense Key : Medium Error [current] [descriptor]
    Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
    Mar 09 18:31:52 kernel: sd 2:0:0:0: [sdc] tag#23 CDB: Read(16) 88 00 00 00 00 00 d0 a5 e5 c0 00 00 05 40 00 00
    Mar 09 18:31:52 kernel: blk_update_request: I/O error, dev sdc, sector 3500534168
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096920 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096928 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096936 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096944 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096952 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096960 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096968 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096976 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096984 on sdc3).
    Mar 09 18:31:52 kernel: md/raid:md127: read error not correctable (sector 3491096992 on sdc3).
    Mar 09 18:31:52 kernel: ata3: EH complete
    Mar 09 18:31:56 kernel: do_marvell_9170_recover: ignoring PCI device (8086:3a22) at PCI#0
    Mar 09 18:31:56 kernel: ata3.00: exception Emask 0x0 SAct 0x7f60003f SErr 0x0 action 0x0
    Mar 09 18:31:56 kernel: ata3.00: irq_stat 0x40000008
    Mar 09 18:31:56 kernel: ata3.00: failed command: READ FPDMA QUEUED
    Mar 09 18:31:56 kernel: ata3.00: cmd 60/68:a8:98:e9:a5/01:00:d0:00:00/40 tag 21 ncq 184320 in
    res 41/40:68:98:e9:a5/00:01:d0:00:00/00 Emask 0x409 (media error) <F>
    Mar 09 18:31:56 kernel: ata3.00: status: { DRDY ERR }
    Mar 09 18:31:56 kernel: ata3.00: error: { UNC }
    Mar 09 18:31:56 kernel: ata3.00: configured for UDMA/133
    Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 Sense Key : Medium Error [current] [descriptor]
    Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
    Mar 09 18:31:56 kernel: sd 2:0:0:0: [sdc] tag#21 CDB: Read(16) 88 00 00 00 00 00 d0 a5 e9 98 00 00 01 68 00 00
    Mar 09 18:31:56 kernel: blk_update_request: I/O error, dev sdc, sector 3500534168
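
    The "Unrecovered read error - auto reallocate failed" lines mean the drive hit sectors it could neither read nor remap on the fly. If you want a feel for how widespread the damage is, the matching SMART attributes can be checked with something like this (again assuming smartctl is available; the device name follows the log above):

    # Pending, reallocated and offline-uncorrectable sector counters for disk 3
    smartctl -A /dev/sdc | grep -Ei 'Current_Pending|Reallocated|Offline_Uncorrectable'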

    As a result, the raid sync stopped, since a raid 5 cannot operate on only 2 devices, and the raid was declared "dead" at that point. You have suffered the classic case of replacing a disk only for another disk in the raid to fail during the re-sync (a double disk failure). I would not blame the ReadyNAS for this. A raid sync is a strenuous task for the disks, and a disk that showed only a handful of errors beforehand can "blow up" during it. It is highly advisable to always keep an up-to-date backup available, especially before anything that triggers a raid sync (i.e. replacing a disk).
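
    One way to reduce the odds of being caught out like this is to read-test the remaining members before pulling the failed disk. The data volume is a normal Linux md array underneath, so a consistency check can be triggered manually - a sketch only, and the md127 name is an assumption based on the logs above:

    # Kick off an md consistency check, which reads every member end-to-end
    echo check > /sys/block/md127/md/sync_action

    # Watch progress; a weak disk will throw read errors here or in dmesg during the check
    cat /proc/mdstat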

     

    I would estimate that the recovery possibilities here are decent. Even though disk 3 is not stable, it can still be read by the NAS. The new disk 2 is likely of no use to us, as the raid sync would not have finished before disk 3 dropped out.
    I reckon that, in order to look at recovery, you would need:
    - Disk 1
    - A clone of disk 3 on a new, healthy disk (cloning is needed because disk 3 has proven unstable at this point - see the sketch after this list)
    - Disk 4
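
    For the cloning step, a typical approach is GNU ddrescue with the disks attached to a separate Linux machine. Purely a sketch - the source/target device names and map file are illustrative, and the target must be at least as large as the source:

    # First pass: copy everything that reads cleanly, skipping the difficult areas
    ddrescue -f -n /dev/sdc /dev/sdX rescue.map

    # Second pass: retry the bad areas a few more times
    ddrescue -f -r3 /dev/sdc /dev/sdX rescue.map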

    With those three disks, one could force-assemble the data raid and hope for the best. You might also need to deal with some minor filesystem issues afterwards. Definitely not for the faint of heart.
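
    To give an idea of what that involves, a force-assemble would look roughly like the following. This is only a sketch under several assumptions (partition 3 as the data partition, md127 as the array name, disk 2 left out), and it should only be attempted on clones or with professional help, since forcing an out-of-sync array back together can make things worse:

    # Stop anything that may have partially assembled
    mdadm --stop /dev/md127

    # Force-assemble the data raid from the three usable members
    mdadm --assemble --force /dev/md127 /dev/sda3 /dev/sdc3 /dev/sdd3

    # If it comes up, check the (btrfs) filesystem read-only before mounting it writable
    btrfs check --readonly /dev/md127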


    If you have an up-to-date backup, or if the data is not important, then you can do a factory reset, but make sure you use 100% healthy disks. I would be very hesitant to re-use disk 3. It would be good to test all the disks with the manufacturer's disk-test tool first.
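
    As a complement to the vendor tools, a SMART extended self-test performs a similar full surface scan. A rough sketch, assuming smartctl is available and run once per disk:

    # Start the extended self-test; it runs in the background and can take several hours
    smartctl -t long /dev/sdc

    # Check the result once it has finished
    smartctl -l selftest /dev/sdc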


    If you do need the data, on the other hand, then I would advise making use of NETGEAR's data recovery service. It does of course carry a fee - I believe a couple of hundred bucks - but they should be able to help with the disk cloning and raid assembly.

     

    Cheers

     

     
