Disk fail in X-RAID2, after sync half my files are gone!

rabidh · ‎2018-04-23

Hi, I'm on a ReadyNAS NV+ v2, with RAIDiator 5.3.11.

I had it configured for X-RAID2 with 3x 2TB drives and one older 512GB drive. A few days ago the 512GB drive failed, leaving the array unprotected. No big deal - I stuck a 2TB drive in instead of the failed drive, and left it to sync up.

It did leave me with the ominous message:

RAID sync finished on volume C. The array is still in degraded mode, however. This can be caused by a disk sync failure or failed disks in a multi-parity disk array.

But I figured that wasn't too big a deal - all the hard disks were still lit up green in RAIDiator and everything else seemed ok. However, when I looked this morning half my files were gone! No warning emails or anything.

Before the sync, all my files were there, as I checked and managed to back a few of them up. Looking at the logs I now see pages of:

Apr 23 04:22:13 nv kernel: EXT4-fs error (device dm-0): ext4_find_entry: reading directory #3969027 offset 0
Apr 23 04:22:13 nv kernel: EXT4-fs error (device dm-0): __ext4_get_inode_loc: unable to read inode block - inode=3048192, block=48758863
Apr 23 04:22:13 nv kernel: EXT4-fs error (device dm-0): __ext4_get_inode_loc: unable to read inode block - inode=1051475, block=16777429
Apr 23 04:22:13 nv kernel: EXT4-fs error (device dm-0): __ext4_get_inode_loc: unable to read inode block - inode=1057490, block=16777805
Apr 23 04:22:13 nv kernel: EXT4-fs error (device dm-0): __ext4_get_inode_loc: unable to read inode block - inode=1057491, block=16777805

in system.log.

What can I do? Is there a way to recover these files? I thought this was the whole point of having a NAS.

StephenB · ‎2018-04-23

RAID makes data loss less likely, and it simplifies expansion of storage (without loss of availability).

However, it is not enough to keep your data safe. For that you need backups on other devices.

You are facing data recovery (either using RAID recovery software like ReclaiMe, or using a data recovery service like Netgear's).

But perhaps start by looking at the SMART stats of the remaining disks, and checking for disk errors (reallocated sectors, etc).

rabidh · ‎2018-04-23

It looks like there were some SMART errors on the disk that I put in as a replacement, so I pulled that, restarted with a filesystem check, and some of my files are back (not all). Is there anything else I can do short of sending the NAS off, or pulling drives and spending $200 on ReclaiMe like you suggested?

Luckily I do have some of the more important files backed up elsewhere because I lost some confidence in my ReadyNAS a while back, but I have still lost data that I needed.

Is it Netgear's official position that you can lose data each time you swap a drive on a ReadyNAS? If so it seems like I should probably reconsider my storage choices.

StephenB · ‎2018-04-23

@rabidh wrote:

Is it Netgear's official position that you can lose data each time you swap a drive on a ReadyNAS? If so it seems like I should probably reconsider my storage choices.

Netgear's OS 6 software manual (page 231 of http://www.downloads.netgear.com/files/GDC/READYNAS-100/READYNAS_OS_6_SM_EN.pdf) says

If your data is important enough to store, it is important enough to back up. Data can be lost due to a number of events, including natural disaster (for example, fire or flood), theft, improper data deletion, and hard drive failure. If you regularly back up your data, you can recover your data if any of these situations occur.

If someone else tells you something different, and you actually believe them ... then I have a bridge to sell you

Data that is not backed up is always at risk. And that risk is higher when the RAID array is being resynced due to a disk replacement.

mdgm-ntgr · ‎2018-04-23

Businesses spend e.g. tens of thousands of dollars on servers and still recognise the need to back up any data that's important to them. With a NAS costing a few hundred dollars we don't have some magic fix to prevent any and every possible problem.

When a disk fails in a single-redundant volume you're in a situation of heightened risk till the RAID has been rebuilt.

Using an unhealthy disk as the replacement disk isn't going to help either.

When a disk fails it's important to check that the disks still in the NAS look good, looking at the SMART stats. If other disks are going bad as well then that does affect what the best way forward is.

Some users test new disks using SeaTools for Seagate or WD Data LifeGuard Diagnostics for WD disks before adding them. We do run a quick check of disks before adding them, but that's not going to pick up any and every problem, and sometimes it's quite subjective as to whether you want to use a disk or not. The disk manufacturer's tool can be used to run lengthy disk checks.

Our software based data recovery services involve remotely connecting to NAS units. If the disks have to be sent off that adds considerably to the cost.

If you're wanting to use a data recovery service or data recovery software it's important not to keep on making changes to the data volume. Every change you make has the potential to reduce the chances of further attempts being successful.

Backing up data can be significantly cheaper than data recovery attempts which may ultimately prove completely unsuccessful.

rabidh · ‎2018-04-24

Thanks for the in-depth reply.

It seems particularly unlucky that the replacement drive I put in was faulty. Having just read into it a bit, I wasn't aware that in most RAID systems if one copy of the data becomes corrupt then even though it is duplicated, it'll probably still cause corruption. I guess that's what happened in this case, and probably the more high end systems have options in place to work around that.

Do you have a link to where the NetGear recovery service is? In my.netgear.com and 'purchase service contract' for my device I just see 'There are currently no service contracts available for this product' - I assumed because it was too old (6 years).

mdgm-ntgr · ‎2018-04-24

@rabidh wrote:

So even though the data is duplicated in RAID, if one copy of the data becomes corrupt then the ReadyNAS struggles to detect which copy is the correct one?

No, but if there's a problem with another disk in the array adding a failing disk is going to increase the chance you'll have problems. The failing disk may fail to integrate into the array properly and the rebuilds do put heavy stress on the other disks.

@rabidh wrote:

I guess as you say that's something that a more expensive business offering might sort out...

That's not what I said at all. What I was saying was that the NAS is not magically going to avoid problems that devices costing, say, a hundred times as much can still run into. RAID is great, but it's not a magic fix for any and every possible problem.

@rabidh wrote:

Do you have a link to where the NetGear recovery service is? In my.netgear.com and 'purchase service contract' for my device I just see 'There are currently no service contracts available for this product' - I assumed because it was too old (6 years).

Hmmm. We should still offer data recovery contracts (non-refundable and may be completely unsuccessful).

StephenB · ‎2018-04-24

@rabidh wrote:

It seems particularly unlucky that the replacement drive I put in was faulty. Having just read into it a bit, I wasn't aware that in most RAID systems if one copy of the data becomes corrupt then even though it is duplicated ...

In your case your NAS is using RAID-5. RAID-5 doesn't duplicate your data. Rather it uses parity blocks that allow it to reconstruct data when something is missing.

Putting this in mathematical terms: Imagine a 4-disk RAID-5 array. If disks 1, 2, and 3 have A, B, and C data blocks at sector N, then the fourth disk would have P=A+B+C in that sector. (It doesn't use normal addition, but does something else that has the same effect). Then if disk 3 is replaced, the NAS reconstructs C using P-A-B.

This only works if the remaining disks can all be read (and when all have the correct data). If a disk can't be read during reconstruction, then the reconstruction fails (and the NAS knows that). If a disk is read, but gives the wrong data, then the reconstruction gives the wrong result (and the NAS has no way to detect that). Similarly, if the wrong data was somehow written to one of the disks in the first place (or if a disk write was lost), then the reconstruction will also produce wrong data (and there is no way to detect that).
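To make the arithmetic above concrete: the "something else" used in place of normal addition is bitwise XOR, which is its own inverse, so the same operation both builds P and rebuilds a missing block. Here is a minimal Python sketch of that idea. The 4-byte blocks and the `xor_blocks` helper are made up for illustration; this is not the NAS's actual code.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Hypothetical 4-byte blocks standing in for sector N on disks 1, 2 and 3.
a = b"\x11\x22\x33\x44"
b = b"\xa0\xb0\xc0\xd0"
c = b"\x0f\x0e\x0d\x0c"

# Parity block stored on the fourth disk ("P = A + B + C").
p = xor_blocks(a, b, c)

# Disk 3 is replaced: C is rebuilt from P and the surviving blocks ("P - A - B").
rebuilt_c = xor_blocks(p, a, b)

# Disk 2 returns one wrong byte during the rebuild, without reporting an error.
bad_b = bytes([b[0] ^ 0x01]) + b[1:]
silently_wrong_c = xor_blocks(p, a, bad_b)

print(rebuilt_c == c)         # the clean rebuild matches the original block
print(silently_wrong_c == c)  # the rebuild from bad input does not
```

Nothing in the second rebuild signals a problem: the XOR still produces a block of the right size, it just isn't the original C. That is the "no way to detect that" case.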

@rabidh wrote:

it'll probably still cause corruption ... and probably the more high end systems have options in place to work around that.

Once corruption happens, then there is risk of data loss - that's just as true in high-end enterprise/cloud systems as it is in home NAS.

High-end systems have some features which can reduce the chance of corruption happening in the first place. For instance

Error-correcting RAM
Dual Power Supplies to help ensure that a PSU failure doesn't result in lost writes.
UPS protection

BTW, UPS protection is something I always recommend (for all NAS). Often data corruption occurs with unexpected power loss.

Also if you have more disks in the NAS, there are some advanced RAID modes that can handle more than one failed disk. There is a price for that (both reduction in capacity and lower performance). And they don't help if the wrong data is on one or more disks. They only help if the disk can't be read.

Newer OS-6 ReadyNAS (at all price points) do have some features that are relevant here. They have more scheduled maintenance functions, that can detect issues sooner. They also use a newer file system called BTRFS, which supports built-in checksums that can detect corruption. That also gives those NAS some more sophisticated options for reconstruction.

But for all storage (enterprise and home) the primary defense against data loss/corruption is to have independent backups - full copies of the data on other devices.

rabidh · ‎2018-04-24

Thanks - and you're totally right about the load issues, as it seems that one of the other disks just reported SMART errors as well - so that won't have helped the reconstruction either.

I do have a UPS, as well as a separate computer running scheduled rsync backups, and different makes and models of hard disk in the NAS to try and avoid 2 disks going at the same time - so it's still frustrating to have lost data. I guess I should have invested in more storage and rsynced *everything*, not just the super important data.

It sounds like OS6 with BTRFS and scheduled checks is a real improvement. It's just a shame older Netgear devices aren't kept updated - if there had been scheduled checks (or alert emails via gmail hadn't silently stopped working) then this most likely could have been avoided.

rabidh · ‎2018-04-26

I'm posting again here as it looks like the new post I started on this got locked/hidden/deleted somehow? https://community.netgear.com/t5/Using-your-ReadyNAS/quot-Status-Spare-Inactive-quot-on-previously-o...

After rebooting the ReadyNAS comes up with all 3 drives showing 'Ok', and with no filesystem errors when a check is run. I'm able to copy data off just great.

However, some files (I'm not sure which ones) cause the ReadyNAS to drop one of the 3 drives - turning it from "Ok" to "Spare Inactive". No alerts are created in the console. After that, rsync fails on the majority of files with an input/output error - however a restart of the NAS shows no filesystem errors and everything starts working again.

Is there anything I can do to avoid this and get the ReadyNAS to keep all 3 drives in the array all the time while I copy the files off? Which log files should I look at to find why it's dropped the volume?

I have shell access to it and I'm a long-time Linux user (10+ years), so if this is a timeout or some setting that could be modified I'm happy to dig in.

What are my options here apart from ReclaiMe? As previously stated NetGear's support options (paid or not) are not available to me for some reason.

What if I copied all the partitions off all the drives to a new hard disk on my Linux PC (with dd conv=sync,noerror)? Is there enough metadata that mdadm could reconstruct the volumes automatically, or could I get the information needed to reconstruct from the ReadyNAS somehow?

Looking at mdstat I have several different RAID volumes using multiple partitions (I guess due to XRAID-2) so it's not going to be a matter of just setting up a single RAID5 array from them - looks like I'd have to link the 2 RAID5 arrays somehow (are they all just concatenated together?).

StephenB · ‎2018-04-26

The safest thing to do is to clone all three drives to new ones. Then power down the NAS, insert the three clones, and power up. Then your original disks remain completely intact (no chance of more issues).

You could alternatively just clone the drive that is dropping out. Then power down, swap the problem drive with the clone, and power up read-only using the boot menu.

Have you looked at the SMART stats on the drive that is dropping out? There should be something in the logs related to the drive health, mdadm issues, or btrfs issues.

rabidh · ‎2018-04-26

Perfect - thanks! So if I download the logs using the admin menu, I'm looking in system.log?

I've got some new 6TB drives arriving tomorrow for a new NAS. While they're not the same size (or supported above 4TB on the NV+ v2?) would it work if I just copy the old drive's contents onto the first 2TB of the new one - or do I need brand new 2TB drives?

While I don't have enough drives (at the moment) to replace all the drives with new, I'll image all 3 drives into existing storage so if it all goes wrong I can recover 🙂

There was one SMART alert on the drive that's having issues right after the initial sync (Reallocated sector count from 8 to 9), but nothing would appear to have been getting too much worse.

However:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   165   159   021    Pre-fail  Always       -       8725
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6035
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   018   018   000    Old_age   Always       -       60409
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       6633147
194 Temperature_Celsius     0x0022   107   101   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       4
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       6

UDMA_CRC_Error_Count looks pretty disastrous? Does UDMA imply a problem with the SATA link itself rather than the disk though?

Even if there were nothing else, Power_On_Hours is pretty high.

rabidh · ‎2018-04-26

Thanks! I'll give that a go. While I don't have enough drives to replace them all, I will save disk images of all of them so I can back them up.

I only had one alert reported about the reallocated sector count rising from 8 to 9. Looking at the drive stats it doesn't look too bad (apart from the almost 7 years of continuous running):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   165   159   021    Pre-fail  Always       -       8725
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6035
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   018   018   000    Old_age   Always       -       60409
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       6633147
194 Temperature_Celsius     0x0022   107   101   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       4
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       6

However UDMA_CRC_Error_Count looks huge. I had a read around this and people are saying it's to do with the SATA link. Obviously there aren't any cables involved, so apart from re-seating the drive in the connector with the NAS off (which I've done), is there anything I can do?

Would swapping the drive to a new bay of the NAS confuse it? Of course it may be the SATA controller on the drive itself that's dead/dying.

StephenB · ‎2018-04-26

@rabidh wrote:

There was one SMART alert on the drive that's having issues right after the initial sync (Reallocated sector count from 8 to 9), but nothing would appear to have been getting too much worse.

Reallocated Sectors: 9

Pending Sectors: 8

Uncorrectable Errors: 4

Not horribly broken, but not great either.

@rabidh wrote:

UDMA_CRC_Error_Count looks pretty disastrous? Does UDMA imply a problem with the SATA link itself rather than the disk though?

They are errors detected on the SATA link by the drive. So potential causes are the SATA backplane/connections, the NAS SATA interface electronics, and the drive's SATA interface electronics. Are the counts rising?
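One way to answer "are the counts rising?" is to save the SMART attribute table twice, some hours apart, and compare the raw values. Below is a rough Python sketch, assuming the ten-column attribute layout posted earlier in the thread (the format `smartctl -A` prints). The function names are made up, the "before" rows are copied from the thread, and the "after" rows are invented numbers purely to show the comparison.

```python
def raw_values(smart_table: str) -> dict:
    """Map attribute name -> raw value for each ten-column attribute row."""
    values = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the header and blanks do not.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                pass  # some drives print non-numeric raw values; skip those
    return values


def rising(before: str, after: str, watch: tuple) -> dict:
    """Return {name: (old, new)} for watched attributes whose raw value grew."""
    old, new = raw_values(before), raw_values(after)
    return {
        name: (old[name], new[name])
        for name in watch
        if name in old and name in new and new[name] > old[name]
    }


WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable", "UDMA_CRC_Error_Count")

# Rows copied from the SMART output posted in this thread.
before = """
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
"""

# HYPOTHETICAL later snapshot: the raw values here are invented for the example.
after = """
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       12
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
"""

print(rising(before, after, WATCH))
```

If UDMA_CRC_Error_Count stays flat between snapshots while the drive is under load, the link errors are probably historical rather than ongoing.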

StephenB · ‎2018-04-26

@rabidh wrote:

While I don't have enough drives (at the moment) to replace all the drives with new, I'll image all 3 drives into existing storage so if it all goes wrong I can recover 🙂

That will also work. Just make sure the imaging does full sector-by-sector copying of everything on the disks.

For better or worse, the image/clone won't identify which sectors weren't properly copied. So there can be some corruption when you use the clone, since RAID can't tell which sectors it needs to reconstruct.
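If you want a record of which areas could not be read, the copy has to log them itself. Here is a minimal Python sketch of the idea: copy block by block, zero-fill anything unreadable (much as `dd conv=sync,noerror` does), and return the byte offsets that failed. The function name, the 4096-byte block size and the `read_block` parameter are assumptions for illustration; it relies on `os.pread`, so it is Unix-only. Dedicated tools such as GNU ddrescue keep a similar map of bad areas and are the better choice for a real recovery.

```python
import os


def image_with_error_log(src_path, dst_path, block_size=4096, read_block=os.pread):
    """Copy src to dst block by block, zero-filling unreadable blocks.

    Returns the list of byte offsets whose block could not be fully read.
    read_block is only a hook so the error path can be exercised in a test.
    """
    bad_offsets = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        size = os.lseek(src, 0, os.SEEK_END)  # file size, or device size on Linux
        with open(dst_path, "wb") as dst:
            for offset in range(0, size, block_size):
                want = min(block_size, size - offset)
                try:
                    data = read_block(src, want, offset)
                except OSError:
                    data = b""  # unreadable: the whole block becomes zeros
                if len(data) < want:
                    bad_offsets.append(offset)
                # Pad so every later block stays at its original offset.
                dst.write(data.ljust(want, b"\0"))
    finally:
        os.close(src)
    return bad_offsets
```

Called as `image_with_error_log("/dev/sdX", "disk.img")` (device name hypothetical), the returned offsets show where the image holds zeros instead of real data, which narrows down the files that might be affected.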

StephenB · ‎2018-04-26

@rabidh wrote:

UDMA_CRC_Error_Count looks pretty disastrous? Does UDMA imply a problem with the SATA link itself rather than the disk though?

They are errors detected on the SATA link by the drive. So potential causes are the SATA backplane/connections, the NAS SATA interface electronics, and the drive's SATA interface electronics. Are the counts rising?

Just wanted to add that this could explain the dropout of the drive - the NAS disk drivers might be declaring the interface dead.

You could also try powering down, and moving the drive to a different bay. If it is the SATA link (and not the drive) the array might stay up. Still best to boot the system in read-only mode.

rabidh · ‎2018-04-29

Just an update on this...

The UDMA error count hasn't gone up, so it seems that was a bit of a red herring.

However, I took that drive out and plugged it into my PC, then used `dd` with `conv=sync,noerror` and cloned it onto the 2TB drive that I'd used originally when the whole thing stopped working (I backed up *all* the drives onto a 6TB drive just in case). I got 7 IO errors from the drive I was reading, but that was it - the copy sailed through.

I put the cloned drive in, turned it on, and it now works great. I'm sure those 7 IO errors mean maybe 7 files are slightly corrupt, but that's a hell of a lot better than 2TB of lost data...

So, it looks like:

I had a legit failure of the 512GB disk, and at the same time one 2TB drive was silently a little flaky
When I swapped the 512GB disk out with the 2TB one, the ReadyNAS had an IO error and just freaked out, refusing to set it up as part of the volume and also dropping the 2TB disk from the array!
I then rebooted and all the drives came back, but as soon as I started to copy I'd hit one of those bad sectors on the disk, get an IO error, and the ReadyNAS would drop the entire volume until I rebooted again.

So yeah, not impressed with ReadyNAS on this. I can understand dropping a volume due to IO errors when you're in a redundant array, but doing so when in an unprotected array *and sending no alert messages about it at all* seems like a really bad choice. The lack of official support from Netgear when the solution was so simple was a bit of an eye-opener too.

After two different ReadyNAS and 7 years of ownership I received a Synology NAS yesterday. While the build quality isn't as good as the ReadyNAS I'm blown away by the software (and the speed!) - I'm a total convert.

I'll still be keeping separate backups though 🙂

StephenB · ‎2018-04-29

FWIW, both vendors use the same linux tools to build their RAID arrays (mdadm), so the response to a disk error would likely be identical with your Synology.
