rabidh · Apr 23, 2018 · Aspirant
Disk fail in X-RAID2, after sync half my files are gone!
Hi, I'm on a ReadyNAS NV+ v2, with RAIDiator 5.3.11. I had it configured for X-RAID2 with 3x 2TB drives and one older 512GB drive. A few days ago the 512GB drive failed, leaving the array unprote...
- Apr 24, 2018
rabidh wrote:
It seems particularly unlucky that the replacement drive I put in was faulty. Having just read into it a bit, I wasn't aware that in most RAID systems if one copy of the data becomes corrupt then even though it is duplicated ...
In your case your NAS is using RAID-5. RAID-5 doesn't duplicate your data. Rather it uses parity blocks that allow it to reconstruct data when something is missing.
Putting this in mathematical terms: imagine a 4-disk RAID-5 array. If disks 1, 2, and 3 hold data blocks A, B, and C at sector N, then the fourth disk holds P = A + B + C in that sector. (It doesn't use normal addition, but something else that has the same effect.) Then if disk 3 is replaced, the NAS reconstructs C as P - A - B.
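The "something else" is bitwise XOR: each parity bit is the XOR of the data bits in the same position, and because XOR is its own inverse, the "subtraction" is just XOR again. Below is a minimal sketch with three made-up 4-byte blocks. Real md RAID-5 works on much larger chunks and rotates the parity block across all disks; the names `xor_blocks`, `A`, `B`, `C` and `P` are illustrative only.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks (RAID-5 style parity)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Hypothetical data blocks at sector N on disks 1-3.
A = b"\x10\x20\x30\x40"
B = b"\x01\x02\x03\x04"
C = b"\xaa\xbb\xcc\xdd"

# What the fourth disk stores ("P = A + B + C").
P = xor_blocks(A, B, C)

# Disk 3 is replaced: rebuild C from the survivors ("C = P - A - B").
rebuilt_C = xor_blocks(P, A, B)
print(rebuilt_C == C)  # True
```

The same call rebuilds any single missing block, which is why RAID-5 survives exactly one unreadable disk.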
This only works if the remaining disks can all be read (and all hold the correct data). If a disk can't be read during reconstruction, then the reconstruction fails (and the NAS knows that). If a disk is read but returns the wrong data, then the reconstruction gives the wrong result (and the NAS has no way to detect that). Similarly, if the wrong data was somehow written to one of the disks in the first place (or if a disk write was lost), then the reconstruction will also give the wrong result (and there is no way to detect that).
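To see why a wrong read goes unnoticed, here is a sketch using the same made-up blocks: disk 2 returns a block with one flipped bit but raises no error. The rebuild runs normally and produces a wrong block, and with disk 3 gone there is no spare redundancy left to check it against. The flipped bit is a hypothetical fault chosen for illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks (RAID-5 style parity)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


A = b"\x10\x20\x30\x40"
B = b"\x01\x02\x03\x04"
C = b"\xaa\xbb\xcc\xdd"
P = xor_blocks(A, B, C)

# Disk 2 silently returns a block with the lowest bit of byte 0 flipped.
B_bad = bytes([B[0] ^ 0x01]) + B[1:]

# Disk 3 is being rebuilt, so C has to be computed from P, A and the bad B.
rebuilt_C = xor_blocks(P, A, B_bad)
print(rebuilt_C == C)   # False: the rebuilt block is wrong
print(rebuilt_C.hex())  # abbbccdd (the flipped bit carried straight into C)
```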
rabidh wrote:
it'll probably still cause corruption ... and probably the more high end systems have options in place to work around that.
Once corruption happens, then there is risk of data loss - that's just as true in high-end enterprise/cloud systems as it is in home NAS.
High-end systems have some features which can reduce the chance of corruption happening in the first place. For instance:
- Error-correcting RAM
- Dual Power Supplies to help ensure that a PSU failure doesn't result in lost writes.
- UPS protection
BTW, UPS protection is something I always recommend (for all NAS). Often data corruption occurs with unexpected power loss.
Also if you have more disks in the NAS, there are some advanced RAID modes that can handle more than one failed disk. There is a price for that (both reduction in capacity and lower performance). And they don't help if the wrong data is on one or more disks. They only help if the disk can't be read.
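The capacity cost can be put in numbers. The sketch below assumes equal-sized drives and uses RAID-6 (two disks' worth of parity, so it survives two unreadable disks) as the common example of such a mode; whether a particular ReadyNAS model offers it is not covered here. The six 2TB drives are a hypothetical configuration.

```python
def usable_tb(drive_count: int, drive_tb: float, parity_disks: int) -> float:
    """Usable capacity of an array of equal drives with `parity_disks` of parity overhead."""
    if drive_count <= parity_disks:
        raise ValueError("need more drives than parity disks")
    return (drive_count - parity_disks) * drive_tb


# Hypothetical 6 x 2TB array.
for name, parity in (("RAID-5 (1 failed disk)", 1), ("RAID-6 (2 failed disks)", 2)):
    print(f"{name}: {usable_tb(6, 2.0, parity)} TB usable of {6 * 2.0} TB raw")
```

Each extra failure tolerated costs one more drive's worth of space, and neither level helps when a disk returns wrong data instead of an error.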
Newer OS-6 ReadyNAS (at all price points) do have some features that are relevant here. They have more scheduled maintenance functions that can detect issues sooner. They also use a newer file system called BTRFS, which supports built-in checksums that can detect corruption. That also gives those NAS some more sophisticated options for reconstruction.
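The idea behind filesystem checksums is that a checksum is stored alongside every block when it is written and re-checked on every read, so a silently changed block is caught instead of being trusted. The sketch below uses Python's `zlib.crc32` as a stand-in; BTRFS defaults to crc32c and its on-disk layout is far more involved. `write_block`, `verify` and `read_block` are hypothetical helpers.

```python
import zlib


def write_block(data: bytes) -> tuple[bytes, int]:
    """Store a block together with its checksum."""
    return data, zlib.crc32(data)


def verify(stored: tuple[bytes, int]) -> bool:
    """True if the block still matches the checksum recorded at write time."""
    data, checksum = stored
    return zlib.crc32(data) == checksum


def read_block(stored: tuple[bytes, int]) -> bytes:
    """Return the block's data, refusing to hand back data that fails verification."""
    if not verify(stored):
        raise OSError("checksum mismatch: block is corrupt")
    return stored[0]


good = write_block(b"holiday-photos.jpg chunk")
print(read_block(good))

# Simulate a disk that silently flips one bit after the write.
data, checksum = good
corrupt = (bytes([data[0] ^ 0x01]) + data[1:], checksum)
try:
    read_block(corrupt)
except OSError as err:
    print("detected:", err)
```

Once the bad copy is identified, the filesystem can rebuild it from redundancy rather than propagating it, which is the "more sophisticated reconstruction" mentioned above.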
But for all storage (enterprise and home) the primary defense against data loss/corruption is to have independent backups - full copies of the data on other devices.
rabidh · Apr 24, 2018 · Aspirant
Thanks - and you're totally right about the load issues, as it seems that one of the other disks just reported SMART errors as well - so that won't have helped the reconstruction either.
I do have a UPS, as well as a separate computer running scheduled rsync backups, and different makes and models of hard disk in the NAS to try and avoid 2 disks going at the same time - so it's still frustrating to have lost data. I guess I should have invested in more storage and rsynced *everything*, not just the super important data.
It sounds like OS6 with BTRFS and scheduled checks is a real improvement. It's just a shame older Netgear devices aren't kept updated - if there had been scheduled checks (or alert emails via gmail hadn't silently stopped working) then this most likely could have been avoided.
rabidh · Apr 26, 2018 · Aspirant
I'm posting again here as it looks like the new post I started on this got locked/hidden/deleted somehow? https://community.netgear.com/t5/Using-your-ReadyNAS/quot-Status-Spare-Inactive-quot-on-previously-ok-drive/td-p/1558637
After rebooting, the ReadyNAS comes up with all 3 drives showing 'Ok', and with no filesystem errors when a check is run. I'm able to copy data off just fine.
However, some files (I'm not sure which ones) cause the ReadyNAS to drop one of the 3 drives - turning it from "Ok" to "Spare Inactive". No alerts are created in the console. After that, rsync fails on the majority of files with an input/output error - however a restart of the NAS shows no filesystem errors and everything starts working again.
Is there anything I can do to avoid this and get the ReadyNAS to keep all 3 drives in the array all the time while I copy the files off? Which log files should I look at to find why it's dropped the volume?
I have shell access to it and I'm a long-time Linux user (10+ years), so if this is a timeout or some setting that could be modified I'm happy to dig in.
What are my options here apart from ReclaiMe? As previously stated NetGear's support options (paid or not) are not available to me for some reason.
What if I copied all the partitions off all the drives to a new hard disk on my Linux PC (with dd conv=sync,noerror)? Is there enough metadata that mdadm could reconstruct the volumes automatically, or could I get the information needed to reconstruct from the ReadyNAS somehow?
Looking at mdstat, I have several different RAID volumes using multiple partitions (I guess due to X-RAID2), so it's not going to be a matter of just setting up a single RAID5 array from them. It looks like I'd have to link the 2 RAID5 arrays somehow (are they all just concatenated together?).
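`/proc/mdstat` lists each md array with its state, RAID level and member partitions (with `(F)` for failed and `(S)` for spare members), so a small parser makes it easier to see which partition on each disk belongs to which array. This is a sketch under assumptions: `SAMPLE` is hypothetical text in the usual mdstat layout, not the output from this NAS, and `parse_mdstat` is an illustrative helper.

```python
import re

# "sda3[0]" or "sdc4[2](F)": device name, slot number, optional status flag.
MEMBER = re.compile(r"(\w+)\[(\d+)\](\([A-Z]\))?")

SAMPLE = """\
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid1 sda1[0] sdb1[1] sdc1[2]
      4194292 blocks super 1.2 [3/3] [UUU]

md2 : active raid5 sda3[0] sdb3[1] sdc3[2]
      3897997824 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md3 : active raid5 sda4[0] sdc4[2](F) sdb4[1]
      1953511936 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""


def parse_mdstat(text: str) -> dict[str, dict]:
    """Map each mdN array to its state, level and member partitions."""
    arrays = {}
    for line in text.splitlines():
        m = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if not m:
            continue  # skip "Personalities", block-count and blank lines
        name, rest = m.groups()
        tokens = rest.split()
        level = next((t for t in tokens[1:] if t.startswith("raid") or t == "linear"), None)
        members = [
            {"device": dev, "slot": int(slot), "flag": flag.strip("()") if flag else None}
            for dev, slot, flag in MEMBER.findall(rest)
        ]
        arrays[name] = {"state": tokens[0], "level": level, "members": members}
    return arrays


for name, info in parse_mdstat(SAMPLE).items():
    devices = ", ".join(
        m["device"] + (f" ({m['flag']})" if m["flag"] else "") for m in info["members"]
    )
    print(f"{name}: {info['state']} {info['level']} -> {devices}")
```

On the NAS itself, `open("/proc/mdstat").read()` would supply the real text.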
- StephenB · Apr 26, 2018 · Guru - Experienced User
The safest thing to do is to clone all three drives to new ones. Then power down the NAS, insert the three clones, and power up. Then your original disks remain completely intact (no chance of more issues).
You could alternatively just clone the drive that is dropping out. Then power down, swap the problem drive with the clone, and power up read-only using the boot menu.
Have you looked at the SMART stats on the drive that is dropping out? There should be something in the logs related to the drive health, mdadm issues, or btrfs issues.
- rabidh · Apr 26, 2018 · Aspirant
Perfect - thanks! So if I download the logs using the admin menu, I'm looking in system.log?
I've got some new 6TB drives arriving tomorrow for a new NAS. While they're not the same size (or supported above 4TB on the NV+ v2?), would it work if I just copied the old drive's contents onto the first 2TB of a new one, or do I need brand new 2TB drives?
While I don't have enough drives (at the moment) to replace all the drives with new, I'll image all 3 drives into existing storage so if it all goes wrong I can recover :)
There was one SMART alert on the drive that's having issues right after the initial sync (Reallocated sector count from 8 to 9), but nothing would appear to have been getting too much worse.
However:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   165   159   021    Pre-fail  Always       -       8725
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6035
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   018   018   000    Old_age   Always       -       60409
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       6633147
194 Temperature_Celsius     0x0022   107   101   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       4
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       6
UDMA_CRC_Error_Count looks pretty disastrous? Does UDMA imply a problem with the SATA link itself rather than the disk, though?
Even setting everything else aside, Power_On_Hours is pretty high.
- rabidh · Apr 26, 2018 · Aspirant
Thanks! I'll give that a go. While I don't have enough drives to replace them all, I will save disk images of all of them so I can back them up.
I only had one alert reported about the reallocated sector count rising from 8 to 9. Looking at the drive stats it doesn't look too bad (apart from the almost 7 years of continuous running):
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   165   159   021    Pre-fail  Always       -       8725
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6035
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   018   018   000    Old_age   Always       -       60409
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       6633147
194 Temperature_Celsius     0x0022   107   101   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       4
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       6
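Only the RAW_VALUE column of a few attributes matters for the discussion here. The sketch below pulls those values out of `smartctl -A`-style rows (a subset copied from the output above) and converts Power_On_Hours into years, which confirms the "almost 7 years". It assumes the 10-column layout shown; `parse_smart_attributes` is a hypothetical helper, and rows whose RAW_VALUE doesn't start with a plain integer are skipped.

```python
# Subset of the rows posted above.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   018   018   000    Old_age   Always       -       60409
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       4
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
"""


def parse_smart_attributes(text: str) -> dict[str, int]:
    """Map ATTRIBUTE_NAME -> RAW_VALUE for numeric attribute rows."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric attribute ID.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                raw[fields[1]] = int(fields[9])
            except ValueError:
                pass  # non-integer raw value; ignored in this sketch
    return raw


attrs = parse_smart_attributes(SMART_ROWS)
years = attrs["Power_On_Hours"] / (24 * 365.25)
print(f"Power on: {attrs['Power_On_Hours']} h = {years:.1f} years")
for name in ("Reallocated_Sector_Ct", "Current_Pending_Sector",
             "Offline_Uncorrectable", "UDMA_CRC_Error_Count"):
    print(f"{name}: {attrs[name]}")
```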
However UDMA_CRC_Error_Count looks huge. I had a read around this and people are saying it's to do with the SATA link. Obviously there aren't any cables involved, so apart from re-seating the drive in the connector with the NAS off (which I've done), is there anything I can do?
Would swapping the drive to a new bay of the NAS confuse it? Of course, it may be the SATA controller on the drive itself that's dead/dying.
- StephenB · Apr 26, 2018 · Guru - Experienced User
rabidh wrote:
There was one SMART alert on the drive that's having issues right after the initial sync (Reallocated sector count from 8 to 9), but nothing would appear to have been getting too much worse.
Reallocated Sectors: 8
Pending Sectors: 9
Uncorrectable Errors: 4
Not horribly broken, but not great either.
rabidh wrote:
UDMA_CRC_Error_Count looks pretty disastrous? Does UDMA imply a problem with the SATA link itself rather than the disk, though?
They are errors detected on the SATA link by the drive. So potential causes are the SATA backplane/connections, the NAS SATA interface electronics, and the drive's SATA interface electronics. Are the counts rising?
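"Are the counts rising?" is easiest to answer by saving the attribute table at two different times (for example the output of `smartctl -A /dev/sdX`, where `sdX` is a placeholder) and comparing raw values. In this sketch the `before` rows come from the output posted above, while the `after` rows are hypothetical later readings; `parse_raw` and `rising` are illustrative helpers.

```python
def parse_raw(text: str) -> dict[str, int]:
    """Map ATTRIBUTE_NAME -> integer RAW_VALUE from smartctl-style attribute rows."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            raw[fields[1]] = int(fields[9])
    return raw


def rising(before: str, after: str) -> dict[str, tuple[int, int]]:
    """Attributes whose raw value increased between two snapshots: name -> (old, new)."""
    old, new = parse_raw(before), parse_raw(after)
    return {name: (old[name], new[name])
            for name in sorted(old.keys() & new.keys()) if new[name] > old[name]}


before = """\
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
"""
# Hypothetical later snapshot: CRC count unchanged, one more reallocated sector.
after = """\
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       9
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
"""

print(rising(before, after))
```

A CRC count that stays flat suggests the link errors are historical rather than ongoing.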
- StephenB · Apr 26, 2018 · Guru - Experienced User
rabidh wrote:
While I don't have enough drives (at the moment) to replace all the drives with new, I'll image all 3 drives into existing storage so if it all goes wrong I can recover :)
That will also work. Just make sure the imaging does full sector-by-sector copying of everything on the disks.
For better or worse, the image/clone won't identify which sectors weren't properly copied. So there can be some corruption when you use the clone, since RAID can't tell which sectors it needs to reconstruct.
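One way around that limitation is to keep your own record of which blocks failed while imaging. The sketch below copies block by block, zero-fills unreadable blocks the way `dd conv=sync,noerror` does (so later data stays at the right offset), and returns the failed offsets. Tools such as GNU ddrescue keep a similar map. `clone`, `FlakyDisk` and the 4096-byte block size are assumptions for illustration. As with dd, one bad sector zeroes its whole block, so smaller blocks lose less data.

```python
import io
from typing import BinaryIO

BLOCK = 4096  # bytes per read; plays the same role as dd's bs=


def clone(src: BinaryIO, dst: BinaryIO, size: int, block: int = BLOCK) -> list[int]:
    """Copy `size` bytes block by block; zero-fill unreadable blocks and return their offsets."""
    bad_offsets = []
    for offset in range(0, size, block):
        length = min(block, size - offset)
        src.seek(offset)
        try:
            data = src.read(length)
        except OSError:
            data = b""
            bad_offsets.append(offset)
        # Pad failed or short reads with zeros so every block lands at its original offset.
        dst.seek(offset)
        dst.write(data.ljust(length, b"\0"))
    return bad_offsets


class FlakyDisk(io.BytesIO):
    """In-memory stand-in for a failing drive: reads starting at a `bad` offset raise OSError."""

    def __init__(self, data: bytes, bad: set[int]):
        super().__init__(data)
        self.bad = bad

    def read(self, n=-1):
        if self.tell() in self.bad:
            raise OSError("simulated read error")
        return super().read(n)


source = FlakyDisk(bytes(range(256)) * 64, bad={4096})  # 16 KiB, one bad block
target = io.BytesIO()
bad = clone(source, target, len(source.getvalue()))
print(bad)  # [4096]
```

Mapping such offsets back to files is what would tell you which of the "maybe 7 files" are actually damaged.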
- StephenB · Apr 26, 2018 · Guru - Experienced User
rabidh wrote:
UDMA_CRC_Error_Count looks pretty disastrous? Does UDMA imply a problem with the SATA link itself rather than the disk, though?
They are errors detected on the SATA link by the drive. So potential causes are the SATA backplane/connections, the NAS SATA interface electronics, and the drive's SATA interface electronics. Are the counts rising?
Just wanted to add that this could explain the dropout of the drive - the NAS disk drivers might be declaring the interface dead.
You could also try powering down and moving the drive to a different bay. If it's the SATA link (and not the drive), the array might stay up. It's still best to boot the system in read-only mode.
- rabidh · Apr 29, 2018 · Aspirant
Just an update on this...
The UDMA error count hasn't gone up, so it seems that was a bit of a red herring.
However, I took that drive out and plugged it into my PC, then used `dd` with `conv=sync,noerror` and cloned it onto the 2TB drive that I'd used originally when the whole thing stopped working (I backed up *all* the drives onto a 6TB drive just in case). I got 7 IO errors from the drive I was reading, but that was it - the copy sailed through.
I put the cloned drive in, turned it on, and it now works great. I'm sure those 7 IO errors mean maybe 7 files are slightly corrupt, but that's a hell of a lot better than 2TB of lost data...
So, it looks like:
- I had a legit failure of the 512GB disk, and at the same time one 2TB drive was silently a little flaky
- When I swapped the 512GB disk out with the 2TB one, the ReadyNAS had an IO error and just freaked out, refusing to set it up as part of the volume and also dropping the 2TB disk from the array!
- I then rebooted and all the drives came back, but as soon as I started to copy I'd hit one of those bad sectors on the disk, get an IO error, and the ReadyNAS would drop the entire volume until I rebooted again.
So yeah, not impressed with ReadyNAS on this. I can understand dropping a volume due to IO errors when you're in a redundant array, but doing so when in an unprotected array *and sending no alert messages about it at all* seems like a really bad choice. The lack of official support from Netgear when the solution was so simple was a bit of an eye-opener too.
After two different ReadyNAS and 7 years of ownership I received a Synology NAS yesterday. While the build quality isn't as good as the ReadyNAS I'm blown away by the software (and the speed!) - I'm a total convert.
I'll still be keeping separate backups though :)
- StephenB · Apr 30, 2018 · Guru - Experienced User
FWIW, both vendors use the same Linux tools to build their RAID arrays (mdadm), so the response to a disk error would likely be identical with your Synology.