Disk fail in X-RAID2, after sync half my files are gone!

Aspirant

Apr 26, 2018

I'm posting again here as it looks like the new post I started on this got locked/hidden/deleted somehow? https://community.netgear.com/t5/Using-your-ReadyNAS/quot-Status-Spare-Inactive-quot-on-previously-ok-drive/td-p/1558637

After rebooting the ReadyNAS comes up with all 3 drives showing 'Ok', and with no filesystem errors when a check is run. I'm able to copy data off just great.

However, some files (I'm not sure which ones) cause the ReadyNAS to drop one of the 3 drives - turning it from "Ok" to "Spare Inactive". No alerts are created in the console. After that, rsync fails on the majority of files with an input/output error - however a restart of the NAS shows no filesystem errors and everything starts working again.

Is there anything I can do to avoid this and get the ReadyNAS to keep all 3 drives in the array all the time while I copy the files off? Which log files should I look at to find why it's dropped the volume?

I have shell access to it and I'm a long-time Linux user (10+ years), so if this is a timeout or some setting that could be modified I'm happy to dig in.

What are my options here apart from ReclaiMe? As previously stated NetGear's support options (paid or not) are not available to me for some reason.

What if I copied all the partitions off all the drives to a new hard disk on my Linux PC (with dd conv=sync,noerror)? Is there enough metadata that mdadm could reconstruct the volumes automatically, or could I get the information needed to reconstruct from the ReadyNAS somehow?

Looking at mdstat I have several different RAID volumes using multiple partitions (I guess due to X-RAID2), so it's not going to be a matter of just setting up a single RAID5 array from them. It looks like I'd have to link the 2 RAID5 arrays somehow (are they all just concatenated together?).
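
To get an overview of which partitions belong to which md array before imaging anything, the contents of /proc/mdstat can be summarised with a few lines of Python. This is only a rough sketch: the `SAMPLE` text below is a made-up illustration (not the output from this NAS), and `parse_mdstat` is a hypothetical helper name. It lists the arrays and their members; it does not say how the arrays are joined into one volume.

```python
import re

# Hypothetical /proc/mdstat contents, for illustration only.
SAMPLE = """\
Personalities : [raid1] [raid5]
md3 : active raid5 sda4[0] sdc4[2] sdb4[1]
      2930012160 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md2 : active raid5 sda3[0] sdc3[2] sdb3[1]
      972040192 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: {"state", "level", "members"}} from /proc/mdstat text."""
    arrays = {}
    for line in text.splitlines():
        # Array lines look like "md2 : active raid5 sda3[0] sdb3[1] ..."
        m = re.match(r"^(md\d+)\s*:\s*(\S+)\s+(.*)$", line)
        if not m:
            continue
        name, state, rest = m.groups()
        tokens = rest.split()
        # Member tokens carry a role index in brackets, e.g. "sda3[0]" or "sdb3[1](F)".
        members = sorted(t.split("[")[0] for t in tokens if "[" in t)
        # The first remaining token that is not a "(...)" flag is the RAID level, if any.
        level = next(
            (t for t in tokens if "[" not in t and not t.startswith("(")), None
        )
        arrays[name] = {"state": state, "level": level, "members": members}
    return arrays


if __name__ == "__main__":
    for name, info in sorted(parse_mdstat(SAMPLE).items()):
        print(name, info["state"], info["level"], ", ".join(info["members"]))
```

On the NAS itself the input would come from reading /proc/mdstat over the shell session rather than from `SAMPLE`.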

StephenB
Guru - Experienced User
Apr 26, 2018
The safest thing to do is to clone all three drives to new ones. Then power down the NAS, insert the three clones, and power up. Then your original disks remain completely intact (no chance of more issues).

You could alternatively just clone the drive that is dropping out. Then power down, swap the problem drive with the clone, and power up read-only using the boot menu.

Have you looked at the SMART stats on the drive that is dropping out? There should be something in the logs related to the drive health, mdadm issues, or btrfs issues.

rabidh

Aspirant

Apr 26, 2018

Perfect - thanks! So if I download the logs using the admin menu, I'm looking in system.log?

I've got some new 6TB drives arriving tomorrow for a new NAS. While they're not the same size (or supported above 4TB on the NV+ v2?), would it work if I just copy the old drive's contents onto the first 2TB of the new one, or do I need brand new 2TB drives?

While I don't have enough drives (at the moment) to replace all the drives with new, I'll image all 3 drives into existing storage so if it all goes wrong I can recover :)

There was one SMART alert on the drive that's having issues right after the initial sync (Reallocated sector count from 8 to 9), but nothing would appear to have been getting too much worse.

However:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   165   159   021    Pre-fail  Always       -       8725
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6035
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   018   018   000    Old_age   Always       -       60409
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       6633147
194 Temperature_Celsius     0x0022   107   101   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       4
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       6

UDMA_CRC_Error_Count looks pretty disastrous? Does UDMA imply a problem with the SATA link itself rather than the disk though?

Obviously if there was nothing else, power_on_hours is pretty high.

rabidh

Aspirant

Apr 26, 2018

Thanks! I'll give that a go. While I don't have enough drives to replace them all, I will save disk images of all of them so I can back them up.

I only had one alert reported about the reallocated sector count rising from 8 to 9. Looking at the drive stats it doesn't look too bad (apart from the almost 7 years of continuous running):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   165   159   021    Pre-fail  Always       -       8725
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6035
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   018   018   000    Old_age   Always       -       60409
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       6633147
194 Temperature_Celsius     0x0022   107   101   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       4
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       6

However UDMA_CRC_Error_Count looks huge. I had a read around this and people are saying it's to do with the SATA link. Obviously there aren't any cables involved, so apart from re-seating the drive in the connector with the NAS off (which I've done), is there anything I can do?

Would swapping the drive to a new bay of the NAS confuse it? Of course it may be the SATA controller on the drive itself that's dead/dying.

StephenB
Guru - Experienced User
Apr 26, 2018
rabidh wrote:

There was one SMART alert on the drive that's having issues right after the initial sync (Reallocated sector count from 8 to 9), but nothing would appear to have been getting too much worse.

Reallocated Sectors: 9

Pending Sectors: 8

Uncorrectable Errors: 4

Not horribly broken, but not great either.

rabidh wrote:

UDMA_CRC_Error_Count looks pretty disastrous? Does UDMA imply a problem with the SATA link itself rather than the disk though?

They are errors detected on the SATA link by the drive. So potential causes are the SATA backplane/connections, the NAS sata interface electronics, and the drive's sata interface electronics. Are the counts rising?
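
One way to answer "are the counts rising?" is to save the SMART attribute table twice, some hours apart, and compare the raw values. The following is a minimal sketch, assuming the attribute rows have the ten-column layout shown in this thread; `parse_smart_attributes` and `rising` are hypothetical helper names, the `BEFORE` rows are taken from the table posted above, and the `AFTER` rows are invented purely to show the comparison.

```python
def parse_smart_attributes(text):
    """Map attribute name -> integer raw value from SMART attribute rows."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row ("ID# ATTRIBUTE_NAME ...") is skipped by the isdigit check.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                raw[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; ignore this row
    return raw


def rising(before, after):
    """Return {name: (old, new)} for attributes whose raw value increased."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and after[name] > before[name]
    }


# Rows copied from the table posted in this thread.
BEFORE = """\
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
"""

# Invented later snapshot, for illustration only.
AFTER = """\
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       12
199 UDMA_CRC_Error_Count    0x0032   200   191   000    Old_age   Always       -       349267
"""

if __name__ == "__main__":
    changes = rising(parse_smart_attributes(BEFORE), parse_smart_attributes(AFTER))
    for name, (old, new) in changes.items():
        print(f"{name}: {old} -> {new}")
```

A CRC count that stays flat while the pending sector count climbs would point more toward the disk surface than the SATA link.
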
StephenB
Guru - Experienced User
Apr 26, 2018
rabidh wrote:

While I don't have enough drives (at the moment) to replace all the drives with new, I'll image all 3 drives into existing storage so if it all goes wrong I can recover :)

That will also work. Just make sure the imaging does full sector-by-sector copying of everything on the disks.

For better or worse, the image/clone won't identify which sectors weren't properly copied. So there can be some corruption when you use the clone, since RAID can't tell which sectors it needs to reconstruct.
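
If knowing which parts of the image are suspect matters, the copy can record the offsets it failed to read. Below is a minimal sketch of that idea in Python, not a replacement for a proper imaging tool (GNU ddrescue, for example, keeps a map file of this kind). The function names, the 4096-byte block size, and the device path in the usage note are all assumptions for illustration.

```python
import os


def copy_with_error_map(read_block, total_size, out, block_size=4096):
    """Copy total_size bytes via read_block(offset, length) into out.

    Blocks that raise OSError or come back short are zero-padded, so the
    image keeps the same offsets, and their start offsets are returned.
    """
    bad_offsets = []
    offset = 0
    while offset < total_size:
        length = min(block_size, total_size - offset)
        try:
            data = read_block(offset, length)
        except OSError:
            data = b""  # unreadable block: nothing recovered
        if len(data) < length:
            bad_offsets.append(offset)
            data = data.ljust(length, b"\0")
        out.write(data)
        offset += length
    return bad_offsets


def image_device(src_path, dst_path, block_size=4096):
    """Image src_path to dst_path and return the offsets of unreadable blocks."""
    fd = os.open(src_path, os.O_RDONLY)
    try:
        total = os.lseek(fd, 0, os.SEEK_END)  # also works for block devices
        with open(dst_path, "wb") as out:
            return copy_with_error_map(
                lambda off, n: os.pread(fd, n, off), total, out, block_size
            )
    finally:
        os.close(fd)
```

A call such as `image_device("/dev/sdX", "disk.img")` (hypothetical paths, run as root on a Linux PC) would return the list of byte offsets that were zero-filled, which can later be mapped back to the affected files.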
StephenB
Guru - Experienced User
Apr 26, 2018

rabidh wrote:

UDMA_CRC_Error_Count looks pretty disastrous? Does UDMA imply a problem with the SATA link itself rather than the disk though?

They are errors detected on the SATA link by the drive. So potential causes are the SATA backplane/connections, the NAS sata interface electronics, and the drive's sata interface electronics. Are the counts rising?

Just wanted to add that this could explain the dropout of the drive - the NAS disk drivers might be declaring the interface dead.

You could also try powering down, and moving the drive to a different bay. If it is the SATA link (and not the drive) the array might stay up. Still best to boot the system in read-only mode.
rabidh
Aspirant
Apr 29, 2018
Just an update on this...

The UDMA error count hasn't gone up, so it seems that was a bit of a red herring.

However, I took that drive out and plugged it into my PC, then used `dd` with `conv=sync,noerror` and cloned it onto the 2TB drive that I'd used originally when the whole thing stopped working (I backed up *all* the drives onto a 6TB drive just in case). I got 7 IO errors from the drive I was reading, but that was it - the copy sailed through.

I put the cloned drive in, turned it on, and it now works great. I'm sure those 7 IO errors mean maybe 7 files are slightly corrupt, but that's a hell of a lot better than 2TB of lost data...

So, it looks like:

I had a legit failure of the 512GB disk, and at the same time one 2TB drive was silently a little flaky

When I swapped the 512GB disk out with the 2TB one, the ReadyNAS had an IO error and just freaked out, refusing to set it up as part of the volume and also dropping the 2TB disk from the array!

I then rebooted and all the drives came back, but as soon as I started to copy I'd hit one of those bad sectors on the disk, get an IO error, and the ReadyNAS would drop the entire volume until I rebooted again.

So yeah, not impressed with ReadyNAS on this. I can understand dropping a volume due to IO errors when you're in a redundant array, but doing so when in an unprotected array *and sending no alert messages about it at all* seems like a really bad choice. The lack of official support from Netgear when the solution was so simple was a bit of an eye-opener too.

After two different ReadyNAS and 7 years of ownership I received a Synology NAS yesterday. While the build quality isn't as good as the ReadyNAS I'm blown away by the software (and the speed!) - I'm a total convert.

I'll still be keeping separate backups though :)
StephenB
Guru - Experienced User
Apr 30, 2018
FWIW, both vendors use the same linux tools to build their RAID arrays (mdadm), so the response to a disk error would likely be identical with your Synology.
