Re: Replacing failed disk in RN426

joey123 · ‎2018-08-15

One day I rebooted my RN426, and one of my disks (disk 4, Seagate 3TB) was no longer recognized, so the volume was marked as degraded. I rebooted again, and it came up, the volume resynced, and everything was OK.

Next time, disk 4 dropped and didn't come back on a reboot, so I bought a new HD (WD Red 8TB) to replace it. I popped in the disk, and... nothing happened. The blue light comes on, indicating that the disk is in there, and it appears to be powered on (it's warm if I pull it out), but it doesn't show up on the volume page, not even as a disk, let alone as part of the array. It looks like the NAS doesn't recognize the disk at all. I tried a few different disks, in slots 4 and 5, with the same result. The new disk looks fine if I pop it into an external enclosure and bring it up on my Mac, but so does the old one.

What is going on here? Is it possible that I have some sort of hardware failure in my NAS, or perhaps there's a magic incantation I forgot to do?

mdgm-ntgr · ‎2018-08-15

Hardware failure is a possibility, but there are other possibilities.

Please send in your logs (see the Sending Logs link in my sig).

Please also go to System > Settings > Support, enable Secure Diagnostic Mode, and PM (Private Message) me the 5-digit number you then get, or include it in your email.

Do you see SMART errors for the old disk 4 in smart_history.log?

joey123 · ‎2018-08-15

Logs sent.

The smart status looks a bit odd.

time                 model                 serial           realloc_sect  realloc_evnt  spin_retry_cnt  ioedc  cmd_timeouts  pending_sect  uncorrectable_err  ata_errors
-------------------  --------------------  ---------------  ------------  ------------  --------------  -----  ------------  ------------  -----------------  ----------
2015-02-21 19:03:53  WDC WD20EARX-32PASB0  WD-XXXXXXXXXXXX            -1            -1              -1     -1            -1            -1                 -1           0
2015-02-21 19:03:53  ST3000DM001-1ER166    XXXXXRK                    -1            -1              -1     -1            -1            -1                 -1           0
2015-02-21 19:03:53  ST3000DM001-1ER166    XXXXXPV                    -1            -1              -1     -1            -1            -1                 -1           0
2015-02-21 19:03:53  ST3000DM001-1ER166    XXXXXXV                    -1            -1              -1     -1            -1            -1                 -1           0
2015-02-21 19:05:41  WDC WD20EARX-32PASB0  WD-XXXXXXXXXXXX             0             0               0     -1            -1             0                  0           0
2015-02-21 19:05:41  ST3000DM001-1ER166    XXXXXXRK                    0             0               0      0             0             0                  0           0
2015-02-21 19:05:41  ST3000DM001-1ER166    XXXXXXPV                    0             0               0      0             0             0                  0           0
2015-02-21 19:05:41  ST3000DM001-1ER166    XXXXXXXV                    0             0               0      0             0             0                  0           0
2017-02-22 19:34:04  WDC WD80EFZX-68UW8N0  XXXXXX4Y                    0             0               0     -1            -1             0                  0           0
2017-09-10 03:38:24  ST3000DM001-1ER166    XXXXXXPV                    0             0               0      0             0             8                  8           0
2017-10-04 18:03:51  ST3000DM001-1ER166    XXXXXXPV                    0             0               0      0             0             0                  0           2

However, the disk that failed is W500F7XV, and that's not the one that has errors reported for it. Disk *PV is still in there, and I've definitely scrubbed multiple times since 2017-10-04 without uncovering any data corruption. Were its errors perhaps in sectors that didn't hold any data, or did the NAS manage to recover them silently from the other 3 drives at the time?

mdgm-ntgr · ‎2018-08-15

Your 3TB disks remaining in the NAS have been powered on for about 3.25 years. The ST3000DM001 has notoriously high failure rates when used in RAID arrays.

Your 8TB WD Red is a much better choice.

A few ATA errors in and of themselves may not suggest that a disk is failing just yet, but if the count continues to increase rapidly, or increases a lot, that would suggest a problem.
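For anyone who wants to watch for that kind of increase, here is a rough Python sketch that reads rows in the smart_history.log layout posted above and reports each time a counter goes up for a given serial. It assumes the last eight fields are the counters in the order of the header, that the serial is the field just before them, and that -1 means "not reported"; none of that is confirmed by NETGEAR documentation, and the function names are made up for illustration.

```python
from collections import defaultdict

# Counter columns, in the order shown in the posted log header.
COUNTERS = ["realloc_sect", "realloc_evnt", "spin_retry_cnt", "ioedc",
            "cmd_timeouts", "pending_sect", "uncorrectable_err", "ata_errors"]


def parse_row(line):
    """Split one data row into (timestamp, model, serial, counters dict).

    The model name may contain spaces (e.g. "WDC WD20EARX-32PASB0"),
    so fields are taken from both ends of the row.
    """
    tokens = line.split()
    values = [int(t) for t in tokens[-len(COUNTERS):]]
    serial = tokens[-len(COUNTERS) - 1]
    model = " ".join(tokens[2:-len(COUNTERS) - 1])
    timestamp = " ".join(tokens[:2])
    return timestamp, model, serial, dict(zip(COUNTERS, values))


def rising_counters(lines):
    """Return (timestamp, serial, counter, old, new) for every increase."""
    last = defaultdict(dict)  # serial -> {counter: last reported value}
    events = []
    for line in lines:
        tokens = line.split()
        # Skip the header, the dashed separator, and blank lines.
        if len(tokens) < len(COUNTERS) + 4 or not tokens[0][:1].isdigit():
            continue
        timestamp, _model, serial, counters = parse_row(line)
        for name, value in counters.items():
            if value < 0:  # assumption: -1 means the value was not reported
                continue
            previous = last[serial].get(name)
            if previous is not None and value > previous:
                events.append((timestamp, serial, name, previous, value))
            last[serial][name] = value
    return events


SAMPLE = [
    "time model serial realloc_sect realloc_evnt spin_retry_cnt ioedc "
    "cmd_timeouts pending_sect uncorrectable_err ata_errors",
    "2015-02-21 19:05:41 WDC WD20EARX-32PASB0 WD-XXXXXXXXXXXX 0 0 0 -1 -1 0 0 0",
    "2015-02-21 19:05:41 ST3000DM001-1ER166 XXXXXXPV 0 0 0 0 0 0 0 0",
    "2017-09-10 03:38:24 ST3000DM001-1ER166 XXXXXXPV 0 0 0 0 0 8 8 0",
    "2017-10-04 18:03:51 ST3000DM001-1ER166 XXXXXXPV 0 0 0 0 0 0 0 2",
]

if __name__ == "__main__":
    for event in rising_counters(SAMPLE):
        print(event)
```

A drop back to zero (as with the pending sectors on the *PV disk) is not reported, only increases, so a later rise from the new baseline would still show up.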

joey123 · ‎2018-08-15

Yes, definitely true. If I can now just get it to recognize the new Red, I'll be able to slowly swap out the Seagates.

Also, something that didn't seem relevant, but I might as well put out there.

Not long before this, I had some general weirdness about not being able to remove an app (Resilio Sync), so I did what another forum post suggested (force a reinstall of the OS by "upgrading" to the existing version, 3.9.3, using a downloaded zip archive), and it cleaned that much up at least. After that I was able to uninstall and upgrade the app with no problem. However, I think it was at a reboot near that time that disk 4 dropped for good.

I've also noticed that disk 4 tended to drop on restarts but not on shutdown-reboot cycles. Something to do with spin-up time, perhaps? The sample size is small though, about 3 or so, so I'm not sure this is actually a real effect.

mdgm-ntgr · ‎2018-08-15

Interesting. I see now this array was moved across from an ARM box.

Not sure if some of the weirdness you saw with Resilio Sync may be related to that. Did you uninstall all apps before moving your disks across to the RN426?

The disk dropping typically on restarts but not on shutdown-reboots is interesting. Though as you say 3 attempts is a small sample size.

joey123 · ‎2018-08-15

If I recall correctly, I had to wipe all the apps after moving the array across, since the old apps were ARM builds and didn't work on this machine. I didn't notice much weirdness other than that, but you could be right that the continuing strange issues around app installs and uninstalls could be related.

One thing is that lsblk doesn't turn up the new disk either. And it doesn't matter which slot I put it in (4, 5, or 6). I guess I could pull all the disks and plop it in slot 1 to see if the machine sees it at all, but that's getting pretty invasive. Also not completely sure lsblk should be able to see unformatted and unmounted disks. Internally, is the machine partitioned into bays 1,2,3 and bays 4,5,6? Might explain it if 1 of the two controllers went bad, or a loose connector, or something like that. However, the disks do power on, so they're not totally disconnected.
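On the lsblk question: as far as I know, lsblk builds its list from the kernel's sysfs entries, so a disk the kernel has detected should appear even when it is unpartitioned and unmounted. If it is missing there, the kernel most likely never saw it. A minimal Python sketch of the same check, reading /sys/block directly (the function name is made up, and filtering on the "sd" prefix is an assumption about how the bays' disks are named):

```python
import os


def list_block_devices(sys_block="/sys/block"):
    """Return {device_name: size_in_bytes} for block devices known to the kernel."""
    devices = {}
    for name in sorted(os.listdir(sys_block)):
        size_path = os.path.join(sys_block, name, "size")
        try:
            with open(size_path) as handle:
                sectors = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # entry without a readable size; skip it
        devices[name] = sectors * 512  # sysfs reports size in 512-byte units
    return devices


if __name__ == "__main__":
    for name, size in list_block_devices().items():
        if name.startswith("sd"):  # assumption: SATA disks show up as sdX
            print(f"{name}: {size / 1e12:.2f} TB")
```

Comparing the output before and after inserting the new disk shows whether a new sdX entry appears at all, independent of any partitioning or RAID state.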

joey123 · ‎2018-08-15

Regarding reboots vs. shutdowns, it's definitely true that in a cold start the disks have spun up long before the OS even thinks about loading, whereas in a reboot everything seems to happen much faster. My theory has always been that something about the vibration of spin-up causes one of the disks to bail. I would assume that's a red herring, but since it's always the same disk, I'm not so sure. It may point to a loose connector somewhere internally; if bays 4, 5, 6 are isolated from bays 1, 2, 3, that would make sense.

mdgm-ntgr · ‎2018-08-15

Have you checked the old Seagate disk using SeaTools and the new WD disk using WD Data LifeGuard Diagnostics?

I believe we use a couple of controllers on our 6-bay units, but I'm not sure how many drive bays are on each. If the disks are fine, then hardware failure, a loose connector, or something like that would be a possibility.

Don't open the NAS case up. We don't support doing that. If you suspect a hardware problem it would be best to open a support case. First you should check the disks are healthy.

StephenB · ‎2018-08-16

I had a similar problem getting my Pro-6 to recognize a replacement disk a few years ago. It turned out that the Linux disk driver had disabled the port when the original disk failed, likely a consequence of the way it failed. Powering down and then restarting worked in that case.

Did you try powering down the RN426 (perhaps removing the power cable for a couple of minutes), and then restarting? That might kick-start the system into recognizing the disk. The theory on removing the power cable is that removing all power will clear out any live BIOS state.
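If you have ssh access, the kernel log may show whether a port is in that state. Here is a rough Python sketch that picks the last reported SATA link state per port out of dmesg-style text. The message wording it matches ("ataN: SATA link up/down") is what libata typically prints, but it can vary by kernel version, and which ata number corresponds to which bay on the RN426 is something I don't know.

```python
import re

# Typical libata lines look like:
#   ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
#   ata4: SATA link down (SStatus 0 SControl 300)
LINK_RE = re.compile(r"\bata(\d+): SATA link (up|down)\b")


def last_link_state(log_text):
    """Return {port_number: "up" | "down"} using the last message seen per port."""
    states = {}
    for line in log_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            states[int(match.group(1))] = match.group(2)
    return states


# Made-up sample in the usual dmesg layout, for illustration only.
SAMPLE_LOG = """\
[    2.101234] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.105678] ata4: SATA link down (SStatus 0 SControl 300)
[    2.200000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""

if __name__ == "__main__":
    for port, state in sorted(last_link_state(SAMPLE_LOG).items()):
        print(f"ata{port}: link {state}")
```

To run it against the live log, feed it the output of dmesg (for example via subprocess.run(["dmesg"], capture_output=True, text=True).stdout), which may need root. A port that stays "down" with a known-good disk inserted would fit the disabled-port theory, though it would equally fit a hardware fault.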