Forum Discussion

Aspirant

Jul 27, 2016

Solved

NAS Slow, Reboot Slow, Drive Light Blinking

My ReadyNAS 516 has been unreasonably slow lately. The system is not sharing files properly, file access seems slow for both reading and writing, and even the web interface is slow. The drive a...


BtrieveBill
Jul 27, 2016
I would also have opted to replace Drive 6 if that had been an option. However, Drive 2 was the one blinking incessantly, and even though it had fewer errors, it was apparently the squeakiest wheel today. Further, the reboot NEVER finished. It hung at 94% for over 90 minutes.

I finally gave up on the reboot, powered down the ReadyNAS entirely a second time, replaced Drive 2, and rebooted. As advertised, it booted up in about 5 minutes, detected the degraded array, and immediately started the rebuild process. The system is now working substantially better, and even with the RAID rebuild running, it is delivering better performance than I was getting all this week. I can now send that drive back to WD, get the replacement, and swap out drive 6 later on. (Strangely, drive 6 was the only drive that had been replaced once before. After the new drive 6 was installed, it started spewing errors within about a week. This makes me wonder whether there is a problem with the SATA controller or cabling, and whether drive 6 is really OK.)

Lessons learned:
1) Don't assume that the system is working properly just because the web console shows all drives as green.
2) Don't assume that the drive with the most errors is the one with the biggest problem.
3) Ignore the data in the logs and just replace the drive that is blinking out of sync with everyone else.
4) Always have at least one spare drive on standby.

BtrieveBill

Aspirant

Aug 02, 2016

If a volume can't be mounted, I believe that the system should provide some sort of solution, or at least an explanation.

In the Status.LOG, I see only these messages:

[16/07/27 13:39:00 CDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
[16/07/27 13:39:51 CDT] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
[16/07/27 14:27:00 CDT] notice:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 2 (Internal) changed state from RESYNC to ONLINE.
[16/07/27 14:29:53 CDT] notice:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 2 (Internal) changed state from ONLINE to RESYNC.
[16/07/27 22:11:49 CDT] notice:volume:LOGMSG_RESILVERCOMPLETE_DEGRADED_VOLUME The resync operation finished on volume data. However, the volume is still degraded.
[16/07/28 01:00:08 CDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[16/07/28 08:54:02 CDT] notice:system:LOGMSG_SYSTEM_REBOOT The system is rebooting.
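For anyone trying to reconstruct a timeline from entries like these, here is a minimal Python sketch. It assumes every line follows the layout seen in this excerpt (bracketed timestamp with a timezone label, then level:category:CODE, then the message text); that layout is inferred from these seven lines, not from any NETGEAR documentation, and the function names are hypothetical. Run against the excerpt, it shows disk 2 flipping from RESYNC to ONLINE and back to RESYNC within three minutes, and the resync taking roughly 8.5 hours.

```python
import re
from datetime import datetime

# Layout inferred from the Status.LOG excerpt (an assumption, not a documented format):
#   [YY/MM/DD HH:MM:SS TZ] level:category:CODE message text
LINE_RE = re.compile(
    r"^\[(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+)\] (\w+):(\w+):(\w+) (.*)$"
)
# Wording of the disk state-change message as it appears in the excerpt.
DISK_RE = re.compile(r"Disk in channel (\d+) .*changed state from (\w+) to (\w+)")

# The seven lines quoted in the post above.
SAMPLE_LOG = """\
[16/07/27 13:39:00 CDT] info:system:LOGMSG_START_READYNASD ReadyNASOS background service started.
[16/07/27 13:39:51 CDT] notice:volume:LOGMSG_RESILVERSTARTED_VOLUME Resyncing started for Volume data.
[16/07/27 14:27:00 CDT] notice:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 2 (Internal) changed state from RESYNC to ONLINE.
[16/07/27 14:29:53 CDT] notice:disk:LOGMSG_ZFS_DISK_STATUS_CHANGED Disk in channel 2 (Internal) changed state from ONLINE to RESYNC.
[16/07/27 22:11:49 CDT] notice:volume:LOGMSG_RESILVERCOMPLETE_DEGRADED_VOLUME The resync operation finished on volume data. However, the volume is still degraded.
[16/07/28 01:00:08 CDT] warning:volume:LOGMSG_HEALTH_VOLUME_WARN Volume data is Degraded.
[16/07/28 08:54:02 CDT] notice:system:LOGMSG_SYSTEM_REBOOT The system is rebooting.
"""


def parse_status_log(text):
    """Return (timestamp, level, category, code, message) tuples; non-matching lines are skipped."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            stamp, _tz, level, category, code, message = m.groups()
            # The timezone label (e.g. CDT) is ignored; timestamps stay naive.
            ts = datetime.strptime(stamp, "%y/%m/%d %H:%M:%S")
            events.append((ts, level, category, code, message))
    return events


def disk_transitions(events):
    """Extract (timestamp, channel, old_state, new_state) from disk state-change events."""
    out = []
    for ts, _level, _category, code, message in events:
        m = DISK_RE.search(message)
        if code == "LOGMSG_ZFS_DISK_STATUS_CHANGED" and m:
            out.append((ts, int(m.group(1)), m.group(2), m.group(3)))
    return out


def resync_duration(events):
    """Time from the first resync-start to the first resync-complete event, or None if either is missing."""
    start = next((e[0] for e in events if "RESILVERSTARTED" in e[3]), None)
    end = next((e[0] for e in events if "RESILVERCOMPLETE" in e[3]), None)
    if start is None or end is None:
        return None
    return end - start


if __name__ == "__main__":
    events = parse_status_log(SAMPLE_LOG)
    for ts, channel, old, new in disk_transitions(events):
        print(f"{ts:%Y-%m-%d %H:%M:%S}  disk {channel}: {old} -> {new}")
    print("resync took", resync_duration(events))
```

Because the parser only looks for message codes and wording seen here, lines in other formats are silently skipped rather than guessed at.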

The last message was logged after the volume was already gone; it was my attempt to bring it back to life. One warning at 1 a.m. that the volume was degraded (even though it should be able to survive two drive failures), and then kaput. That is my issue.

Most other log files don't go back further than the volume re-creation, so figuring out what happened is likely impossible by now.

Again, my point is simple: I have experience with RAID arrays going back to the Compaq IDA, running eight drives at a whopping 200MB each. Over the years, I've worked with MegaRAID controllers, Adaptec controllers, Dell controllers, and more. None are free of issues, of course. However, I have never seen any of them simply wipe out a volume like this with no other indication. (Well, I did once have a Dell support tech replace the wrong drive in a degraded RAID5 array; that was catastrophic, of course.) Had a second (and third) drive failed, I would have been more forgiving. However, to have a system accept a new drive, rebuild the volume to the point where everything is fine and you breathe a sigh of relief, and then wipe it all out again with nary a blip? Shame, shame.

omicron_persei8

Luminary

Aug 02, 2016

BtrieveBill wrote:
However, to have a system accept a new drive, rebuild the volume to the point where everything is fine and you breathe a sigh of relief, and then wipe it all out again with nary a blip? Shame, shame.

About that part, I believe what happened is that a second HDD failed while the first HDD was resyncing. So there isn't much else that could have been done (except perhaps cloning the second, dying HDD before resyncing).

The resync process can be very stressful for all the HDDs in the RAID5 array. It is actually more common than is generally thought for a drive to fail while another one is resyncing.

I agree with the rest: the logs and the GUI would benefit from better ways of displaying and explaining what's going on. Unless you're very familiar with the system, there is little chance you can truly understand what's happening, let alone what the root cause is.

StephenB
Guru - Experienced User
Aug 02, 2016
omicron_persei8 wrote:

About that part, I believe what happened is that a second HDD failed while a first HDD was resyncing.

Possibly (and disk 6 was also showing some errors). But the log shows the resync as completing (though the volume was still degraded). If a second HDD had failed during the resync, the resync itself should have failed.

It'd be useful to get a clearer picture of what happened.