
Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

MaxxMark
Luminary

[Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

For a while now, my NAS (6.5 in X-RAID-2) has been locking up after a random period of time, after which I need to do a hard power-off and restart to get it responsive again.

 

Unfortunately, I haven't yet been logged in around the time it actually happens, so I could not see what was running or whether something was wrong at that specific moment.

 

Based on the various instances I have reconstructed the following symptoms:

1. The ping times skyrocket (to 8000+ ms). The web interface and SSH are no longer accessible.

2. The NAS doesn't respond anymore to ping requests. 

3. Going to the NAS and pressing the power button shows a message that pressing it again will shut the NAS down. Doing so results in the display showing "Shutting down", after which nothing happens (I waited a few hours, but it didn't shut down).

4. The display becomes unresponsive as well (pressing the button will not turn the display on)

 

In all situations I have to press and hold the power button to completely shut it down and restart the NAS. In most cases the NAS boots normally when turned back on. Only in one instance (today) did the hard reboot trigger a (re)sync, probably due to some corruption, which is what finally prompted me to open a ticket. (Incidentally, a periodic scrub ran yesterday and completed without errors.)

 

What I have tried:

- Freeing up more space. It now has 1.2TB free of the 11TB (which is more than the recommended 10%)

- Performed various maintenance actions (Scrubbing and balancing)

- Turning off different services which were running on the NAS

- Going through the logs to find anything that could indicate a problem

- Searched the forum (came across someone with similar problems, but it seemed to indicate the router as the problem)

 

What I haven't tried:

- Upgrading to 6.6 (due to messages in the forum about volumes becoming unavailable, I'm reluctant to do so before securing backups)

 

Any help debugging/finding the problem would be appreciated!

Model: RNDP600E |ReadyNAS Pro Pioneer Edition|EOL
Message 1 of 71
StephenB
Guru

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

Probably you've done this, but just in case:  Have you looked at the disk health?  Perhaps download the logs and look in volume.log, smart_history.log.

Message 2 of 71
MaxxMark
Luminary

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

Thanks for replying, but, indeed, I have checked those.

 

(In fact, they triggered me looking further into it as the smart_history displayed anomalies. But it referred to a disk having errors which was replaced a few months ago)

Message 3 of 71
StephenB
Guru

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

What disks are you using?

 

You could try PMing @mdgm-ntgr and ask if he is willing to look at the logs.

Message 4 of 71
MaxxMark
Luminary

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

Since previous (long in the past) incidents regarding disks, I switched to WD Red disks only. So all are WD30EFRX.

 

I'll PM mdgm tomorrow (if he hasn't come across this topic before that)

Message 5 of 71
StephenB
Guru

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval


@MaxxMark wrote:

Since previous (long in the past) incidents regarding disks, I switched to WD Red disks only. So all are WD30EFRX.

 


That's what's in my pro also.  I use bigger WDC Red drives in my other ReadyNAS.

 

I was asking on the off-chance that you might have been using SMR drives - obviously not.

 

 

 

 

Message 6 of 71
mdgm-ntgr
NETGEAR Employee Retired

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

You could send in your logs (see the Sending Logs link in my sig)

Message 7 of 71
MaxxMark
Luminary

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

I have just sent them to you. 

 

Thanks in advance!

Message 8 of 71
mdgm-ntgr
NETGEAR Employee Retired

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

Thanks.

Looks like you've got a fair bit of stuff installed on your system including stuff you've installed via SSH

 

Also your data volume is getting quite full. 

Message 9 of 71
MaxxMark
Luminary

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

Yes, I do have stuff installed. I did, however, try disabling the apps one by one and gave each some time.

 

I can however try it again (just to make 100% sure it is not any of the installed apps).

 

The data volume is indeed getting full (though I still find that 1.2 TB of free space as a minimum feels like a bit of overkill, but I'm no expert on btrfs and its performance).

 

Any advice beside testing with/without the various apps installed?

Message 10 of 71
mdgm-ntgr
NETGEAR Employee Retired

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

To maintain optimal performance it would be ideal to have the volume usage down to perhaps about 80-85%. Volume maintenance should also be run from time to time.

Message 11 of 71
MaxxMark
Luminary

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

I have volume maintenance scheduled:

- Scrubbing is done once every 3 months

- Balancing is done weekly

 

80%-85% would mean free space of up to about 2.2TB (which feels a bit like a waste). I will free up more space to find out whether it has a positive effect.
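
The arithmetic behind these figures can be sketched as follows. This is only an illustration: the 11 TB capacity is the figure from the first post, and the helper name `free_space_tb` is made up for this example.

```python
def free_space_tb(capacity_tb, used_fraction):
    """Return the free space (in TB) left when used_fraction of the volume is in use."""
    return capacity_tb * (1.0 - used_fraction)


capacity_tb = 11.0  # total volume size mentioned in the first post

# Compare the suggested 80-85% usage with the 90% ("10% free") rule of thumb.
for used in (0.80, 0.85, 0.90):
    print(f"{used:.0%} used -> {free_space_tb(capacity_tb, used):.2f} TB free")
```

On an 11 TB volume this works out to roughly 2.2 TB free at 80% usage, 1.65 TB at 85%, and 1.1 TB at 90%.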

Message 12 of 71
MaxxMark
Luminary

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

I have conducted a lot more testing and monitoring to find out where the issue is coming from. However, the problem remains.

 

What I have done (mostly in the last week):

- Removed/disabled all services running on the NAS to prevent extreme loads on the NAS

- Installed zabbix-agent to log information about system health, so I can trace back how much the machine was doing at the time of the issues

- Freed up space to have between 15% and 20% free disk space (1.7TB)

- Restricted access to the NAS to the ReadyNAS-provided services, which are:

 1) Access via NFS

 2) Access via iSCSI (2 Raspberry Pis use iSCSI targets on the NAS)

 

First, some screen captures from situations I encountered in the last month or so. These were typical situations, with everything installed and running.

 

 

 

 

This was a fairly typical scenario. There was nothing to indicate anything being wrong: no high load, no high CPU utilization, etc. It just stopped responding out of the blue (side note: ping responses aren't shown because they were not monitored yet).

 

 

 

 

 

Same scenario: nothing out of the ordinary until it stopped responding. This one includes ping responses from the internal network.

 

 

Last week I decided to disable every app or service I had running, even though they were not in use. These included apps like SabNZBD, SickRage and uTorrent, as well as mysqld.

 

This resulted in some decrease in CPU jumps and CPU load (although to me a load reduction from about 0.1 to 0.01 seems insignificant):

 

 

 

After this ran for about 48 hours, I decided to use only iSCSI and NFS again for accessing the NAS, which ultimately led to a freeze again today:

 

 

A few things are interesting in this picture:

1) the ping responses (2 peaks which are a couple of minutes). They could however be anomalies of the monitoring tool 

2) the insignificance of what was happening at the time of it going down (insignificance in relation to: network throughput, load on the machine, open connections)

 

I'm really at a loss here. I'm considering that there may even be a hardware failure (which wouldn't be a stretch given the age of the unit). I have a few things I'd like to try:

1. Using sar to log second by second what is happening; maybe whatever happens does so too abruptly to be noticed in the 30 to 60 second intervals used by Zabbix (I turned this on just now, and hope for the best)

2. Doing a full re-install of the NAS once more
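
As an alternative or complement to sar, a second-by-second record can also be kept with a few lines of Python. The following is a minimal sketch, not what was actually run on this NAS: it assumes a Linux system with `/proc/loadavg`, and the function names and the log path in the usage comment are made up for illustration. Each sample is flushed and synced to disk so that the last line written before a hang has a chance of surviving the hard power-off.

```python
import os
import time
from datetime import datetime


def format_sample(timestamp, loadavg_text):
    """Build one log line: ISO timestamp followed by the raw /proc/loadavg content."""
    return f"{timestamp.isoformat(timespec='seconds')} {loadavg_text.strip()}\n"


def log_loadavg(path, interval=1.0, samples=None):
    """Append one timestamped sample per interval; samples=None runs until interrupted."""
    count = 0
    with open(path, "a") as log:
        while samples is None or count < samples:
            with open("/proc/loadavg") as source:
                log.write(format_sample(datetime.now(), source.read()))
            log.flush()
            os.fsync(log.fileno())  # force the sample to disk before sleeping
            count += 1
            time.sleep(interval)


# Example (hypothetical path): log_loadavg("/data/loadavg.log")
```

Because every line carries its own timestamp, the moment logging stops can later be compared directly with the point where Zabbix lost contact.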

 

I'm really open to any further suggestions. 

 

MDGM noticed that I had "a fair bit of stuff installed, including stuff via SSH".

In response to that: I am aware of this, though mostly they were packages installed via the web interface. What I installed using SSH was mostly support tools (vim, screen, iostat, ngrep, etc.). Only nautilus (which was the GUI version of Dropbox instead of the headless version) was due to an error of mine.

If you did, however, see anything that might be remotely related to the problem, I'm all ears.

 

Message 13 of 71
FramerV
NETGEAR Employee Retired

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

Hi MaxxMark,

 

I cannot really add anything else to what you have done so far. Having a backup at this point would be best; then upgrade, or maybe wait for the beta firmware. If the firmware still fails to fix the issue, then I think it's hardware related.

 

 

Regards,

Message 14 of 71
JennC
NETGEAR Employee Retired

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

Hello MaxxMark,

 

You may now download 6.7.0-T158 (Beta 1).

 

Regards,

Message 15 of 71
Sandshark
Sensei

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

Sounds a lot like what this other user is experiencing: ReadyNAS-Pro-becomes-unresponsive-will-not-reboot-until/m-p/1230383#U1230383

 

See my response to him and if one of you does figure out the true cause, you can pass it on to the other.

Message 16 of 71
MaxxMark
Luminary

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

So I have an interesting development since last time.

 

As I am, again, working towards doing a full re-init of the unit, I started doing online-backups to an offsite location (which will probably take a while, but time is something we currently have).

 

This has a significant impact on the NAS's CPU and load (but not so much on bandwidth), and as such I expected the NAS to freeze regularly as it did before.

 

You can see the past 14d in these graphs:

 

 

A few things are noteworthy:

1. Throughput varies between 20 Mbps and 50 Mbps, with occasional spikes (being data transferred to the NAS)

2. CPU jumps skyrocketed from around 1,000-1,500 to between 10,000 and 20,000

3. CPU load went from 0.1 to an average above 2, with peaks going well above 3

4. CPU utilization went to approximately 100% (the blue part was when the backup process ran at normal priority; after the 13th, I reniced the process so that it only uses idle CPU time. I did the renice after I had to shut down the backup due to the weekly scheduled disk balance, which fell on the 13th)

 

Now to the most interesting and intriguing part: while the backup process was running, the NAS never became unresponsive. It ran from the 12th to the 18th without any interruptions. On the 18th I temporarily shut down the backup process, started it again on the 19th, and turned it off once more on the 20th due to the weekly scheduled balancing.

 

Then, on the 21st, just past 9 AM, it became unresponsive again.

 

 

And at the time nothing of interest was happening:

 

 

 

 

That evening I turned the backup on again, and the NAS is running happily again.

 

It surely doesn't feel like a coincidence anymore. When running under full (and excessively high) load, nothing happens at all. But when there is almost nothing going on, problems arise.

 

As a side note: everything else besides the backup process is identical and keeps running alongside.

 

Regarding the new beta: I will try it, but as I have read some reports (and had first-hand experience) of volumes becoming invisible after updating to the newer versions, I want to wait until I can back up my NAS completely.

 

Message 17 of 71
Sandshark
Sensei

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

Do you have drive spin-down enabled?  Not that I know why it would make a difference, but maybe try turning it off if you do.  They certainly won't spin down while doing a backup.

Message 18 of 71
MaxxMark
Luminary

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

When reading your response regarding the disk spin-down, I had high hopes, because I couldn't remember if I explicitly turned it off when I upgraded to OS6. If it were that simple I'd be very happy. However, it turns out disk spin-down was disabled after all.

 

Just for the sake of trying (I'm willing to try anything right now), I turned it on, and off again.

Message 19 of 71
MaxxMark
Luminary

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

"Finally" my NAS crashed again. This time during a weekend away, so nothing much happened on the NAS at all.

 

As stated in the other topic regarding this subject, my voltages were correct and I decided to log ps and top outputs (see: https://community.netgear.com/t5/Using-your-ReadyNAS/ReadyNAS-Pro-becomes-unresponsive-will-not-rebo...), as well as logging SAR output in binary format.

 

It seemed that the intensive logging (which was logging every second) had the same effect as running my backups: when the system is under stress/load, it crashes less often.

 

The files and images are at the end of the post.

 

My preliminary findings from the log files and the statistics gathered by Zabbix are interesting:

1. Logging to Zabbix halted at approximately 05:28:20. This is typical behaviour in the crashes.

2. The logging of the ps output (which simply logged the current date and time followed by the ps output) stopped writing to the disk no more than a minute later. The last timestamp reported was 05:29:07.

3. The interesting thing is that the logging of the top output (which simply logged its output, with no additional information) continued far longer than any other. It continued writing to the disk up until 06:00:48 (a little over 30 minutes longer).

 

Looking at the load averages in the top outputs, there is a decrease in load prior to the 'crash' (to 0.12 at its lowest), which isn't that surprising (it could be due to several reasons, and it is not extremely low in comparison to other times). However, after the reporting halted (around 05:29:15) the load starts to increase from 0.25 to 1.00 at 05:30:05, increases steadily to 2.00 at 05:35:26, and continues to rise to 3.08 at 06:00:20, after which it decreases slightly to 3.05 at 06:00:48, where the logging stopped.
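
A load timeline like this can be pulled out of a large top log automatically instead of by eye. The sketch below assumes the log was produced by top in batch mode, where each snapshot starts with a header line of the form `top - HH:MM:SS up ..., load average: a, b, c`; the function names are illustrative, and locales that print decimal commas would need a different pattern.

```python
import re

# Matches the first line of each top snapshot and captures the clock time
# plus the 1-, 5- and 15-minute load averages.
HEADER = re.compile(
    r"^top - (\d{2}:\d{2}:\d{2}) up .*load average: ([\d.]+), ([\d.]+), ([\d.]+)"
)


def load_series(lines):
    """Yield (time, 1-minute load) for every top header line found."""
    for line in lines:
        match = HEADER.match(line)
        if match:
            yield match.group(1), float(match.group(2))


def first_crossing(series, threshold):
    """Return the first timestamp whose 1-minute load is >= threshold, or None."""
    for stamp, load in series:
        if load >= threshold:
            return stamp
    return None


# Example (file name is an assumption):
# with open("top.out") as handle:
#     print(first_crossing(load_series(handle), 1.0))
```

Since the generator reads line by line, it should cope with multi-gigabyte logs without loading them into memory.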

 

When inspecting the top outputs around the highest load moments, the surprising thing is that the system doesn't seem to be doing a lot. It kind of feels like 'fake-load' (like what happens when a unix system has an NFS mount and the NFS server goes away, this creates blocking processes which seem to be generating load, but don't affect the system).  The only interesting thing was seen in the last load 3.08 output:
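
On Linux the load average counts not only runnable tasks but also tasks in uninterruptible sleep (state `D`, typically waiting on I/O), which is one way load can rise while the CPU is idle. A minimal sketch for spotting such tasks is shown below; it assumes the standard `/proc/<pid>/stat` layout, looks only at processes (not individual threads), and the function names are made up for this example. Whether it could still run once the NAS starts to hang is untested.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the content of a /proc/<pid>/stat file."""
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")  # the command name itself may contain ')'
    pid = int(stat_text[:open_paren])
    command = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, command, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, command) for processes currently in state D (uninterruptible sleep)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the file was unreadable
        if state == "D":
            blocked.append((pid, command))
    return blocked
```

Logging the result of `blocked_tasks()` once per second alongside the load average would show whether the rising load coincides with a growing set of blocked processes.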

 

 

I don't expect it to mean much, as it doesn't look like most other top outputs. But it is interesting to see the raid6 process as well as the readynasd process here.

 

The rest is mostly the logging processes (top, for the top output; sar, for the sar binary logging; sadc, used by sar; screen, used for running the logging processes in the background).

 

The CPU times accumulated by the processes don't seem out of the ordinary either:

raid6 has the most time (seems logical, as it handles all RAID 6 parity calculations etc.)

readynasd, as it probably does a lot to govern the system

And the logging items run every second and probably gobble up some time as well

 

 

Maybe someone else sees something interesting, or has any hint/suggestion on what (or how) to monitor next.

 

 

 

I have created two downloads containing the last data from the log files (the complete files amounted to around 30 GB of data).

 

They can be found at:

http://www.maxxmark.com/dropbox/ps.out.sample.gz

http://www.maxxmark.com/dropbox/top.out.sample.gz

 

(I haven't been able to create a satisfactory sample set from the SAR binary output without first converting the 18 GB of binary data to text and then sampling the last moments. I'll append it if I have something extra.)

 

 

screencapture-zabbix-omniscale-nl-screens-php-1491750664867.png

 

Message 20 of 71
kekegsm
Guide

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

I have the same issue.

 

This "high usage, fewer crashes" idea is interesting. Maybe it has something to do with a CPU low-power state?
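
One way to see which idle states the kernel actually uses is to read the cpuidle entries in sysfs. This is only a sketch: it assumes the kernel exposes `/sys/devices/system/cpu/cpu0/cpuidle/state*/` with `name` and `usage` files, which may not be the case on the ReadyNAS kernel, and the function name is made up for illustration.

```python
import glob
import os


def read_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return [(state_dir, name, usage_count)] for each cpuidle state of one CPU."""
    pattern = os.path.join(cpu_dir, "cpuidle", "state[0-9]*")
    # Sort numerically so that state10 comes after state2.
    state_dirs = sorted(glob.glob(pattern), key=lambda p: int(os.path.basename(p)[5:]))
    states = []
    for state_dir in state_dirs:
        with open(os.path.join(state_dir, "name")) as handle:
            name = handle.read().strip()
        with open(os.path.join(state_dir, "usage")) as handle:
            usage = int(handle.read())  # how many times this state was entered
        states.append((os.path.basename(state_dir), name, usage))
    return states


for state, name, usage in read_idle_states():
    print(f"{state}: {name} entered {usage} times")
```

If no cpuidle directory exists the loop simply prints nothing. Comparing the counters between an idle period and a busy backup period would show whether the deeper states are only entered when the NAS is quiet.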

Message 21 of 71
mdgm-ntgr
NETGEAR Employee Retired

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

Can you send in your logs zip downloaded from the web admin interface or using RAIDar 6.2 (see the Sending logs link in my sig)?

Message 22 of 71
kekegsm
Guide

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

I have it again. Something is using the hardware heavily; see the output of the "top" command (attachment: netgear_top.jpg). I have an E7600 CPU and 4GB RAM.

Message 23 of 71
mdgm-ntgr
NETGEAR Employee Retired

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

I'm seeing lots of errors about your USB disk. Perhaps disconnect that and see what difference that makes.

Message 24 of 71
MaxxMark
Luminary

Re: [Pro Pioneer 6.5.1] NAS becomes unresponsive after random interval

FYI: I have just sent the log files

Message 25 of 71