jlehtinen (Aspirant)
Mar 29, 2013
ESXi reports "All Paths Down" for ReadyNAS hosted NFS share
Hiya - looking for some feedback from the community on an issue I'm seeing. Thanks in advance for any insights.
Some background:
We're using two ReadyNAS 3200's to host virtual machines via NFS.
ESXi hosts are running ESXi 5.1.
The ReadyNAS units are running 4.2.19, and have "adaptive load balancing" set on the NICs.
Issue:
I'm seeing some of the ESXi hosts report that NFS shares enter "All Paths Down" state for 6-7 seconds, before exiting this status and reconnecting. This happens for BOTH ReadyNAS units, and on 9 ESXi hosts, with no solid pattern on which host is impacted OR which ReadyNAS shows as "All Paths Down". It DOES appear to be related to the current load on the ReadyNAS. For example, if I start a backup job, I can expect to see this error on at least 3-4 ESXi hosts. I believe this has been happening for a while without anyone noticing, but it caused a HUGE issue 2 weeks ago, when one of the ReadyNAS units entered/exited "All Paths Down" state nonstop while backups were running. (I opened a support case with Netgear and submitted the logs, but they could not explain why this happened.)
Current theory:
From what I can tell, adaptive load balancing causes the ReadyNAS to change what MAC address (and NIC) is receiving traffic for a certain percentage of the overall traffic. It's my guess that when I run backups (or do anything else load intensive), the ReadyNAS attempts to load balance some of the traffic going to the ESXi hosts. The resulting change to the MAC address being reported to the ESXi host causes ESXi to report "all paths down" briefly before the new MAC address/NIC resolves correctly.
The issue we experienced must have been due to a glitch or bug in the load balancing, which caused the ReadyNAS to fail to "stabilize" the load balancing correctly. I was only able to stabilize the unit by power cycling it.
Questions:
1.) Does this sound like a plausible theory? My current thinking is I should disable load balancing and go to active-backup configuration to see if this resolves the issue.
2.) Will a firmware update resolve this issue? I reviewed the firmware patch notes and none of them mention NFS stability with NIC teaming.
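One way to test the load theory independently of ESXi is to probe the NAS from another machine and record when it stops answering. The following is a minimal sketch, not a definitive diagnostic: it assumes the shares are served over TCP on the standard NFS port 2049, the host name `readynas-1.example.local` is a hypothetical placeholder, and a failed TCP connect is only a rough stand-in for what ESXi reports as "All Paths Down".

```python
import socket
import time
from typing import Iterable, List, Optional, Tuple


def probe(host: str, port: int = 2049, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse (timestamp, reachable) samples into (start, end) outage windows.

    An outage starts at the first failed sample and ends at the next
    successful one. An outage still open at the last sample is closed
    at that sample's timestamp.
    """
    windows: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
        last_ts = ts
    if start is not None and last_ts is not None:
        windows.append((start, last_ts))
    return windows


if __name__ == "__main__":
    NAS_HOST = "readynas-1.example.local"  # hypothetical host name
    samples: List[Tuple[float, bool]] = []
    try:
        while True:  # probe once per second until Ctrl+C
            samples.append((time.time(), probe(NAS_HOST)))
            time.sleep(1.0)
    except KeyboardInterrupt:
        for start, end in outage_windows(samples):
            print(f"unreachable for {end - start:.1f}s starting at {time.ctime(start)}")
```

Running this during a backup job and comparing the printed windows with the ESXi event times would show whether the 6-7 second gaps are visible from outside the hosts. If the probe never fails while ESXi still reports the error, that would point away from a general loss of connectivity on the NAS side.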
22 Replies
- StephenB (Guru - Experienced User)
jlehtinen wrote: "...I was also told by tech support that a complete re-build on these units is expected to be completed every 1-2 years. This was news to me. I think they switched from MBR to GPT partitions, but the disks maintain the MBR partitioning unless you do a factory reset and re-build the array.
I've got a plan in place to try a full rebuild on one of our units, but it's going to take a while. I don't know how you could do a yearly re-build on these unless you keep one storage array as a "spare" and use it to store data while you are re-building another array."
First of all, I don't buy that response. You shouldn't need to rebuild the RAID array under normal circumstances. As you point out, there is a lot of down time involved. Proposing it as routine maintenance seems crazy to me.
Though if you want to protect your data from loss, you should have a backup plan in place, so it would always be possible to rebuild the NAS if need be. Backing up to a second NAS is one approach, backing up to a cloud service is also possible. I do both (the cloud gives me good disaster recovery, but I am not sure I can count on it).
There are certainly some cases where a factory reset / rebuild is needed (or a good idea). There's a helpful article on the subject here: http://www.rnasguide.com/2011/06/22/why ... -readynas/
- jlehtinen (Aspirant)
Yeah, I'm unsure how I feel about it. If it's true, then it undermines the concept of the ReadyNAS brand being enterprise grade.
The tech I spoke to said that the yearly rebuild is recommended for data center environments, or when the storage is under consistent heavy load. If I remember correctly he said it's recommended because the arrays pick up file system corruption that causes performance and stability issues over time. If you're just using the ReadyNAS to store relatively static data and your I/O load is low, maybe it's a different story.
The article you linked is good - it also notes that a factory re-set and full rebuild is needed to take advantage of most of the 'major' firmware upgrades. To stay on the topic of this thread (NFS instability), I'm curious to see if a full factory reset on newest firmware will resolve the issue.
- jlehtinen (Aspirant)
FYI - I set up a NetApp FAS2220 in our environment and do not experience this issue on that storage. This seems to indicate it's an issue specific to the ReadyNAS units. I'm currently in process of RMA'ing several drives, and then will complete a full rebuild on the ReadyNAS to see if this corrects the issue.
- jlehtinen (Aspirant)
As an update...
I completed a full rebuild on my 2 ReadyNAS 3200's. This included replacing any drives with errors, upgrading the firmware to the newest version, resetting the device to factory defaults, then re-configuring the volume and shares.
Since completing this, I have noticed fewer instances of this error. It no longer happens at 'random', or when I'm running backups. This is an improvement.
However, there are still issues whenever the arrays complete a RAID scrub. The last time a scrub ran, my monitoring software registered 200-300 "storage is not accessible" events on the ESXi hosts. These stopped immediately once the RAID scrub finished. This seems to indicate that the arrays are not able to maintain consistent network connectivity while under the load caused by the scrub.
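To put a number on how closely the events track the scrub, the event timestamps can be split into those inside and outside the scrub window. This is a minimal sketch under stated assumptions: the timestamps would have to be exported from whatever monitoring tool is in use, and the values in the example run are synthetic placeholders, not data from this environment.

```python
from datetime import datetime
from typing import Iterable, Tuple


def count_by_window(
    events: Iterable[datetime], start: datetime, end: datetime
) -> Tuple[int, int]:
    """Return (inside, outside) counts of events relative to [start, end]."""
    events = list(events)
    inside = sum(1 for t in events if start <= t <= end)
    return inside, len(events) - inside


if __name__ == "__main__":
    # Synthetic placeholder values, for illustration only.
    scrub_start = datetime(2000, 1, 1, 1, 0)
    scrub_end = datetime(2000, 1, 1, 5, 0)
    events = [
        datetime(2000, 1, 1, 0, 30),
        datetime(2000, 1, 1, 2, 0),
        datetime(2000, 1, 1, 4, 59),
        datetime(2000, 1, 1, 6, 0),
    ]
    inside, outside = count_by_window(events, scrub_start, scrub_end)
    print(f"{inside} events during the scrub, {outside} outside it")
```

If nearly all events fall inside the window across several scrubs, that supports the load explanation. A meaningful count outside the window would suggest something else is also going on.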
So far this flakey connectivity has not caused data loss or corruption in my environment. However, it means I have to hand-hold the environment whenever a RAID scrub is scheduled.
IMO, these units are not stable enough for a VMware environment. I've had zero issues with my NetApp FAS2220 that is using the exact same infrastructure and is actually running a heavier load.
- gavind (Aspirant)
Hi jlehtinen, just to confirm, what hardware are you currently using for a VMware environment? Coz I'm currently shopping around.
- cyrill1 (Aspirant)
Hi jlehtinen! Did you manage to sort out the problem? I am experiencing the same with RN4220s/ESXi 5.5u1 - in my case it becomes unavailable under random circumstances. Moreover, the device management web interface isn't available (the device is offline) and ssh sessions drop just after I provide correct logon credentials.
- jlehtinen (Aspirant)
Sorry I didn't see your posts... I haven't been around these forums in a long time.
@gavind:
For storage I've got 2x ReadyNAS 3200's, 1 NetApp FAS2220, and 1 NetApp FAS2520.
Personally I like the NetApps better. They have a ton of functionality even w/ basic licensing. You get tons of data on performance, usage, and other metrics without having to install hackjob 3rd party mods or other weird crap. There's much better support and detailed documentation. NetApp also treats their product like it's enterprise grade, so you won't get techs recommending you should reboot a production storage array in the middle of business hours. :wink:
To be fair, ReadyNAS does some things well, and they're cheaper... so it all depends on if you want a rock-solid platform, or if you need to save some $$$.
@cyrill:
No, status hasn't changed since my last post. The ReadyNAS 3200's are stable unless there's a RAID scrub running, and then they get flakey. It hasn't caused issues yet, ESXi seems to be able to cope with the I/O dropping in/out, and I haven't seen data loss yet. I'm not happy about the situation but there's nothing I can do as long as I need to keep the units running. For your case, I think you might have some other issue, as my problems were all related to connectivity loss while the unit was under high load. You might have problems with a bad NIC, cable, etc.
- mdgm-ntgr (NETGEAR Employee Retired)
For those of you with OS6 NAS Units which version of ESXi are you running now? Can you downgrade to build 1331820 if you are running a newer version?
Is this RAID scrub issue you are facing on a 3220 or a 3200? What firmware version are you running (version number please)?
- jlehtinen (Aspirant)
Both units are ReadyNAS 3200. I have one on 4.2.26, other is still 4.2.24. Both have same issue.
- mdgm-ntgr (NETGEAR Employee Retired)
jlehtinen what build of ESXi are you using?