
Poor NFS Performance (when all users active?)

dmarshx
Aspirant

Poor NFS Performance (when all users active?)

We are migrating from a homegrown NFS system to a ReadyNAS 4200. The hardware capability of the ReadyNAS exceeds the perceived capabilities of the existing system. The data was imported from the old system using rsync. The firmware is 4.2.15, with LACP and jumbo frames, connected to a Cisco Catalyst 4500 L3 switch. Things were going well, but our users are starting to complain of 10-15 second "lags" when listing folders and of sluggishness in a specific program. We have roughly 100 client systems connected via NFS. We have the number of NFS threads set to 16; unfortunately, the ReadyNAS will not allow this to be set higher. The problem only appears to occur during business hours when all users are active. We have not been able to reproduce the issue after hours.
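
For anyone who wants to check the same thing: on a generic Linux NFS server the kernel exposes the thread count and a busy-thread summary. This is a sketch only; the exact paths may differ on the ReadyNAS firmware:

    # current number of nfsd threads
    cat /proc/fs/nfsd/threads

    # the "th" line reports the thread count and how often every
    # thread was busy at once -- a sign the server is thread-starved
    grep ^th /proc/net/rpc/nfsd

    # server-side RPC/NFS call counters (badcalls, per-op counts)
    nfsstat -s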

We have network monitoring tools that show the teaming appears to be working. We have four other ReadyNAS 4200/3200 devices, all running 4.2.15, and none of them show this issue. All of the other ReadyNAS devices also use LACP and jumbo frames. We have a benchmark script that can drive read/write activity well in excess of what the users generate; however, this script performs sequential I/O operations.
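
Since the complaints are about directory listings rather than bulk transfers, a crude random-read loop might come closer to the interactive load than our sequential script does. A minimal sketch; the mount point and test file below are hypothetical placeholders:

    # read single 4 KiB blocks at random offsets within a large file
    # on the NFS mount, to approximate scattered interactive I/O
    FILE=/mnt/nas/testfile       # placeholder: any large file on the mount
    BLOCKS=4194304               # 4 KiB blocks in a 16 GiB test file
    for i in $(seq 1 1000); do
        off=$(( (RANDOM * 32768 + RANDOM) % BLOCKS ))
        dd if="$FILE" of=/dev/null bs=4k count=1 skip="$off" 2>/dev/null
    done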

Our ReadyNAS logs have numerous error messages about statd and lockd. Additionally, there are 4-5 segfaults each day:
kernel: message_handler: segfault at 0 ip 00000000f751b043 sp 00000000ffdc1ea4 error 4 in libc-2.7.so[f74a9000+138000]
kernel: statd: server rpc.statd not responding, timed out
kernel: lockd: cannot monitor xxxxxx
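
When these messages appear, one can at least confirm whether rpc.statd is still registered with the portmapper, using standard RPC tooling (assuming it is available on the ReadyNAS shell):

    # list registered RPC services; "status" is rpc.statd and
    # "nlockmgr" is lockd -- if "status" is missing, statd has died
    rpcinfo -p localhost

    # check whether the statd process itself is still running
    ps ax | grep '[r]pc.statd'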

On this ReadyNAS we are using 4 TB out of 18 TB and have roughly 54 million files. On another ReadyNAS we are using 14 TB out of 18 TB with roughly 18 million files.

Sniffer traces taken while running our benchmark script show that the TCP window sizes appear smaller than we would have expected.
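
For comparison, the advertised window is bounded by the socket buffer sysctls, which can be read on any Linux box (the ReadyNAS kernel is old, so some of these names may differ there):

    # maximum socket receive/send buffer sizes in bytes
    sysctl net.core.rmem_max net.core.wmem_max

    # TCP autotuning ranges: min, default, max (bytes)
    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

    # window scaling must be enabled for windows beyond 64 KiB
    sysctl net.ipv4.tcp_window_scaling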

One thought is that the number of NFS threads may be too low. However, we have been unwilling to install the rootssh add-on and raise the thread count to 32 or 64.
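
For reference, on a stock Linux NFS server the change itself is a one-liner; I assume the same interfaces exist under rootssh on the ReadyNAS, but I have not verified that, and the setting may not survive a reboot:

    # ask the kernel to run 32 nfsd threads (takes effect immediately)
    rpc.nfsd 32

    # equivalently, write the count directly to the nfsd filesystem
    echo 32 > /proc/fs/nfsd/threads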

Has anyone seen something similar? Does anyone have any ideas that we should investigate?

Thanks,
David
Message 1 of 6
WhoCares_
Mentor

Re: Poor NFS Performance (when all users active?)

NFS lags often occur when you have too many files in one directory. Since you seem to have a lot of files, that would certainly be a point to investigate. In addition, a lot of files in the same directory cause excessive RAM usage, which may relate to the segfaults you're seeing.
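
A quick way to find the worst offenders would be something like the following, run against your data volume (/data below is just a placeholder):

    # list the 20 directories with the most entries on the volume
    find /data -xdev -type d -print0 |
    while IFS= read -r -d '' d; do
        printf '%d\t%s\n' "$(ls -A "$d" 2>/dev/null | wc -l)" "$d"
    done | sort -rn | head -20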

-Stefan
Message 2 of 6
dmarshx
Aspirant

Re: Poor NFS Performance (when all users active?)

SNMP access to the CPU and memory counters shows plenty of free RAM and low CPU usage.

The NFS lags being reported seem to be random and do not occur all of the time. We will attempt to determine whether this only happens with folders that contain many files.

In your environment what constitutes "too many files"? I've seen 32K listed as a limit, but never any number less than that.
Message 3 of 6
WhoCares_
Mentor

Re: Poor NFS Performance (when all users active?)

dmarshx wrote:
SNMP access to the CPU and MEMORY counters show plenty of free RAM and low CPU Usage.

That may well be, for the task may still be "busy" looking up inodes on the file system.

dmarshx wrote:
In your environment what constitutes "too many files"? I've seen 32K listed as a limit, but never any number less than that.

I *try* to have our developers build applications in such a way that they never create more than ~2048 files per directory. The inner workings behind the scenes are actually a lot more complex than that. On an ext filesystem, you define the number of inodes, and with that the expected number of directory entries, when the file system is created. The system then reserves "space" for the estimated number of files/dirs for faster indexing. If you cross those initial limits, creating files and dirs is still possible, but secondary areas of the hard disk will be used instead of the reserved blocks. The net effect is that past a certain point, additional and thus "costly" seeking of the hard drive's heads occurs. Especially in multiuser environments this can get problematic pretty fast. So my current guess is that your NFS acts the way it does because you crossed a boundary on the ReadyNAS' local filesystem. But of course that's just a guess.
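
You can get a rough sense of where you stand with the standard ext tools; substitute your actual mount point and device, the ones below are placeholders:

    # inode usage on the data volume (IUsed vs. Inodes)
    df -i /data

    # inode totals as recorded in the superblock
    tune2fs -l /dev/sda3 | grep -i inode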

-Stefan
Message 4 of 6
dmarshx
Aspirant

Re: Poor NFS Performance (when all users active?)

ReadyNAS support has asked us to try disabling LAG (teaming) and jumbo frames on the ReadyNAS. We are reluctant to do so:
    Our prior experience with changing the ReadyNAS teaming means we must put someone on site before making that change.
    We have had several cases where changing a LACP link with NETGEAR gear rendered the device totally unavailable and extremely difficult to restore. In one case it required a factory reset.
    There are no network errors on any of the switches or devices involved.
    MRTG reports show that we can drive read/write bandwidth across the team to 75% of the LAN maximum.

We were leaning towards NFS threads because we had 10 and received warnings. After we bumped it to 16, we began to see the statd and lockd errors and the segfaults. We have 100 NFS clients, and the problem only appears when all users are active. But I think we will check the number of files per folder first to see if we can uncover something.
Message 5 of 6
qarce
Aspirant

Re: Poor NFS Performance (when all users active?)

Hello,

Did your problem ever get fixed?

I am having the same issue with my ReadyNAS Ultra 4 Plus [X-RAID2].

statd: server rpc.statd not responding, timed out
lockd: cannot monitor somehost.com
hrtimer: interrupt took 15363 ns
rpc.statd[2121]: segfault at a ip 00000000f774a4ae sp 00000000ff9af0bc error 4 in libc-2.7.so[f765a000+138000]


Did NETGEAR ever find the root cause and fix it?
I plan on purchasing a few more of these boxes... IF this is fixed.
Message 6 of 6