
Forum Discussion

rraszews
Aspirant
Jan 04, 2015

ReadyNAS 314 nfs issue

Hi,

I've been pretty happy with my ReadyNAS since I got it a year ago, but in the past two weeks, everything suddenly started going up the spout.

I got my RN314 preinstalled with four Seagate 3 TB Barracuda drives. Two weeks ago, one of the hard drives failed. Okay, I figured, that happens. I was a little worried because I didn't have a spare on hand, so if another drive had failed before a replacement could be shipped, that would have been bad, but the array held out, I was able to rebuild it, and all was well again.

I don't normally check Frontview all that often, so it was only when I had to replace the drive that I saw the OS 6.2.0 upgrade had come out. Once the rebuild was complete, I installed it. All went well for about a week, and then one morning at about 4 AM, the box just stopped responding to NFS entirely. Nothing in Frontview indicated anything was wrong, and toggling the NFS service off and on didn't help. Most of the computers on my network that relied on the ReadyNAS were pretty much unusable, since they'd block waiting on I/O whenever anything touched the mountpoint. Rebooting solved the problem, and all was well for about another week.

Then it happened again, at 4 AM. This time I pulled all the logs. Again, rebooting solved the problem. At this point I had a sneaking suspicion that the issue was being triggered by my scheduled backup, which runs at 4 AM: I back up all my computers via rsnapshot once a day. I'd set up this backup job about a month earlier, and sure enough, both days the failure occurred were days when it was aging off an old weekly backup. It had worked fine doing the daily backups and aging off old daily backups, but when it tried to age off the oldest weekly backup, something bad happened in the kernel:

BUG: unable to handle kernel paging request at 00000000265fe124
IP: [<ffffffff880f0498>] __d_lookup+0x78/0x140
PGD 69ac1067 PUD 69ac3067 PMD 0
Oops: 0000 [#1] SMP
CPU 1
Modules linked in: ir_rc6_decoder ir_rc5_decoder ir_nec_decoder ite_cir rc_core rn_gpio vpd(P)
Pid: 2827, comm: nfsd Tainted: P 3.0.101.RNx86_64.3 #1 NETGEAR ReadyNAS 314 /ReadyNAS 314



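For anyone unfamiliar with rsnapshot, the rotation it performs at 4 AM boils down to something like this (a minimal sketch with throwaway paths, not rsnapshot's actual code): delete the oldest interval directory, shift the rest down, and make a hard-linked copy of the newest one. Note that expiring the oldest weekly backup is a recursive delete of a huge hard-linked tree, which is an extremely unlink-heavy workload.

```shell
# Minimal sketch of an rsnapshot-style rotation (illustrative paths, not
# rsnapshot's real code). Uses a throwaway directory so it is safe to run.
set -e
ROOT=$(mktemp -d)
mkdir -p "$ROOT/weekly.0" "$ROOT/weekly.1"
echo "data" > "$ROOT/weekly.0/file"

rm -rf "$ROOT/weekly.1"                   # age off the oldest weekly backup
mv "$ROOT/weekly.0" "$ROOT/weekly.1"      # shift the remaining one down
cp -al "$ROOT/weekly.1" "$ROOT/weekly.0"  # hard-linked copy, not a data copy

# Unchanged files in both trees share a single inode:
ls -i "$ROOT/weekly.0/file" "$ROOT/weekly.1/file"
rm -rf "$ROOT"
```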
To troubleshoot the NFS issue, I tried disabling my weekly backup rotation and just doing the daily one. All went well the next morning, but the day after that, I tried to delete the old backup manually from a root shell on the ReadyNAS. This time NFS didn't go down, but I did get a log message largely identical to the one above, except that "comm: nfsd" was replaced by "comm: rm". After rebooting, I installed the 6.2.2 firmware.

Now this morning, it happened again at 4 AM during the backup, but this time not while anything was being aged off. I do notice one difference in the logged kernel error: the "BUG: unable to handle kernel paging request" line is replaced by "general protection fault: 0000 [#2] SMP CPU 3". However, from my logs it looks like NFS ALSO stopped responding for a few minutes around midnight, but then recovered. I had been running a scrub overnight, which completed with no problems found.

Bit-rot protection is turned on. SMB, AFP, NFS, ReadyDLNA, Rsync, UPnP, HTTP, HTTPS, SNMP and SSH are turned on; Antivirus, FTP and iTunes are turned off. I'm also running the SoftEther VPN server app. I did have one incident where SNMP stopped responding for a few minutes, in case that's relevant.

Does anyone know what I can do to get my system stable again? I know there are a lot of moving parts here: the backup job (could having a very large number of small files and lots of hard links be exposing a bug?); the new hard drive (probably not directly to blame, but maybe pulling the drive jiggled the RAM?); and the OS upgrade. If this sounds like a hardware issue, what can I do to keep the system working while I come up with somewhere to hold my data and get it repaired or replaced? If it's some kind of bug or limitation, how can I work around it? I'm going to move my backups to a different server, but I'm very worried that the issue is progressive and that moving the backup is just delaying the inevitable.

Thanks.

9 Replies

  • mdgm-ntgr
    NETGEAR Employee Retired
    Can you update to 6.2.2 and then try lowering the NFS thread count?

    Which share are you using with NFS?

    Do you have a backup of data primarily stored on the NAS?

    Welcome to the forum!
  • I think you and I are having this same conversation on Twitter, but for the sake of posterity:

I've lowered the NFS thread count to 2. I'm using a custom share with snapshots turned off and bit-rot protection turned on (at the time I set it up, I didn't know what bit-rot protection was; I have no objection to turning it off if that helps). I've just created a new share without bit-rot protection or snapshots and will try using that as my backup destination tonight to see what happens.

Since I'm using about 4.2 TB of space on the NAS, local backups aren't really a practical option. When I started having trouble, I set up an offsite cloud backup job, but since I'm limited by residential FiOS upload speeds, it'll take a few weeks to catch up. (My most critical data is already backed up, but I hadn't pulled the trigger on backing up my digital media because of the sheer amount of it.)
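    For anyone following along: on a stock Linux NFS server the thread count is exposed through the standard knfsd interface, so the change above can be made from a root SSH shell roughly like this (I'm assuming OS 6 uses the usual /proc path; the snippet is guarded so it's safe to run on a machine where nfsd isn't loaded):

```shell
# Check (and, as root, optionally lower) the kernel NFS server thread count.
# Assumes the standard Linux knfsd /proc interface.
THREADS=/proc/fs/nfsd/threads
if [ -r "$THREADS" ]; then
    echo "current nfsd threads: $(cat "$THREADS")"
    # echo 2 > "$THREADS"   # as root: lower to 2 without restarting nfsd
else
    echo "nfsd is not running on this host"
fi
```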
  • Quick update: I haven't had any issues since I switched my backup to the new share, but obviously it will take a few weeks of backups before I've recreated the situation that I had when the problem started (since it seemed like it only started when it was aging off the older backups), so I'll keep monitoring.

    However, I usually do a poll via snmp every half hour to check on the system status and health. Twice now, I've gotten failures pulling READYNASOS-MIB::temperatureValue.2. The vast majority of the time, it works fine. It's probably nothing, but I thought I'd mention it in case it was connected.
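    For reference, the poll is essentially a single snmpget against that OID (the hostname and community string below are illustrative; this assumes Net-SNMP's snmpget with the READYNASOS MIB file installed):

```shell
# Poll one temperature sensor on the NAS via SNMP. Hostname and community
# string are illustrative assumptions; requires Net-SNMP's snmpget and the
# READYNASOS MIB. Guarded so it is safe to run where snmpget is absent.
NAS=readynas.local
if command -v snmpget >/dev/null 2>&1; then
    snmpget -v2c -c public "$NAS" READYNASOS-MIB::temperatureValue.2
else
    echo "snmpget not installed on this host"
fi
```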
  • Oh dear. The problem recurred this morning, with the same message as before. It happened about two minutes into a large rsync onto the new share that doesn't have bit-rot protection. Since lots of computers on my network use the ReadyNAS, I can't say for sure whether anything else was going on at the same time, but there shouldn't have been any other scheduled tasks occurring then.

    I'm going to reduce the amount of backing up I do for a while and see if there's a particular usage pattern that triggers the issue. Obviously, that's not a good long-term solution, but if I can't directly fix the problem, I at least have to have a workaround while I decide on a replacement.
  • To complicate the matter, I just got another kernel oops while deleting snapshots:


    general protection fault: 0000 [#1] SMP
    CPU 2
    Modules linked in: ir_rc6_decoder ir_rc5_decoder ir_nec_decoder ite_cir rc_core rn_gpio vpd(P)
    Pid: 12863, comm: snapperd Tainted: P 3.0.101.RNx86_64.3 #1 NETGEAR ReadyNAS 314 /ReadyNAS 314
    (...)
    Call Trace:
    [<ffffffff880e3fd8>] ? getname_flags+0x38/0x220
    [<ffffffff880f0594>] d_lookup+0x34/0x60
    [<ffffffff880e4434>] __lookup_hash+0x94/0x1a0
    [<ffffffff880e7005>] ? user_path_parent+0x55/0x90
    [<ffffffff880e4554>] lookup_hash+0x14/0x20
    [<ffffffff880e71ff>] do_unlinkat+0xaf/0x230
    [<ffffffff880dae14>] ? fput+0x164/0x210
    [<ffffffff880d6ee5>] ? filp_close+0x65/0xa0
    [<ffffffff880e8986>] sys_unlinkat+0x16/0x40
    [<ffffffff8888c1fb>] system_call_fastpath+0x16/0x1b



I notice the call trace for this one is a bit different: usually the top of the stack is __d_lookup, but this time it's getname_flags.

After this, Frontview wouldn't reload until I rebooted.
  • mdgm-ntgr
    NETGEAR Employee Retired
I think at this point it would be worthwhile for you to back up your data, do a factory reset (which wipes all data, settings, everything), and restore your data from backup.
  • I was afraid of that. While I work on getting a big enough backup solution together, is it liable to make a difference if I run a balance or defrag on the volume?

    Also, I saw somewhere on the forums that it's possible to run a test of the physical memory. Is there enough chance of that being the issue to make it worth trying?

    Thanks.
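    For reference, the balance and defrag mentioned above are plain btrfs operations run from a root SSH shell (the /data mount point below is an assumption about where the OS 6 volume lives; the snippet is guarded so it only acts on an actual btrfs volume):

```shell
# Btrfs maintenance operations discussed above. /data is an assumed mount
# point; guarded so this is a no-op on hosts without a btrfs volume there.
VOL=/data
if command -v btrfs >/dev/null 2>&1 && btrfs filesystem df "$VOL" >/dev/null 2>&1; then
    btrfs filesystem defragment -r "$VOL"   # rewrite fragmented file extents
    btrfs balance start "$VOL"              # rewrite and redistribute block groups
else
    echo "$VOL is not a btrfs volume on this host; skipping"
fi
```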
  • mdgm-ntgr
    NETGEAR Employee Retired
You could try the other things you suggested if you want; however, I would suggest you back up your data before doing the balance and defrag.
  • Hi. I wanted to give an update on my situation. I'm still waiting for my off-site backup to complete before doing anything irreversible like a factory reset, but in the meantime, I took a look at my backups. I noticed that there was a directory being backed up that contained a pretty epic number of small files (about 14,000) that changed around a lot. I thought that such a large number of small files (and, because of the way the backup works, so many hard links) might be causing a metadata issue on the filesystem, so I excluded that directory tree from my backup.

    The ReadyNAS was problem-free for twelve days before the next incident. So while my problem isn't solved, it's looking more and more to me like the problem is something specific to what happens during the backup.

    Since turning off snapshots didn't seem to help, my next plan is to switch to using btrfs snapshots for my backups instead of rsnapshot's method of hard links and see what happens.

    If I haven't figured out a proper solution by the time the offsite backup finishes, I still plan to just wipe the thing and start over, but obviously I'd rather figure out why this is happening and have some assurance that it won't just start happening again after I reset.
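    The snapshot-based scheme I'm planning looks roughly like this (a sketch with illustrative paths; it assumes the backup share is a btrfs subvolume, and is guarded so it's safe to run anywhere):

```shell
# Sketch of a btrfs-snapshot backup rotation (illustrative paths; assumes
# $DEST/current is a btrfs subvolume). Replaces rsnapshot's cp -al trees.
DEST=/data/backup
if command -v btrfs >/dev/null 2>&1 && btrfs subvolume show "$DEST/current" >/dev/null 2>&1; then
    rsync -a --delete /home/ "$DEST/current/"   # sync into the live subvolume
    # One cheap read-only snapshot per day instead of a hard-linked tree:
    btrfs subvolume snapshot -r "$DEST/current" "$DEST/daily-$(date +%F)"
    # Expiring a backup becomes a single subvolume delete, not a huge rm -rf:
    # btrfs subvolume delete "$DEST/daily-<oldest>"
else
    echo "$DEST/current is not a btrfs subvolume on this host; skipping"
fi
```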
