
Forum Discussion

rraszews
Aspirant
Jan 04, 2015

ReadyNAS 314 nfs issue

Hi,

I've been pretty happy with my ReadyNAS since I got it a year ago, but in the past two weeks, everything suddenly started going up the spout.

I got my RN314 preinstalled with four Seagate 3 TB Barracuda drives. Two weeks ago, one of the hard drives failed. Okay, I figured, that happens. I was a little worried because I didn't have a spare on hand, so if another drive had failed before a replacement could be shipped, that would have been bad, but the array held out, I was able to rebuild it, and all was well again.

I don't normally check Frontview all that often, so it was only when I had to replace the drive that I saw the OS 6.2.0 upgrade had come out. Once the rebuild was complete, I installed it. All went well for about a week, and then one morning at about 4 AM, the box just stopped responding to NFS entirely. Nothing in Frontview indicated anything was wrong, and toggling the NFS service off and on didn't help. Most of the computers on my network that relied on the ReadyNAS were pretty much unusable, since they'd block waiting on I/O whenever anything touched the mountpoint. Rebooting solved the problem, and all was well for about another week.

Then it happened again, at 4 AM. This time I pulled all the logs. Again, rebooting solved the problem. At this point I had a sneaking suspicion that the issue was being triggered by my scheduled backup, which runs at 4 AM: I back up all my computers via rsnapshot once a day. I'd set up this backup job about a month earlier, and sure enough, both days the failure occurred were days when it was aging off an old weekly backup. It had worked fine doing the daily backups and aging off old daily backups, but when it tried to age off the oldest weekly backup, something bad happened in the kernel:

BUG: unable to handle kernel paging request at 00000000265fe124
IP: [<ffffffff880f0498>] __d_lookup+0x78/0x140
PGD 69ac1067 PUD 69ac3067 PMD 0
Oops: 0000 [#1] SMP
CPU 1
Modules linked in: ir_rc6_decoder ir_rc5_decoder ir_nec_decoder ite_cir rc_core rn_gpio vpd(P)
Pid: 2827, comm: nfsd Tainted: P 3.0.101.RNx86_64.3 #1 NETGEAR ReadyNAS 314 /ReadyNAS 314



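For anyone unfamiliar with rsnapshot, the rotation it performs at 4 AM boils down to something like this (a minimal sketch with throwaway paths, not rsnapshot's actual code): delete the oldest interval directory, shift the rest down, and make a hard-linked copy of the newest one. Note that expiring the oldest weekly backup is a recursive delete of a huge hard-linked tree, which is an extremely unlink-heavy workload.

```shell
# Minimal sketch of an rsnapshot-style rotation (illustrative paths, not
# rsnapshot's real code). Uses a throwaway directory so it is safe to run.
set -e
ROOT=$(mktemp -d)
mkdir -p "$ROOT/weekly.0" "$ROOT/weekly.1"
echo "data" > "$ROOT/weekly.0/file"

rm -rf "$ROOT/weekly.1"                   # age off the oldest weekly backup
mv "$ROOT/weekly.0" "$ROOT/weekly.1"      # shift the remaining one down
cp -al "$ROOT/weekly.1" "$ROOT/weekly.0"  # hard-linked copy, not a data copy

# Unchanged files in both trees share a single inode:
ls -i "$ROOT/weekly.0/file" "$ROOT/weekly.1/file"
rm -rf "$ROOT"
```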
To troubleshoot the NFS issue, I tried disabling my weekly backup rotation and just doing the daily one. All went well the next morning, but the day after that, I tried to delete the old backup manually from a root shell on the ReadyNAS. This time NFS didn't go down, but I did get a log message largely identical to the one above, except that "comm: nfsd" was replaced by "comm: rm". After rebooting, I installed the 6.2.2 firmware.

Now this morning, it happened again at 4 AM during the backup, but this time not while anything was being aged off. I do notice one difference in the logged kernel error: the "BUG: unable to handle kernel paging request" line is replaced by "general protection fault: 0000 [#2] SMP CPU 3". However, from my logs it looks like NFS ALSO stopped responding for a few minutes around midnight, but then recovered. I had been running a scrub overnight, which completed with no problems found.

Bit-rot protection is turned on. SMB, AFP, NFS, ReadyDLNA, Rsync, UPnP, HTTP, HTTPS, SNMP and SSH are turned on; Antivirus, FTP and iTunes are turned off. I'm also running the SoftEther VPN server app. I did have one incident where SNMP stopped responding for a few minutes, in case that's relevant.

Does anyone know what I can do to get my system stable again? I know there are a lot of moving parts here: the backup job (could having a very large number of small files and lots of hard links be exposing a bug?); the new hard drive (probably not directly to blame, but maybe pulling the drive jiggled the RAM?); and the OS upgrade. If this sounds like a hardware issue, what can I do to keep the system working while I come up with somewhere to hold my data and get it repaired or replaced? If it's some kind of bug or limitation, how can I work around it? I'm going to move my backups to a different server, but I'm very worried that the issue is progressive and that moving the backup is just delaying the inevitable.

Thanks.

9 Replies

  • mdgm-ntgr
    NETGEAR Employee Retired
    Can you update to 6.2.2 and then try lowering the NFS thread count?

    Which share are you using with NFS?

    Do you have a backup of data primarily stored on the NAS?

    Welcome to the forum!
  • I think you and I are having this same conversation on Twitter, but for the sake of posterity:

I've lowered the NFS thread count to 2. I'm using a custom share with snapshots turned off and bit-rot protection turned on (at the time I set it up, I didn't know what bit-rot protection was; I have no objection to turning it off if that helps). I've just created a new share without bit-rot protection or snapshots and will try using that as my backup destination tonight to see what happens.

Since I'm using about 4.2 TB of space on the NAS, local backups aren't really a practical option. When I started having trouble, I set up an offsite cloud backup job, but since I'm limited by residential FiOS upload speeds, it'll take a few weeks to catch up. (My most critical data is already backed up, but I hadn't pulled the trigger on backing up my digital media because of the sheer amount of it.)
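    For anyone following along: on a stock Linux NFS server the thread count is exposed through the standard knfsd interface, so the change above can be made from a root SSH shell roughly like this (I'm assuming OS 6 uses the usual /proc path; the snippet is guarded so it's safe to run on a machine where nfsd isn't loaded):

```shell
# Check (and, as root, optionally lower) the kernel NFS server thread count.
# Assumes the standard Linux knfsd /proc interface.
THREADS=/proc/fs/nfsd/threads
if [ -r "$THREADS" ]; then
    echo "current nfsd threads: $(cat "$THREADS")"
    # echo 2 > "$THREADS"   # as root: lower to 2 without restarting nfsd
else
    echo "nfsd is not running on this host"
fi
```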
  • Quick update: I haven't had any issues since I switched my backup to the new share, but obviously it will take a few weeks of backups before I've recreated the situation that I had when the problem started (since it seemed like it only started when it was aging off the older backups), so I'll keep monitoring.

    However, I usually do a poll via snmp every half hour to check on the system status and health. Twice now, I've gotten failures pulling READYNASOS-MIB::temperatureValue.2. The vast majority of the time, it works fine. It's probably nothing, but I thought I'd mention it in case it was connected.
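    For reference, the poll is essentially a single snmpget against that OID (the hostname and community string below are illustrative; this assumes Net-SNMP's snmpget with the READYNASOS MIB file installed):

```shell
# Poll one temperature sensor on the NAS via SNMP. Hostname and community
# string are illustrative assumptions; requires Net-SNMP's snmpget and the
# READYNASOS MIB. Guarded so it is safe to run where snmpget is absent.
NAS=readynas.local
if command -v snmpget >/dev/null 2>&1; then
    snmpget -v2c -c public "$NAS" READYNASOS-MIB::temperatureValue.2
else
    echo "snmpget not installed on this host"
fi
```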
  • Oh dear. The problem recurred this morning, with the same message as before. It happened about two minutes into a large rsync onto the new share that doesn't have bit-rot protection. Since lots of computers on my network use the ReadyNAS, I can't say for sure whether anything else was going on at the same time, but there shouldn't have been any other scheduled tasks occurring then.

    I'm going to reduce the amount of backing up I do for a while and see if there's a particular usage pattern that triggers the issue. Obviously, that's not a good long-term solution, but if I can't directly fix the problem, I at least have to have a workaround while I decide on a replacement.
  • To complicate the matter, I just got another kernel oops while deleting snapshots:


    general protection fault: 0000 [#1] SMP
    CPU 2
    Modules linked in: ir_rc6_decoder ir_rc5_decoder ir_nec_decoder ite_cir rc_core rn_gpio vpd(P)
    Pid: 12863, comm: snapperd Tainted: P 3.0.101.RNx86_64.3 #1 NETGEAR ReadyNAS 314 /ReadyNAS 314
    (...)
    Call Trace:
    [<ffffffff880e3fd8>] ? getname_flags+0x38/0x220
    [<ffffffff880f0594>] d_lookup+0x34/0x60
    [<ffffffff880e4434>] __lookup_hash+0x94/0x1a0
    [<ffffffff880e7005>] ? user_path_parent+0x55/0x90
    [<ffffffff880e4554>] lookup_hash+0x14/0x20
    [<ffffffff880e71ff>] do_unlinkat+0xaf/0x230
    [<ffffffff880dae14>] ? fput+0x164/0x210
    [<ffffffff880d6ee5>] ? filp_close+0x65/0xa0
    [<ffffffff880e8986>] sys_unlinkat+0x16/0x40
    [<ffffffff8888c1fb>] system_call_fastpath+0x16/0x1b



I notice the call trace for this one is a bit different: usually the top of the stack is __d_lookup, but this time it's getname_flags.

After this, Frontview wouldn't reload until I rebooted.
  • mdgm-ntgr
    NETGEAR Employee Retired
I think at this point it would be worthwhile for you to back up your data, do a factory reset (which wipes all data, settings, everything), and restore your data from backup.
  • I was afraid of that. While I work on getting a big enough backup solution together, is it liable to make a difference if I run a balance or defrag on the volume?

    Also, I saw somewhere on the forums that it's possible to run a test of the physical memory. Is there enough chance of that being the issue to make it worth trying?

    Thanks.
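    For reference, the balance and defrag mentioned above are plain btrfs operations run from a root SSH shell (the /data mount point below is an assumption about where the OS 6 volume lives; the snippet is guarded so it only acts on an actual btrfs volume):

```shell
# Btrfs maintenance operations discussed above. /data is an assumed mount
# point; guarded so this is a no-op on hosts without a btrfs volume there.
VOL=/data
if command -v btrfs >/dev/null 2>&1 && btrfs filesystem df "$VOL" >/dev/null 2>&1; then
    btrfs filesystem defragment -r "$VOL"   # rewrite fragmented file extents
    btrfs balance start "$VOL"              # rewrite and redistribute block groups
else
    echo "$VOL is not a btrfs volume on this host; skipping"
fi
```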
  • mdgm-ntgr
    NETGEAR Employee Retired
You could try the other things you suggested if you want; however, I would suggest you back up your data before doing the balance and defrag.
  • Hi. I wanted to give an update on my situation. I'm still waiting for my off-site backup to complete before doing anything irreversible like a factory reset, but in the meantime, I took a look at my backups. I noticed that there was a directory being backed up that contained a pretty epic number of small files (about 14,000) that changed around a lot. I thought that such a large number of small files (and, because of the way the backup works, so many hard links) might be causing a metadata issue on the filesystem, so I excluded that directory tree from my backup.

    The ReadyNAS was problem-free for twelve days before the next incident. So while my problem isn't solved, it's looking more and more to me like the problem is something specific to what happens during the backup.

    Since turning off snapshots didn't seem to help, my next plan is to switch to using btrfs snapshots for my backups instead of rsnapshot's method of hard links and see what happens.

    If I haven't figured out a proper solution by the time the offsite backup finishes, I still plan to just wipe the thing and start over, but obviously I'd rather figure out why this is happening and have some assurance that it won't just start happening again after I reset.
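    The snapshot-based scheme I'm planning looks roughly like this (a sketch with illustrative paths; it assumes the backup share is a btrfs subvolume, and is guarded so it's safe to run anywhere):

```shell
# Sketch of a btrfs-snapshot backup rotation (illustrative paths; assumes
# $DEST/current is a btrfs subvolume). Replaces rsnapshot's cp -al trees.
DEST=/data/backup
if command -v btrfs >/dev/null 2>&1 && btrfs subvolume show "$DEST/current" >/dev/null 2>&1; then
    rsync -a --delete /home/ "$DEST/current/"   # sync into the live subvolume
    # One cheap read-only snapshot per day instead of a hard-linked tree:
    btrfs subvolume snapshot -r "$DEST/current" "$DEST/daily-$(date +%F)"
    # Expiring a backup becomes a single subvolume delete, not a huge rm -rf:
    # btrfs subvolume delete "$DEST/daily-<oldest>"
else
    echo "$DEST/current is not a btrfs subvolume on this host; skipping"
fi
```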
