Re: Ultra 2 randomly becomes unavailable (Case #20265076)

blumman · ‎2013-01-03

Hi,

I'm not sure if this is the right section for this post, so forgive me if I posted it in the wrong place.

I have an annoying problem with one of my ReadyNAS boxes. It randomly drops off the network and requires a reboot to come back online. It has done that since I bought it in July last year. The OS does not lock up, but the NIC (eth0) seems to stop responding after a varying number of days. I know that the OS is not the issue because if I move the Ethernet cable to LAN2, the box gets a DHCP IP and comes back online again. I'm using this box for off-site replication and have now picked it up for further troubleshooting locally.

Here's some basic info about the NAS:

Model: ReadyNAS Ultra 2 [X-RAID2], RNDU2000
Firmware: RAIDiator 4.2.22
Memory: 1024 MB [DDR3]
HDD: Seagate ST2000DL003-9VT166 1863 GB
Installed Add-Ons: EnableRsyncSsh, Htop, Istatd & ReadyNAS Remote.

The NAS is connected to a DIR-615 (10/100 Mbit/s) router running DD-WRT with a 10/10 Mbit/s connection. It's monitored by Cacti via SNMP.

The first time I noticed it was when the box was running on DHCP. It had received an address from a DHCP scope that isn't defined on the network. I had seen something similar earlier with my ReadyNAS Ultra 4 Plus: during start-up it would receive an address from a scope other than the one I have defined on the network, causing the box to go offline (note that I only have one DHCP server). I resolved this by giving the Ultra 4 Plus a static IP. I tried the same approach with the Ultra 2, but its issue is different: it goes offline randomly, whereas the Ultra 4 Plus would only go offline during a reboot.

So far I have tried resetting the NAS to factory defaults and giving it a static IP, neither of which helped. I have "rooted" the box to be able to read the logs. The last time the box reported back to Cacti was at 07:00 AM on January 3rd. The syslog still logs SNMP connections from the Cacti server for another 1 hr 20 min, then goes silent for 1 hr 50 min before this happens:


Jan  3 10:08:08 REPLICATE kernel:  [<ffffffff8807c745>] ? unlock_page+0x25/0x30
Jan  3 10:08:08 REPLICATE kernel:  [<ffffffff880930b9>] ? __do_fault+0x369/0x490
Jan  3 10:08:08 REPLICATE kernel:  [<ffffffff88094f8c>] ? handle_mm_fault+0x1ac/0x840
Jan  3 10:08:08 REPLICATE kernel:  [<ffffffff884b8525>] ? sockfd_lookup_light+0x45/0x80
Jan  3 10:08:08 REPLICATE kernel:  [<ffffffff88020200>] ? do_page_fault+0x200/0x420
Jan  3 10:08:08 REPLICATE kernel:  [<ffffffff880390d4>] ? timespec_add_safe+0x34/0x70
Jan  3 10:08:08 REPLICATE kernel:  [<ffffffff880bdb89>] ? poll_select_set_timeout+0x79/0x90
Jan  3 10:08:08 REPLICATE kernel:  [<ffffffff880be3e7>] sys_poll+0x77/0xf0
Jan  3 10:08:08 REPLICATE kernel:  [<ffffffff88024c53>] ia32_sysret+0x0/0x5
Jan  3 10:08:08 REPLICATE kernel: Code: 49 83 c4 0c 4d 8d 3c c4 4d 39 fc 0f 85 8a 00 00 00 e9 a2 00 00 00 48 8d 75 cc e8 9f 1a ff ff 48 85 c0 49 89 c5 0f 84 23 01 00 00 <48> 8b 40 20 48 85 c0 0f 84 06 01 00 00 48 83 78 38 00 0f 84 fb
Jan  3 10:08:08 REPLICATE kernel: RIP  [<ffffffff880be10d>] do_sys_poll+0x1ed/0x450
Jan  3 10:08:08 REPLICATE kernel:  RSP <ffff8800395e3b48>
Jan  3 10:08:08 REPLICATE kernel: ---[ end trace fc544a38a5c2517e ]---
Jan  3 10:08:11 REPLICATE kernel: general protection fault: 0000 [#2] SMP
Jan  3 10:08:11 REPLICATE kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/removable
Jan  3 10:08:11 REPLICATE kernel: CPU 0
Jan  3 10:08:11 REPLICATE kernel: Modules linked in: pvgpio nv6vpd(P)
Jan  3 10:08:11 REPLICATE kernel:
Jan  3 10:08:11 REPLICATE kernel: Pid: 26939, comm: exim Tainted: P      D     2.6.37.6.RNx86_64.2.4 #1 NETGEAR ReadyNAS/
Jan  3 10:08:11 REPLICATE kernel: RIP: 0010:[<ffffffff880ac8c7>]  [<ffffffff880ac8c7>] filp_close+0x17/0x90
Jan  3 10:08:11 REPLICATE kernel: RSP: 0000:ffff8800395e38f8  EFLAGS: 00010282
Jan  3 10:08:11 REPLICATE kernel: RAX: ffff8800015aa808 RBX: af1c28cdc1b721a0 RCX: 0000000000000063
Jan  3 10:08:11 REPLICATE kernel: RDX: 0000000000000000 RSI: ffff88003958c2c0 RDI: af1c28cdc1b721a0
Jan  3 10:08:11 REPLICATE kernel: RBP: ffff8800395e3918 R08: ffff8800395e2000 R09: 0000000000000000
Jan  3 10:08:11 REPLICATE kernel: R10: 0000000000000010 R11: 0000000000000000 R12: ffff88003958c2c0
Jan  3 10:08:11 REPLICATE kernel: R13: 0000000000000000 R14: ffff880038e729c0 R15: 0000000000000008
Jan  3 10:08:11 REPLICATE kernel: FS:  0000000000000000(0000) GS:ffff88003fa00000(0000) knlGS:0000000000000000
Jan  3 10:08:11 REPLICATE kernel: CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
Jan  3 10:08:11 REPLICATE kernel: CR2: 00000000f7110000 CR3: 0000000038f56000 CR4: 00000000000006f0
Jan  3 10:08:11 REPLICATE kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan  3 10:08:11 REPLICATE kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan  3 10:08:11 REPLICATE kernel: Process exim (pid: 26939, threadinfo ffff8800395e2000, task ffff88003e12d550)
Jan  3 10:08:11 REPLICATE kernel: Stack:
Jan  3 10:08:11 REPLICATE kernel:  ffff880039bec700 000000000000000f ffff88003958c2c0 0000000000000000
Jan  3 10:08:11 REPLICATE kernel:  ffff8800395e3958 ffffffff88036843 0000000000000000 ffff88003e12d550
Jan  3 10:08:11 REPLICATE kernel:  ffff88003958c2c0 ffff88003e12d550 000000000000012c ffff880038e78f00
Jan  3 10:08:11 REPLICATE kernel: Call Trace:
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff88036843>] put_files_struct+0xc3/0xd0
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff88036895>] exit_files+0x45/0x50
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff88037a20>] do_exit+0x190/0x770
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff88002bce>] ? apic_timer_interrupt+0xe/0x20
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff88035fd0>] ? kmsg_dump+0x110/0x160
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff88006506>] oops_end+0xa6/0xb0
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff88006606>] die+0x56/0x90
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff880040f2>] do_general_protection+0x152/0x160
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff885b48ef>] general_protection+0x1f/0x30
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff880be10d>] ? do_sys_poll+0x1ed/0x450
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff880be101>] ? do_sys_poll+0x1e1/0x450
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff880bdcc0>] ? __pollwait+0x0/0x100
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff880bddc0>] ? pollwake+0x0/0x70
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff885197c4>] ? inet_sendmsg+0x84/0xc0
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff884b84b3>] ? sock_sendmsg+0xe3/0x110
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff882bc53e>] ? radix_tree_lookup_slot+0xe/0x10
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff8807c745>] ? unlock_page+0x25/0x30
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff880930b9>] ? __do_fault+0x369/0x490
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff88094f8c>] ? handle_mm_fault+0x1ac/0x840
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff884b8525>] ? sockfd_lookup_light+0x45/0x80
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff88020200>] ? do_page_fault+0x200/0x420
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff880390d4>] ? timespec_add_safe+0x34/0x70
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff880bdb89>] ? poll_select_set_timeout+0x79/0x90
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff880be3e7>] sys_poll+0x77/0xf0
Jan  3 10:08:11 REPLICATE kernel:  [<ffffffff88024c53>] ia32_sysret+0x0/0x5
Jan  3 10:08:11 REPLICATE kernel: Code: 83 80 00 00 00 48 8b 1c 24 4c 8b 64 24 08 c9 c3 66 66 66 90 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 fb 4c 89 6d f8 <48> 8b 47 30 49 89 f4 48 85 c0 74 52 48 8b 47 20 48 85 c0 74 44
Jan  3 10:08:11 REPLICATE kernel: RIP  [<ffffffff880ac8c7>] filp_close+0x17/0x90
Jan  3 10:08:11 REPLICATE kernel:  RSP <ffff8800395e38f8>
Jan  3 10:08:11 REPLICATE kernel: ---[ end trace fc544a38a5c2517f ]---
Jan  3 10:08:11 REPLICATE kernel: Fixing recursive fault but reboot is needed!

Right after this, the syslog reports SNMP connections from the Cacti server again, but nothing is graphed on the Cacti server. Maybe that is because the requests are now irregular and arrive at 1-2 per second instead of 20 every five minutes as they did before the "crash". Is this some kind of SNMP DDoS attack from my Cacti server?
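
To put numbers on that change in request rate, the SNMP entries in the syslog could be bucketed per minute. The following is only a rough Python sketch: it assumes the classic syslog timestamp prefix seen in the trace above and that SNMP entries can be recognised by the substring "snmpd" in the message. The function name and the marker are made up for illustration and are not taken from the ReadyNAS firmware.

```python
import re
import sys
from collections import Counter

# Classic syslog prefix, e.g. "Jan  3 10:08:11 REPLICATE kernel: CPU 0"
SYSLOG_PREFIX = re.compile(
    r"^(\w{3})\s+(\d{1,2}) (\d{2}):(\d{2}):\d{2} \S+ (.*)$"
)


def requests_per_minute(lines, marker="snmpd"):
    """Count log lines whose message contains `marker`, grouped by minute.

    `marker` is an assumption: adjust it to whatever the SNMP daemon
    actually writes on the box.
    """
    counts = Counter()
    for line in lines:
        match = SYSLOG_PREFIX.match(line)
        if match is None:
            continue  # wrapped or malformed line, e.g. a split "Code:" dump
        month, day, hour, minute, message = match.groups()
        if marker in message:
            counts[f"{month} {int(day)} {hour}:{minute}"] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python snmp_rate.py /var/log/syslog
    with open(sys.argv[1], errors="replace") as handle:
        # Keys are printed in the order they first appear in the log.
        for minute, count in requests_per_minute(handle).items():
            print(minute, count)
```

Comparing the per-minute counts before 07:00 with those after 10:08 would show whether the Cacti server really started polling more often or whether the log is just less regular.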

It is worth mentioning that the Ultra 4 Plus does not have this issue.

I'd appreciate any help or pointers so that I can get this issue fixed.

chirpa · ‎2013-01-03

I thought that issue was resolved back around 4.2.17, guess you have some corner case that can still encounter it. You will want to create a support case, make sure it gets to L3.

blumman · ‎2013-01-03

chirpa wrote:
I thought that issue was resolved back around 4.2.17, guess you have some corner case that can still encounter it. You will want to create a support case, make sure it gets to L3.

A ticket has been created, let's see what happens. 🙂

Thomas_xh · ‎2013-01-09

Since I do not have the full syslog, I have the following questions and suggestions:
1. If the NAS is connected to a 100M/Full port on the switch, please change it to a 1000M port; some 100M/Full connections can make the system unstable.

2. Please run the memory test for this unit.

3. You can send me a PM and attach /var/log/syslog and /var/log/kern.log.
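
For point 1, one way to confirm the speed the NIC actually negotiated is to read it from sysfs. This is a minimal Python sketch under some assumptions: that the kernel on the NAS exposes /sys/class/net/eth0/speed (mainline Linux kernels of that generation do, but this has not been checked on RAIDiator) and that Python is available on the box. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def parse_speed(raw: str) -> Optional[int]:
    """Return the link speed in Mbit/s, or None if it is unknown.

    The kernel writes a plain integer; a non-positive value such as -1
    means the speed could not be determined.
    """
    try:
        speed = int(raw.strip())
    except ValueError:
        return None
    return speed if speed > 0 else None


def read_link_speed(iface: str = "eth0",
                    sysfs_root: str = "/sys/class/net") -> Optional[int]:
    """Read the negotiated speed for `iface`; None if it cannot be read."""
    try:
        return parse_speed((Path(sysfs_root) / iface / "speed").read_text())
    except OSError:
        # Missing interface, or the read failed because the link is down.
        return None


if __name__ == "__main__":
    speed = read_link_speed("eth0")
    if speed is None:
        print("eth0: link speed unknown")
    elif speed < 1000:
        print(f"eth0: {speed} Mbit/s (not a gigabit link)")
    else:
        print(f"eth0: {speed} Mbit/s")
```

If this reports 100 while the NAS is on the DIR-615, the unit is running on the kind of 100M link described in point 1.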

blumman · ‎2013-01-10

Hi,

One of your colleagues was provided with the full system log via e-mail. I have now attached it to the case.

1. That needs an additional investment from my side and that is something I will consider if there is no other solution. However, the NAS is currently connected to a 10/100/1000 switch as it isn't at its original location at the moment.

2.

memtester version 4.1.3 (32-bit)
Copyright (C) 2010 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffff000
want 800MB (838860800 bytes)
got  800MB (838860800 bytes), trying mlock ...locked.
Loop 1/5:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok

Loop 2/5:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok

Loop 3/5:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok

Loop 4/5:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok

Loop 5/5:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok

Done.

3. See my first sentence.

If you need something else just let me know.

Thanks

toreric · ‎2013-02-13

I have had the same problem with an Ultra 4 for the past 2 months (it is 5 months old). I thought it was a cable issue and switched to a better cable, but now it has happened again! That is, a random NIC lockout (often at long intervals) that makes the NAS invisible and unavailable on the network. The only cure is a manual reboot. It seems very similar, maybe identical, to your Ultra 2 issue.

Did you find a solution?

blumman · ‎2013-02-13

Hi toreric,

Yes, the solution, or should I say workaround, that was suggested by Netgear's support team seems to have resolved my problem. Their suggestion was to upgrade the 10/100 router to a 10/100/1000 router. I now have an uptime of 14 days so far.

toreric · ‎2013-02-13

Thanks for that. I hope it lasts for months and years. Please post a comment here as soon as you get another NIC lockout, God forbid ...
