
Forum Discussion

EKroboter
Apprentice
Oct 19, 2015

6.4 makes our 516 lock up during disk balance, need to abort. URGENT!

The 6.4 firmware update continues to screw up everything; it's becoming the worst update ever from Netgear.

During a disk balance task, our 516 completely locks up. No frontview access, no SSH, no ping responses. Nothing.

 

After manually restarting the device, the disk balance starts all over again and eventually locks up the system (sometimes at 40%, sometimes at 62%; it's completely random).

 

I have no way to cancel this job from frontview. I have ssh access but I need further instructions.

10 Replies

Replies have been turned off for this discussion
  • Same here, but as it is not a production unit I reduced the load from other services and it eventually finished syncing.

     

    But a few hours later it locked up (because I copied a file from one share to another), I had to force a power shutdown, and guess what... the sync began again!!

    • EKroboter
      Apprentice

      God. Looks like someone is gonna get fired for letting the 6.4 firmware out. It's been nothing but headaches.

      • EKroboter
        Apprentice

        After canceling the Balance job, performance and responsiveness are back to normal. I have deleted the scheduled defrag, scrub and balance jobs as a precaution.
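
        For anyone else stuck at the same point, the cancel looks roughly like this over SSH (assuming the data volume is mounted at /data, as on a stock ReadyNAS; the cancel can take a few minutes to actually let go):

        # check whether a balance is really running
        btrfs balance status /data

        # ask btrfs to stop it, then re-check until it reports no balance
        btrfs balance cancel /data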

  • Ugh... My 516 has been similarly afflicted. This is inexcusable.

     

    Happened during the day today, but I'm not sure this was a scheduled balance job, as I received no alert email indicating that such a job was starting. It's been running 6.4 for a few days already, and I recall watching the initial quota updates running just fine. And I happen to know it WAS running a balance because I tend to keep top running in a shell at home.

     

    top - 09:21:38 up 5 days, 15:38,  2 users,  load average: 2.21, 2.27, 1.82
    Tasks: 225 total,   6 running, 219 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.3 us, 11.4 sy,  0.0 ni, 61.3 id, 27.0 wa,  0.0 hi,  0.2 si,  0.0 st
    KiB Mem:  16324816 total, 15487560 used,   837256 free,      312 buffers
    KiB Swap:  2093052 total,       16 used,  2093036 free, 13655956 cached
    packet_write_wait: Connection to 192.168.23.16: Broken pipe
     R  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
     9  -1 15572 1020  880 R  41.9  0.0   1:55.63 btrfs balance start -dusage 79 -musage 79 /data
     0   0  226m 9888 7052 D   1.3  0.1   0:03.95 /usr/sbin/afpd -d -F /etc/netatalk/afp.conf
     0   0 3122m 328m  22m S   1.0  2.1 170:47.62 /apps/dvblink-tv-server/dvblink_server

    So it appears to have lost network connection around 9:20 this morning. :( Weird thing is that none of this looks like the box was under any extreme load. The I/O wait time and CPU usage numbers look fine. So this smells like a hang or crash. (sigh)

     

    I'm on my second restart now; the first was a "soft" shutdown from the front panel, but this time I did a full power cycle.

     

    The really frustrating thing is that I KNOW how to kill the balance, but I can't get into the ssh interface to save my life. The box pings after startup but the front panel stays on "Booting...". The control pad lights up twice in the process and the disk lights are blinking, so I know it's up to something at least.

     

    But with other services fighting the balance as they try to start up, I don't think sshd is getting its turn. After 10-15 minutes the box stops pinging and I see no appreciable action on the disk activity lights.

     

    Since I can't ssh in to shut this crap down, will it actually finish once it gets into this state, or is it actually hanging? If I do a FP OS reinstall, will that prevent the balance from resuming at restart? Knowing btrfs from using it on Fedora for a while, I suspect not. But this is lunacy.

     

    Do we all need to start opening support tickets for this, or is someone already on it and adding to the pile won't help?

    • btaroli
      Prodigy

      OK. Managed to get in. Tried just killing the balance, but it hung almost instantly (and stopped pinging). Rebooted again, waited for the email alert about volume usage over 70% -- never thought I'd find that useful -- then first stopped all installed apps, waited a bit, and THEN killed the balance; after a few minutes it finally gave up and stopped (rough commands at the end of this post). Balance and defrag jobs have been disabled.

       

      I now understand why it died around 9:20, too... The balance job kicked off at 9 AM (since I'm never home then on weekdays), but then my MacBook began a Time Machine backup at 9:20. The combination of the two seems to have been what fried it. I can accept /slow/ performance during a BTRFS job, but outright hanging the kernel is a bit much. I'll leave those btrfs jobs disabled for now and hope for a fix in the near future.
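
      In case it helps anyone following along, this is roughly the sequence I used (the grep names are just the processes from my own top output above; /data is the standard ReadyNAS data volume):

      # see what is competing with the balance for I/O before touching it
      top -b -n 1 | grep -E 'btrfs|afpd|dvblink'

      # with the apps stopped, cancel the balance and poll until it lets go
      btrfs balance cancel /data
      btrfs balance status /data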

      • EKroboter
        Apprentice

        Sounds like we both experienced the same issue. I was lucky enough to be able to cancel the balance on the first try.

        I also disabled the scheduled jobs and snapshot creation. These two mundane tasks used to be trivial and lightweight for the NAS, never ever hanging the unit. The only thing I complained about before 6.4 was that performance dropped a bit during backup jobs (which is understandable), but now I celebrate if the unit does not freeze for eight hours straight.

         

        My advice to everyone is to enable ssh access immediately. I don't know how I could have fixed it without it.
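
        If it helps, these are the sort of read-only checks to run first once you're in over ssh (they change nothing on the volume; /data is the stock volume path):

        btrfs balance status /data   # is a balance still running?
        btrfs scrub status /data     # is a scrub still running?
        btrfs fi df /data            # how full the data/metadata chunks actually are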

  • So far I have only disabled the balance and defrag jobs. Having run btrfs on my own Fedora workstation for quite some time, I've learned that these jobs can introduce a special kind of load on the filesystem that can at times render it nearly unresponsive. What worries me a little is the added task of quota in this release. Even with the very latest kernel and btrfs-progs builds, I found that enabling quota can cause serious problems, and I ultimately decided having it enabled just wasn't worth the headache. We don't have that option here -- and I do actually like being able to see how much space the snapshots actually consume -- so we'll see how things go from here.

    Based on my own experience, I'm leaving snapshots enabled. Indeed, even after having all the apps started back up and with a Time Machine backup job running, I noticed btrfs-cleaner kick off (to clean up after scheduled removal of snapshots), and while CPU and I/O wait times predictably increased, the system remained operational. So I'm not too worried about snapshots.

    Given the size of our /data volumes I think we can live without the balance and defrag jobs for a while. I tend to run them monthly anyway. But I'd rather not have my NAS crash each month. ;) I do tend to keep an eye on "btrfs fi sh /", as I've found that there is something on my NAS -- could be an app -- that causes its extents to get fully allocated even though the filesystem usage is quite normal. This can cause headaches when doing installs or updates of apps, so I check it every few weeks and do a manual balance on / to clean it up (rough commands at the end of this post). I did one such balance just last night after getting the NAS back up and it ran just fine. Makes me wonder if this new behavior is the result of enabling quota on /data.

    I quite agree about enabling ssh access, and I have done so since the old RAIDiator days. But it's not necessarily for everyone, and there is a mode the NAS can be started up in that enables Netgear to access it remotely via ssh for support purposes.
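
    The manual check and cleanup I described above look roughly like this (the usage threshold is just my own habit, not a Netgear recommendation):

    # how much of each disk is allocated to chunks vs. actually used
    btrfs fi show /
    btrfs fi df /

    # reclaim mostly-empty chunks without rewriting everything
    btrfs balance start -dusage=60 /

    # and, since quota handling is new in 6.4, peek at what the qgroups report
    btrfs qgroup show /data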
    • EKroboter
      Apprentice

      I haven't enabled disk quotas for our shares, and after all the issues I came across after the update I doubt I ever will. 

      I only updated for the option to see just how much space was being consumed by snapshots, which to my surprise was a lot. I had 3.5TB worth of snapshots and only 1.7TB of actual data, so that helped me clear quite a bit of space. Also, the shares now show how much space they're consuming, which is also nice.

       

      Clearing the scheduled balance job brought performance back to normal, and disabling snapshots has actually improved it a bit. It is now more responsive than before; at least frontview and file browsing are.

       

      I don't have many apps running apart from Anti Virus, just the SMB prefs panel and ReadyNAS Surveillance recording from 9 IP cameras (total bandwidth for all of them is 12,030 Kbps), so that should never put a load on the CPU. My guess is that someone screwed up the implementation of the Balance, Quota and Sync features.

       

      I'm not a Linux expert, but I don't mind learning and using the terminal to check up on things. The web UI can only get you so far.
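
      For what it's worth, the couple of commands I've picked up so far (nothing fancy; /data is just the default volume path):

      top                 # load average, and whether anything is stuck waiting on disk
      btrfs fi df /data   # how full the data volume actually is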

      • obaeyens
        Aspirant

        On the 104, I have no issues with OS 6.4.0; so far it works fine. This time with anti-virus active but bit rot protection disabled.

         

        The 204 has so far had one hang (I had to pull the power plug), and I think it was during a defrag. (Anti-virus active but no bit-rot.)

         

        And this morning I found it blinking. It did not want to power down, so I again had to pull the plug. When it restarted I saw that it was in a degraded state and had to resync. It resynced and all drives operate normally. So whatever it was, it definitely is not the hardware.

         

        I have the impression that the issue happens after you transfer lots of data to that drive. No hangs when no large amount of data has been added, removed, or moved.
