
Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

EKroboter
Apprentice

6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

The 6.4 firmware update continues to screw up everything; it's becoming the worst update ever from Netgear.

During a disk balance task, our 516 completely locks up. No frontview access, no SSH, no ping responses. Nothing.

 

After manually restarting the device, the disk balance starts all over again and eventually locks up the system (sometimes at 40%, other times at 62%; it's completely random).

 

I have no way to cancel this job from frontview. I have ssh access but I need further instructions.

Message 1 of 11

Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

Same here, but as it is not a production unit I reduced the load from other services and it eventually finished syncing.

 

But a few hours later it locked up (because I copied a file from one share to another) and I had to power-shutdown and guess what... the sync began again!!

Message 2 of 11
EKroboter
Apprentice

Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

God. Looks like someone is gonna get fired for letting the 6.4 firmware out. It's been nothing but headaches.

Message 3 of 11
EKroboter
Apprentice

Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

After canceling the Balance job, performance and responsiveness is back to normal. I have deleted the scheduled defrag, scrub and balance jobs as precaution. 
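
For anyone else stuck in the same spot, this is roughly what canceling a balance looks like over ssh. It is only a sketch: it assumes you are logged in as root and that the data volume is mounted at /data. The `run` helper and the `DRY_RUN` switch are illustrative additions, not anything shipped with the NAS, so by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: check for and cancel a running btrfs balance on a mounted volume.
# Assumptions: run as root over ssh; /data is the volume mount point.
# DRY_RUN=1 (the default here) only prints the commands.

VOLUME="${VOLUME:-/data}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

cancel_balance() {
    # Report whether a balance is in progress on the volume.
    run btrfs balance status "$1"
    # Ask btrfs to stop the balance. On a busy box this can take a while
    # to return, so give it time before assuming it has hung.
    run btrfs balance cancel "$1"
}

cancel_balance "$VOLUME"
```

Run it once as-is to see what it would do, then again with `DRY_RUN=0` to actually cancel.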

Message 4 of 11
btaroli
Prodigy

Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

Ugh... My 516 has been similarly afflicted. This is inexcusable.

 

Happened during the day today, but I'm not sure this was a scheduled balance job, as I received no alert email indicating that such a job was starting. It's been running 6.4 for a few days already, and I recall watching the initial updates for quota running just fine. And I happen to know it WAS running a balance because I tend to keep top running in a shell at home.

 

top - 09:21:38 up 5 days, 15:38,  2 users,  load average: 2.21, 2.27, 1.82
Tasks: 225 total,   6 running, 219 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us, 11.4 sy,  0.0 ni, 61.3 id, 27.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem:  16324816 total, 15487560 used,   837256 free,      312 buffers
KiB Swap:  2093052 total,       16 used,  2093036 free, 13655956 cached
packet_write_wait: Connection to 192.168.23.16: Broken pipe
oscar:~ btaroli$ R  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                   
oscar:~ btaroli$ 9  -1 15572 1020  880 R  41.9  0.0   1:55.63 btrfs balance start -dusage 79 -musage 79 /data           
oscar:~ btaroli$ 0   0  226m 9888 7052 D   1.3  0.1   0:03.95 /usr/sbin/afpd -d -F /etc/netatalk/afp.conf               
oscar:~ btaroli$ 0   0 3122m 328m  22m S   1.0  2.1 170:47.62 /apps/dvblink-tv-server/dvblink_server  

So it appears to have lost network connection around 9:20 this morning. 😞 Weird thing is that none of this looks like the box was under any extreme load. The I/O wait time and CPU usage numbers look fine. So this smells like a hang or crash. (sigh)

 

I'm on my second restart now, first was a "soft" down from the front panel but this time I did a full power cycle.

 

The really frustrating thing is that I KNOW how to kill the balance, but I can't get in to the ssh interface to save my life. The box pings after startup but the front panel stays on "Booting...". Control pad lights up twice in the process and disk lights are blinking, so I know it's up to something at least.

 

But with other services trying to fight against the balance to start up, I don't think sshd is getting its turn. After 10-15 minutes the box stops pinging and I see no appreciable action on the disk activity lights.

 

In lieu of being able to ssh in to shut this crap down, will it actually finish once it gets into this state or is it actually hanging? If I do a FP OS reinstall will this subvert the resumption of the balance at restart? Knowing btrfs from using it on Fedora for a while, I suspect not. But this is lunacy.

 

Do we all need to start opening support tickets for this, or are people on it and adding to the pile isn't going to help?

Message 5 of 11
btaroli
Prodigy

Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

OK. Managed to get in. Tried just killing the balance, but it hung almost instantly (and stopped pinging). Rebooted again, waited for the email alert about volume usage over 70% -- never thought I'd find that useful -- and then first stopped all installed apps, waited a bit, and only THEN killed the balance. After a few minutes it finally gave up and stopped. Balance and defrag jobs have been disabled.

 

I now understand why it died around 9:20, too... The balance job kicked off at 9AM (since I'm never home then on weekdays), but then my Macbook began a Time Machine backup at 9:20. The combination of these two seems to have been what fried it. I can accept /slow/ performance during a BTRFS job, but outright hanging the kernel is a bit much. I'll leave those btrfs jobs disabled for now and hope for a fix to that in the near future.

Message 6 of 11
EKroboter
Apprentice

Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

Sounds like we both experienced the same issue. I was lucky enough to be able to cancel the balance at the first try.

I also disabled scheduled jobs and snapshot creation. These two mundane tasks for the NAS used to be trivial and lightweight, never ever hanging the unit. The only thing that I was complaining about before 6.4 was that performance dropped a bit during backup jobs (which is understandable), but now I celebrate if the unit does not freeze for eight hours straight.

 

My advice to everyone is to enable ssh access immediately. I don't know how I could have fixed it without it.

Message 7 of 11
btaroli
Prodigy

Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

So far I have only disabled balance and defrag jobs. Having run btrfs on my own Fedora workstation for quite some time, I've learned that these jobs can introduce a special kind of load on the filesystem that can at times render it nearly unresponsive. What worries me a little is the added task of quota in this release. Even with the very latest kernel and btrfs-progs builds, I found that enabling quota can cause serious problems, and I ultimately decided having it enabled just wasn't worth the headache. We don't have that option here -- and I do actually like being able to see how much space the snapshots actually consume -- so we'll see how things go from here.

Based on my own experience, I'm leaving snapshots enabled. Indeed, even after having all the apps started back up and with a Time Machine backup job running, I noticed btrfs-cleaner kick off (to clean up after scheduled removal of snapshots), and while the process and I/O wait times predictably increased, the system remained operational. So I'm not too worried about snapshots.

Given the size of our /data volumes I think we can live without the balance and defrag jobs for a while. I tend to run them monthly anyway. But I'd rather not have my NAS crash each month. 😉 I do tend to keep an eye on "btrfs fi sh /", as I've found that there is something on my NAS -- could be an app -- that causes its extents to get fully allocated even though the filesystem usage is quite normal. This can cause headaches when doing installs or updates of apps, so I tend to check it every few weeks and do a manual balance on / to clean it up. I did one such balance just last night after getting the NAS back up and it ran just fine. Makes me wonder if this new behavior is the result of enabling quota on /data.
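
For the curious, the periodic check and cleanup on / described above can be sketched like this. Treat it as an illustration under assumptions rather than a recipe: it assumes a root shell, the 50 percent usage filter is an arbitrary starting value and not a recommendation, and the `run` helper, `DRY_RUN` and `USAGE_PCT` are made-up names for the example. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: inspect allocation on the root volume, then run a filtered balance.
# Assumptions: run as root; USAGE_PCT=50 is an arbitrary example value.
# DRY_RUN=1 (the default here) only prints the commands.

DRY_RUN="${DRY_RUN:-1}"
USAGE_PCT="${USAGE_PCT:-50}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

check_allocation() {
    # Long form of "btrfs fi sh": per-device size versus allocated ("used") space.
    run btrfs filesystem show "$1"
    # Allocated versus actually used space for data, metadata and system chunks.
    run btrfs filesystem df "$1"
}

compact_chunks() {
    # Only rewrite data and metadata block groups whose usage is below the
    # given percentage, which is far cheaper than an unfiltered balance.
    run btrfs balance start "-dusage=$2" "-musage=$2" "$1"
}

check_allocation /
compact_chunks / "$USAGE_PCT"
```

If the device "used" figure from the first command is close to its "size" while the df output shows plenty of free space inside the chunks, that is the fully-allocated situation, and the filtered balance is what hands the mostly empty chunks back.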

I quite agree about enabling ssh access, and I have done this from the old RAIDIator days. But it's not necessarily for everyone and there is a mode the NAS can be started up in that enables Netgear to access the NAS remotely via ssh for support purposes.
Message 8 of 11
EKroboter
Apprentice

Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

I haven't enabled disk quotas for our shares, and after all the issues I came across after the update I doubt I ever will. 

I only updated for the option to see just how much space was being consumed by snapshots, which to my surprise was a lot. I had 3.5TB worth of snapshots and only 1.7TB of actual data, so that helped me clear quite a bit of space. Also, the shares now show how much space they're consuming, which is also nice.

 

Clearing the scheduled balance job brought the performance back to normal, and disabling snapshots has actually improved it a bit. It is now more responsive than before, at least the frontview and file browsing are.

 

I don't have many apps running apart from Anti Virus, just the SMB prefs panel and ReadyNAS Surveillance recording from 9 IP cameras (total bandwidth for all of them is 12,030 Kbps, so that should never put a load on the CPU). My guess is that someone screwed up in the implementation of the Balance, Quota and Sync features.

 

I'm not a Linux expert, but I don't mind learning and using the terminal to check up on things. The web ui can only get you so far. 

Message 9 of 11
obaeyens
Aspirant

Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

On the 104, I have had no issues with OS 6.4.0; so far it works fine. This time with anti virus active but bit rot disabled.

 

The 204 so far had one hang (I had to pull the power plug), and I think that it was during a defrag. (Anti-virus active but no bit-rot)

 

And this morning I found it blinking. It did not want to power down, so I also had to pull the plug. When it restarted I saw that it was in a degraded state and had to resync. It resynced and all drives operate normally. So whatever it was, it definitely is not the hardware.

 

I have the impression that the issue happens after you have transferred lots of data on that drive. No hangs when no big data has been added, removed or moved.

Message 10 of 11
btaroli
Prodigy

Re: 6.4 makes our 516 to lock up during disk balance, need to abort. URGENT!

I thought I was going to get away with leaving snapshots in place, but alas no. The cleaner job that runs regularly as snapshots expire causes a serious drain on performance, so much so that PLEX and ZNC become completely inoperative for a while.

 

So I've now taken the perhaps extreme step of disabling all snapshots, and expunged all existing snapshots. It's easier than you think if you've had to wrestle with it before. 🙂

 

I turned off AV a long while back as it eventually caused DBus timeouts, and just never turned it back on. Given these performance issues in 6.4, I'm inclined to leave all the bells and whistles disabled for now. I am however leaving "bitrot" (COW) enabled. 🙂 After all, without COW what's the point, really, of btrfs? 😉

Message 11 of 11