
Aborting a balance

Piglet
Luminary

Aborting a balance

I'm running 6.2.2 on a RN104 with 3x3TB and scheduled a balance a couple of days ago.

I had only 25GB free, which seems to be causing the balance to take a very long time.

After about 20h the display still said "0% done", so I figured I'd restart and try to make some more free space available.

However, after selecting Restart from the web interface I'm now stuck at the "See you soon" display because the system seems to wait with the restart until the balance is done. 😞

All services seem to have shut down. I can't access the NAS via ssh or file sharing any more. But the disks are working hard, so it's not locked up.

I've now waited 48h after the restart attempt and it still hasn't restarted. The balance has been running 3 days...

Any ideas on how long it will take to complete the balance? Any suggestions of what I can do? I obviously can't wait several weeks for it to complete but I'm hesitant to force a restart by pulling the plug after reading stories of the NAS becoming unbootable.
Message 1 of 35
mdgm-ntgr
NETGEAR Employee Retired

Re: Aborting a balance

A balance can be cancelled but with SSH stopped and no access to the web interface there's nothing that can be done.

You had 25GB free? If only a small amount of space is unused, it's recommended to free up some space before running a balance.
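As a rough illustration of that advice (a hypothetical helper, not an official NETGEAR tool; the 50GB margin is an arbitrary assumption, not a documented btrfs requirement):

```shell
# Hypothetical sanity check: refuse to start a balance when the volume
# has less unused space than a chosen safety margin.
enough_space_for_balance() {
    free_gb=$1
    min_gb=${2:-50}   # assumed safety margin, adjust to taste
    [ "$free_gb" -ge "$min_gb" ]
}

# With the 25 GB the original poster had free:
if enough_space_for_balance 25; then
    echo "ok to start the balance"
else
    echo "free up more space first"
fi
# prints: free up more space first
```

The actual unused space can be read with `btrfs filesystem df /data` before deciding.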
Message 2 of 35
Piglet
Luminary

Re: Aborting a balance

Yes, I had 25GB free and I came to realise that it wasn't enough after the balance wasn't getting anywhere after 20h. My mistake was to do a restart before freeing up more space. I assumed it would halt or pause the balance, not pause the shutdown and make the NAS inaccessible. 😞

I guess I'm left with the dilemma of either cutting the power and risking data loss or waiting an unknown time for the balance to finish.

Is there any way to estimate the time it requires? Even if it had to rewrite every single byte on the drives, one would think it would only take a couple of days, not several weeks as the initial "0% done" after 20h indicated.

Right now it's about 70h since the balance started.
Message 3 of 35
mdgm-ntgr
NETGEAR Employee Retired

Re: Aborting a balance

Sent you a PM.

Do you have a backup?
Message 4 of 35
BaJohn
Virtuoso

Re: Aborting a balance

Just to say, I tried to provide info on how long a balance takes.
See http://www.readynas.com/forum/viewtopic.php?f=21&t=80244 for what it is worth.
Unless there are other problems or no free space it is usually very quick.
Message 5 of 35
Piglet
Luminary

Followup

So here's what happened.

 

I waited close to two weeks for the balance to complete but when it hadn't done so I finally pulled the cord.

 

It rebooted normally after that, and it also automatically restarted the balance (at 0%), which I stopped gracefully now that I had ssh access again.

 

I found that the filesystem had been damaged when I tried to delete some files and it suddenly turned read-only. I gather this is a feature of btrfs: when something unexpected happens, it prevents further damage. A reboot returns the filesystem to normal read/write.

 

In order to fix the damaged filesystem I have tried a full scrub (which took 2 days and reported no errors) as well as a new balance (with smaller -dusage values) but the problem still remains.
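For anyone following along, the commands involved look roughly like this (a sketch only; /data is the usual ReadyNAS mount point, which is an assumption, and -dusage=5 is just one example of a "smaller" filter value):

```shell
# Sketch of the maintenance described above. Run as root; adjust the
# mount point to your own volume.
run_light_maintenance() {
    vol=${1:-/data}
    btrfs scrub start -B "$vol"           # -B: stay in foreground until done
    btrfs balance start -dusage=5 "$vol"  # only rewrite data chunks <5% used
}
```

A low -dusage balance touches far fewer chunks than a full one, which is why it can complete even when a full balance hangs.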

 

The kernel.log messages end with the following:

Jul 21 18:15:54 Nasse kernel: WARNING: at fs/btrfs/super.c:255 __btrfs_abort_transaction+0xa4/0xf0()
Jul 21 18:15:54 Nasse kernel: btrfs: Transaction aborted (error -2)
Jul 21 18:15:54 Nasse kernel: Modules linked in: vpd(P)
Jul 21 18:15:54 Nasse kernel: Backtrace: 
Jul 21 18:15:54 Nasse kernel: [<c003c51c>] (dump_backtrace+0x0/0x110) from [<c061e304>] (dump_stack+0x18/0x20)
Jul 21 18:15:54 Nasse kernel:  r6:000000ff r5:c02d1bec r4:c46c3c78 r3:00000000
Jul 21 18:15:54 Nasse kernel: [<c061e2ec>] (dump_stack+0x0/0x20) from [<c006ab4c>] (warn_slowpath_common+0x54/0x70)
Jul 21 18:15:54 Nasse kernel: [<c006aaf8>] (warn_slowpath_common+0x0/0x70) from [<c006ac0c>] (warn_slowpath_fmt+0x38/0x40)
Jul 21 18:15:54 Nasse kernel:  r8:0000160d r7:c063f74c r6:d4d9f000 r5:d4e2c680 r4:fffffffe
Jul 21 18:15:54 Nasse kernel: r3:00000009
Jul 21 18:15:54 Nasse kernel: [<c006abd4>] (warn_slowpath_fmt+0x0/0x40) from [<c02d1bec>] (__btrfs_abort_transaction+0xa4/0xf0)
Jul 21 18:15:54 Nasse kernel:  r3:fffffffe r2:c0753e18
Jul 21 18:15:54 Nasse kernel: [<c02d1b48>] (__btrfs_abort_transaction+0x0/0xf0) from [<c02e24d8>] (__btrfs_free_extent+0x5a0/0x8cc)
Jul 21 18:15:54 Nasse kernel:  r8:d9f2a240 r7:00000000 r6:001020d8 r5:00000000 r4:00000000
Jul 21 18:15:54 Nasse kernel: [<c02e1f38>] (__btrfs_free_extent+0x0/0x8cc) from [<c02e6638>] (run_clustered_refs+0xa1c/0xe90)
Jul 21 18:15:54 Nasse kernel: [<c02e5c1c>] (run_clustered_refs+0x0/0xe90) from [<c02ea530>] (btrfs_run_delayed_refs+0xbc/0x528)
Jul 21 18:15:54 Nasse kernel: [<c02ea474>] (btrfs_run_delayed_refs+0x0/0x528) from [<c02f9fbc>] (btrfs_commit_transaction+0x90/0x8f0)
Jul 21 18:15:54 Nasse kernel: [<c02f9f2c>] (btrfs_commit_transaction+0x0/0x8f0) from [<c02f3840>] (transaction_kthread+0x1b0/0x1c4)
Jul 21 18:15:54 Nasse kernel: [<c02f3690>] (transaction_kthread+0x0/0x1c4) from [<c00843f0>] (kthread+0x8c/0x94)
Jul 21 18:15:54 Nasse kernel: [<c0084364>] (kthread+0x0/0x94) from [<c006e1f8>] (do_exit+0x0/0x6a8)
Jul 21 18:15:54 Nasse kernel:  r6:c006e1f8 r5:c0084364 r4:d46d3cc0
Jul 21 18:15:54 Nasse kernel: ---[ end trace 464cda4a3b14cdb0 ]---
Jul 21 18:15:54 Nasse kernel: BTRFS error (device md127) in __btrfs_free_extent:5645: errno=-2 No such entry
Jul 21 18:15:54 Nasse kernel: BTRFS info (device md127): forced readonly
Jul 21 18:15:54 Nasse kernel: BTRFS debug (device md127): run_one_delayed_ref returned -2
Jul 21 18:15:54 Nasse kernel: BTRFS error (device md127) in btrfs_run_delayed_refs:2688: errno=-2 No such entry

 

Any suggestions to repair this? I've read about btrfsck, but I wanted to check if there are other options before trying that, since it seems like a last resort.

 

Also, I'd like to suggest that ReadyNAS always pause any btrfs operation like a balance before it reboots, instead of waiting for it to complete, to prevent things like this from happening.
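In shell terms, the suggested shutdown behaviour would be something like this (a hypothetical wrapper; `btrfs balance status` and `btrfs balance pause` are real btrfs-progs subcommands, the surrounding hook is illustrative only):

```shell
# Hypothetical shutdown hook: pause a running balance instead of
# blocking the reboot on it.
pause_balance_before_reboot() {
    vol=${1:-/data}
    if btrfs balance status "$vol" | grep -q "is running"; then
        btrfs balance pause "$vol"
    fi
    # ... continue with the normal shutdown sequence ...
}
```

Note that a paused balance resumes automatically on the next mount unless the volume is mounted with `-o skip_balance`.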

 

Message 6 of 35
mdgm-ntgr
NETGEAR Employee Retired

Re: Followup

Do you have a backup? If not I would suggest backing up your data (if you can) as the next step.

 

We have a change coming in a future firmware release to attempt to cancel any running balances before shutting down.

 

Please send your logs in (see the Sending Logs link in my sig)

Message 7 of 35
Piglet
Luminary

Re: Followup

Good to hear that future versions might avoid this problem. As for my current issue; I don't have a backup but I'm working on finding space for the data and making copies at the moment.

 

I have sent you the logs.

 

Message 8 of 35
Piglet
Luminary

Re: Followup

I have now copied all my data off the NAS.

 

Unfortunately I lost a 400GB directory after moving it to a connected USB disk using the web interface. After leaving the copy overnight I woke up to a frozen NAS. I had to cut the power to reboot it. The logs indicated it had gone into read-only mode again and the directory I had copied had been deleted, but the disk appeared empty. Using recovery software I eventually found about 200GB of the files, although most of them without filenames, so it will take a long time to piece together what is what.

 

I ran "btrfs check" on the disk:

# btrfs check /dev/md127
Checking filesystem on /dev/md127
UUID: 34bda540-18c4-4437-b708-f7d6d81b53c3
checking extents
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
checking csums
checking root refs
found 1762329360808 bytes used err is 0
total csum bytes: 1047206540
total tree bytes: 2745073664
total fs tree bytes: 1211465728
total extent tree bytes: 228753408
btree space waste bytes: 484263440
file data blocks allocated: 67038373376000
 referenced 5557936832512
Btrfs v3.17.3

After doing that I tried once again to delete files, but it again triggered the errors leading to the read-only state.

 

Is there anything else I could try before giving up and reformatting everything?

Message 9 of 35
mdgm-ntgr
NETGEAR Employee Retired

Re: Followup

Sent you a PM

Message 10 of 35
EKroboter
Apprentice

Re: Followup

The 6.4 firmware update continues to screw up everything; it's becoming the worst update ever from Netgear.

During a disk balance task, our 516 completely locks up. No frontview access, no SSH, no ping responses. Nothing.

 

After manually restarting the device, the disk balance starts all over again and will lockup the system eventually (sometimes at 40%, others at 62%, it's completely random).

 

I have no way to cancel this job from frontview. I have ssh access but I need further instructions.

Message 11 of 35
Piglet
Luminary

Re: Followup

The command to stop a balance is:

btrfs fi balance cancel /data

(where /data is the path)

 

If you want to see the current status of running balance operations, use:

btrfs fi balance status /data

 

Message 12 of 35
EKroboter
Apprentice

Re: Followup

Thanks man!

Should this be a quick kill? My terminal has been like this for the past few minutes:

 

Welcome to ReadyNASOS 6.4.0

Last login: Mon Oct 19 11:56:23 2015 from 192.168.1.73
root@NAS-EK:~# btrfs fi balance cancel /data

And the frontview is completely unresponsive. 

However, the second command shows:

 

Last login: Mon Oct 19 12:00:30 2015 from 192.168.1.73
root@NAS-EK:~# btrfs fi balance status /data
Balance on '/data' is running, cancel requested
2 out of about 436 chunks balanced (1132 considered), 100% left
root@NAS-EK:~# 
Message 13 of 35
Piglet
Luminary

Re: Followup


@EKroboter wrote:

Should this be a quick kill? My terminal has been like this for the past few minutes:

I'm not sure. I've only done it once and as far as I remember it was fairly quick, but not instantaneous. I'm guessing it has to finish up the current chunk before it can cancel gracefully.
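For anyone scripting that wait, a small sketch (a hypothetical wrapper; it assumes the volume is /data, and the 10-second poll interval is arbitrary):

```shell
# Hypothetical wrapper: request the cancel, then poll until btrfs no
# longer reports the balance as running. "btrfs balance cancel" itself
# only returns after the chunk in progress has been finished.
cancel_balance_and_wait() {
    vol=${1:-/data}
    btrfs balance cancel "$vol" &
    while btrfs balance status "$vol" | grep -q "is running"; do
        sleep 10
    done
    wait
    echo "balance on $vol stopped"
}
```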
Message 14 of 35
EKroboter
Apprentice

Re: Followup

Thanks. I'll wait it out. In the meantime the NAS and all the shares are fully accessible, so at least we can get some work done. The frontview won't load though.

Message 15 of 35
EKroboter
Apprentice

Re: Followup

It eventually stopped and now frontview is accessible and fast again. No connection issues so far; performance seems to be on par with yesterday.

I cancelled all scheduled defrag, scrub and balance tasks for good measure.

Message 16 of 35
pdkillian
Aspirant

Re: Followup

I had the same problem with my 314.  Glad there was a way to cancel the disk balance.

 

Is there a way to disable disk balance?

 

 

Message 17 of 35
EKroboter
Apprentice

Re: Followup

The ssh command worked for me, and I also disabled scheduled disk maintenance (defrag, scrub and balance) for the time being.

Message 18 of 35
mschaffl
Aspirant

Re: Aborting a balance

 
Message 19 of 35
rugene
Guide

Re: Aborting a balance

similar issues with balance for me on a RN314.

I used the "btrfs fi balance cancel /data" command to kill the operation.

Running RAID5 X-RAID; 6.4.0; 4x 4TB Red drives, 2.2TB free space.

Balance hangs the system every time.

 

I've also stopped scrub and defrag until this issue gets resolved.

 

not good...

Message 20 of 35
tfau
Aspirant

Re: Aborting a balance

Add another RN314 to the sad list of hung systems after a balance.

 

6.4.1 can't come soon enough.

Message 21 of 35
JennC
NETGEAR Employee Retired

Re: Aborting a balance

Hello tfau,

 

Please try using 6.4.1-RC3 and see if the same problem occurs.

 

Regards,

Message 22 of 35
Les62
Aspirant

Re: Aborting a balance

Took me 4 attempts at killing the balance!!! Grrr!

Message 23 of 35
quickly_now
Apprentice

Re: Aborting a balance

I have a similar issue using OS 6.4.1 on a 314. Balance is running, and has been for > 12 hours (to get 5% complete).

 

The machine is barely responsive: copies from a Windows PC take a very long time, and frontview sometimes responds, sometimes doesn't.

 

When the machine was running the 6.2 series of the OS, balance performance was OK.

 

A performance improvement here would be nice, please.

Message 24 of 35
mdgm-ntgr
NETGEAR Employee Retired

Re: Aborting a balance

quickly_now, if you can download your logs, can you send those in please (see the Sending Logs link in my sig)?

Message 25 of 35