Performance during scrub

StephenB
Guru - Experienced User
Jun 11, 2019
hmuessig wrote:

It would be nice if NetGear had a short tutorial on reading the smartctl report! had to dig a bit to find which attributes are the critical ones and how to interpret the values.

I agree.

Though I don't think development and support/mods are on the same page (since support routinely suggests replacing disks that don't trigger email alerts). Personally I'm with support - I think the alert thresholds for pending and realllocated sectors are too low.
btaroli
Prodigy
Jun 11, 2019
Topic creep! heh

As for email alerts on pending sectors and remaps, I tend to agree. But if you download the logs and look at the disk report it's not quite as bad as smartctl output. :smileyhappy: That's partly what drove my replacement of a couple of disks recently. Also I have observed that some disks will fail the "disk test" without throwing email notifications or having rapidly increasing remap/pending counts.

But back onto the topic, I found that without the baggage of the btrfs scrub, the md resync went very quickly and without impacting other processes. When I went to restart the btrfs scrub (and I was careful to set the ioprio class to 3 (idle)) the kworkers quickly sucked up all the cpu oxygen. So much that proceses on the NAS and trying to access it stalled or timed out. Note that this is on a 528X, not ARM. Sure it only has two hyperthreaded cores... but this isn't a slouch either. But if you park a large enough number of kworkers using 90-100% CPU, any machine will crumble.

I did actually stumble across an old thread around the time of ROS 6.7 in which this topic got a not insignificant amount of discussion. If memory serves, the kworkers were expected to be handing the checksum calculations. But why so many of them monopolized the CPU, or what could be done to mitigate that, were never really clear. Unfortunately, the larger the capacity you're running the worse this seems to be... and certainly the impact lasts much longer.
btaroli
Prodigy
Jun 11, 2019
FWIW...

6.4: https://community.netgear.com/t5/Using-your-ReadyNAS-in-Business/6-4-makes-our-516-to-lock-up-during-disk-balance-need-to-abort/td-p/994880

6.6: https://community.netgear.com/t5/Using-your-ReadyNAS-in-Business/6-6-1-Scrub-Hammers-CPU/td-p/1209059

6.7: https://community.netgear.com/t5/ReadyNAS-Beta/ReadyNAS-OS-6-Software-Version-6-7-5-Question/td-p/1311793
StephenB
Guru - Experienced User
Jun 12, 2019
btaroli wrote:

But back onto the topic, I found that without the baggage of the btrfs scrub, the md resync went very quickly and without impacting other processes. When I went to restart the btrfs scrub (and I was careful to set the ioprio class to 3 (idle)) the kworkers quickly sucked up all the cpu oxygen. So much that proceses on the NAS and trying to access it stalled or timed out. Note that this is on a 528X, not ARM. Sure it only has two hyperthreaded cores... but this isn't a slouch either. But if you park a large enough number of kworkers using 90-100% CPU, any machine will crumble.

I haven't seen this with my RN526x. The maintenance scrub ran last month, and the performance didn't drop so much that it interfered with our usage.

Though I agree that running the btrfs scrub and the mdadm scrub in parallel sounds like a bad idea.
btaroli
Prodigy
Jun 17, 2019
Mmm.. Indeed.

So I noticed the reference to reduced CPU overhead with btrfs scrub (not ROS resync+scrub) with a proper defrag. Now i do have the balance and defrag volume jobs scheduled and they run regularly. I notice that the defrag tends not to take that long. I have millions of files on my NAS though (across several shares), and I always surprises me how fast that defrag finishes.

So I ran a defrag manually, as an -exec within find to target full depth folders in all my shares, app folders, etc. It took a more realistic amount of time, and I ran it repeatedly... noticing that subsequent runs did indeed go much faster.

After doing several rounds of that, I triggered a btrfs scrub with ioprio class 3 (idle). I did notice significant differences. In particular, kworker CPU usage was more in the range of 50-70%. Applications like znc, PLEX, DVBLogic/TVMosaic had no trouble transcoding. The btrfs scrub processes backed off much more nicely, which translated into the kworkers backing off as well (since btrfs wasn't trying to checksum as many blocks at once).

But I *did* notice that smb was noticeably laggy in starting new connections/operations. I waited maybe 30 seconds for a simple multi-GB file copy (single file) to start. Once it began it ran quite well. I did also notice that Time Machine didn't proceed very well (like a couple hundred KB per half hour), but it also didn't fail and tag the backup archive corrupt.

So... better... but it does leave me wondering why the volume defrag job dones't seem to be a complete job, or perhaps is it timing out and being canceled? Hmm.

Bottom line, unfortunately, is that I can't trust that scrub job to do the right thing. I'm also wondering if the defrag job is doing a complete job of defragging as well.
StephenB
Guru - Experienced User
Jun 17, 2019
btaroli wrote:

I'm also wondering if the defrag job is doing a complete job of defragging as well.

There are a couple of parameters - (-l and -t). I don't have any idea how Netgear sets them though.
ronaldvr2132
Apprentice
Oct 05, 2019
There are no recent posts to this thread, but will the issue that a disk scrub runs for days and occupying that much resources that the NAS becomes unusable be resolved? In April 2017 a disk scrub ran for 1 day, in April 2018 it became 2 days and sicns october 2018 it varies between 2-5 days. The actual data usage on the NAS did not vary that much. I don't care how long it takes to be hones but the issue is that the NAS is not usable at the time that a disk scrub runs. I almost have sent back my PC which has makes use of a iSCSI disk on my NAS as it was not working any longer only to realize that it was my NAS that apparantley was not having enough resources available. I have one of the higher performance NAS's of Netgear (the RN628X) and am using it as a single user with only one iSCSI LUN and no other apps or whatsoever installed. I have a quarterly disk scrub scheduled which I will siadble once this disk scrub is finished as it is for me unacceptable that my NAS can't be used for a period that long as it takes to finish a disk scrub. Hope this can be resolved in a future OS release.
Sandshark
Sensei - Experienced User
Oct 05, 2019
At least part of the problem is that a scrub scheduled or manually invoked by the GUI performs a BTRFS scrub and an MDADM re-sync simultaneously, creating lots of drive I/O and thrashing. The situation is far worse on an EDA500 due to the eSATA interface adding an additional bottleneck. Scrubs alone initiated via SSH definately do not have as bad an impact (though they do have some).

While I can understand the need for both, I do not understand how anyone thinks that performing these two drive-intensive operations simultaneously is a reasonable approach. I do not know if the re-sync was added at some point, which would account for the difference in earlier OS versions.

My best guess as to why they are simulataneous: so that the OS need only look at the scrub progress provided by BTRFS (which typically ends last) for the displayed progress bar. That's not a good reason when the results are so bad.
ronaldvr2132
Apprentice
Nov 12, 2019
I thought I disabled the scrub on both of my NAS devices, but apparently not :smileyfrustrated: My second NAS with the latest OS 6.10.2 has worked for 11 days now and getting from 86% to 88% took already more than 3 days. I pressed the cross besides the progress bar to stop the scrub. Is that Oke to do or can I have created a corruption by cancelling the scrub progress. The log shows that the scrub process was stopped so I guess all is fine. I just want to check to be sure and I do hope Netgear solves this as soon as possible.
btaroli
Prodigy
Nov 12, 2019
Yes, that is perfectly fine. Some people never report problems with this, but I've never had a scrub go quickly... or normally.

The underlying issue seems to be that a "scrub" actually triggers a Btrfs scrub *and* an md sync at the same time. Awful. scrub should be scrub, period.

So, with ssh login I've done btrfs scrubs w/o issue, but those can't be scheduled. There may well be a reason they run them this way, but it's horribad for performance in at least some use cases.
btaroli
Prodigy
Dec 25, 2019
I can't say I'm hopeful. It's come up many times over years and there has been no movement on it.

What I've taken to doing is running defrags, scrubs, and balances via SSH. In particular, I launch the scrub with
btrfs scrub start -c 3 /data
and check status with
btrfs scrub status /data
. Should you need/wish you may stop the operation safety by using
btrfs scrub cancel /data
ronaldvr2132
Apprentice
Feb 20, 2020
Thank you btaroli it is a shame that this is not resolved whereas this appratnly is already know by Netgear for years. I have had a lot of issues with my RN628X's and I will reconsider when I replace them if I will stick with the ReadyNas series. The isuses I have had are:
- I have had a bricked RN628X during a firmware update;
- I have had a RN628X that all in a sudden without any notice or root cause got a read pnly volume;
- The RN628X I use as my main device is a lot of the times extremely slow without me understaindfing why (no apps installed and onlu opne activce user via a 1 Gigabit connection not doring anything particular). This sloweness is soo bad I had to abbandon the use of my only iSCSCI LUN that I was using as the disk was contantsly dropping off;
- This main RN628X now performing the regular disk check and whereas this was taking normally max 1 day it is now running for 5 days and I can's see a precentage of completion so I have not a clue if it got stalled or something. I will wait a couple of days more and then perform a reboot to see if that helps;
- This main RN628X has back-up jobs to my second RN628X and all jobs run fine except for one. Also here: I would not have a glue why this is.

All in all my trust in the RN628X is not what I want it to be. I do hope Netgear is reading the community messages and improving on this as I see a lot of room for improvement! I will use your steps to perfrom a scrub test. Do you by chance also have the commands I can use for a SSH session on defrag, balance and disk test as well?

Forum Discussion

Related Content

RN424 NAS access during scrub

Astonishingly slow scrub

Disk errors & a slow scrub - can I replace the drive during a scrub?

ReadyNas 214 - BTRFS error during scrub - Stuck in readonly

RN102 slow performances

NETGEAR Academy

ProSupport for Business