

btaroli
Prodigy

Performance during scrub

First off, yes I’ve definitely commented on this in the past. But this time I have context I didn’t have before. In particular, in the past week or so I have replaced three drives (due to increasing remapped or pending blocks) and increased capacity in a fourth. So I’ve had the opportunity to be using the system while a stripe was resyncing or reshaping (two drives went from 4 to 8TB).

What I found is that during the drive swaps I experienced almost no performance issues, even with Plex, DVBLogic, or TVMosaic doing transcoding. But of course I wanted to run the final result through the paces of rebalance, defrag, and scrub.

What I have found is that the scrub is causing real performance headaches. CPUs are being monopolized by kernel threads and this results in transcodes pausing so often as to be completely useless. It’s bad.

But when I look at /proc/mdstat it’s just a resync! So why is a resync triggered by a scrub resulting in such different performance to one triggered by drive replacements? I don’t get that.

I’m also wondering why a scrub isn’t a btrfs scrub, but that’s a bit of an aside...
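
(For reference, this is what I'm watching from SSH while it runs; md127 is the data volume on my box, yours may differ:)

cat /proc/mdstat                        # resync progress and current speed
cat /sys/block/md127/md/sync_action     # reports "check", "resync", or "idle"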
Model: RN528X|ReadyNAS 528X - Premium Performance Business Data Storage - 8-Bay
Message 1 of 21
StephenB
Guru

Re: Performance during scrub


@btaroli wrote:

I’m also wondering why a scrub isn’t a btrfs scrub, but that’s a bit of an aside...

I believe that the maintenance task does both an mdadm scrub and a btrfs scrub.

 


@btaroli wrote:
So why is a resync triggered by a scrub resulting in such different performance to one triggered by drive replacements? I don’t get that.


No explanation (and on my main NAS, scrubs don't disrupt our normal usage the way they are on your system).  Can you see the arguments being passed into mdadm?

Message 2 of 21
btaroli
Prodigy

Re: Performance during scrub

No, I don't see where the mdadm options were shown, but I do see this in the system journal.

 

Jun 08 11:53:07 bigbird kernel: md: requested-resync of RAID array md127
Jun 08 11:53:07 bigbird kernel: md: minimum _guaranteed_ speed: 30000 KB/sec/disk.
Jun 08 11:53:07 bigbird kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
Jun 08 11:53:07 bigbird kernel: md: using 128k window, over a total of 3902166784k.
Jun 08 11:53:07 bigbird readynasd[5577]: scrub started on /data, fsid a00de7ae-c40a-4b0e-8eee-96bbebcf08ca (pid=10930)
Jun 08 11:53:07 bigbird readynasd[5577]: Scrub started for volume data.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sda3 also saved as /var/backups/md/tmp/ata-ST4000DM000-1F2168_Z301Y1GN-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdb3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1ELCK8-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdd3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EGR7BDX-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sde3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EGS8SHX-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdf3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1EMQSK-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdg3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EK31A0X-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdh3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1ES9J9-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdi3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1EVFHK-part3.
Jun 08 11:53:10 bigbird msmtpq[10967]: mail for [ -C /etc/msmtprc nasalerts@billsden.org --timeout=60 ] : send was successful
Jun 08 11:53:14 bigbird mdadm[4919]: 10926 (process ID) old priority 0, new priority 19
Jun 08 11:53:14 bigbird mdadm[4919]: Backing up RAID configs...
Jun 08 11:53:14 bigbird mdadm[4919]: found 33 raid config backups
Jun 08 11:53:14 bigbird mdadm[4919]: pruning raid config /var/backups/md/raid_config_data-0_2017_07_05_222327.tar.xz
Jun 08 11:53:14 bigbird mdadm[4919]: RebuildStarted event detected on md device /dev/md127, component device resync

 

I hadn't looked before but I DO see that there is a btrfs scrub running on /data simultaneous with mdadm's resync... which really doesn't seem like such a great idea, given that they are both heavy I/O tasks. I've canceled the btrfs scrub directly, and the machine is MUCH more responsive now. Perhaps I'll kick it off manually once the mdadm resync completes.
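
(In case it helps anyone, the cancel is just one command over SSH, and restarting it later at idle I/O priority should be equally simple:)

btrfs scrub cancel /data       # stop the btrfs scrub; the md resync keeps going
btrfs scrub start -c 3 /data   # restart later at idle I/O class, once the resync is done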

 

I can already see the data rate of the resync has increased from the minimum up to 70MB/sec, and in the past I've seen it get upwards of 130MB/sec. Plex playback (of the same program I was struggling with last night) is completely solid. So now I just need to know whether these two tasks are expected to be running simultaneously and, if not, why the heck they are behaving that way for me.
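
(Side note: the 30000 and 200000 KB/sec figures in the journal above are md's global sync speed limits. If you ever need to inspect or throttle the resync itself, I believe the usual sysctls apply, e.g.:)

sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
echo 50000 > /proc/sys/dev/raid/speed_limit_max   # cap the resync rate (KB/sec)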

Model: RN528X|ReadyNAS 528X - Premium Performance Business Data Storage - 8-Bay
Message 3 of 21
Sandshark
Sensei

Re: Performance during scrub

I have also observed this, and mentioned it here to Netgear's apparently deaf ears.  It's most egregious on an EDA500 and pretty bad on a 100-series NAS.  Pretty clearly, the problem is that the scrub initiated by the GUI, either by schedule or manually, triggers the BTRFS scrub and an MDADM re-sync at the same time.  If I command a scrub or re-sync individually via SSH, there is no issue.  I suspect the re-sync and scrub start having read/write collision issues and, eventually, that drives the readynasd process to 100% usage.  I can only guess here that readynasd also has a read/write collision but is not forgiving of it and just begins trying to assert access to the drives with more and more unsuccessful attempts.  Once that happens, the NAS is effectively offline.  I have only been able to determine this by having one or more SSH sessions already open before starting the scrub, as even SSH login becomes impossible at that point.

 

The scrub then slows even more, so it takes a long time to clear up.  But if you have days to wait, it does eventually seem to work its way through it.

 

The re-sync barely makes any progress while the scrub is ongoing, so I fail to see the point of starting them simultaneously.  The negative consequences, on the other hand, are abundantly clear.

Message 4 of 21
btaroli
Prodigy

Re: Performance during scrub

My observation is that the CPU usage isn't with readynasd but instead the kworkers. Given how the md resync goes, that's likely coming from the btrfs scrub, and of course when I stopped the scrub the kworkers shut up completely. But apart from the poor interaction between mdadm and btrfs/kernel, there is the not-so-insignificant matter of the actual file services and applications. Indeed, I recall seeing lots of SQLite timeouts in the journal as well, suggesting that ROS/frontview was having trouble just keeping itself running.
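
(For anyone who wants to check this on their own box, something along these lines over SSH will show whether it's kworker threads or readynasd at the top of the CPU list:)

ps -eo pid,%cpu,stat,comm --sort=-%cpu | head -15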

 

It's entirely possible for code to monitor the actual progress of both resync and scrub, so why these would be triggered simultaneously is baffling. I'm definitely disabling the scheduled scrub on all the boxes in my purview until this is addressed. What's concerning is that this isn't a new issue... I can recall seeing various complaints of poor scrub performance for many years (including a few of my own posts).
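
(It wouldn't even take much code; a rough sketch of the sequencing I'd expect, assuming the data array is md127 as it is here:)

echo check > /sys/block/md127/md/sync_action                                # md consistency check first
while ! grep -q idle /sys/block/md127/md/sync_action; do sleep 60; done    # wait for the array to go idle
btrfs scrub start -c 3 /data                                                # only then start the btrfs scrub, at idle priority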

 

I hesitate to log this as an idea, because it's more a bug than an enhancement. Do we have a vehicle to raise bugs? This definitely is one.

 

Message 5 of 21
Sandshark
Sensei

Re: Performance during scrub

The kworkers do take up some CPU time and can slow things down, especially on the lower-powered NASes.  But it's only when it "bleeds over" to readynasd that I've found the NAS completely inaccessible.  I do only have one ARM-based NAS, which I use just for occasional testing, so my experience there is limited.

Message 6 of 21
hmuessig
Luminary

Re: Performance during scrub

FWIW, My 314 has been doing a scrub now for almost 36 hours and is completely unresponsive to any attempt to log into it . . .   And using either finder (Mac) or Explorer (Windows 10) results in the NAS being unreachable.

 

OS is 6.10.1, 4 2TB Reds.

 

Model: RN31442E|ReadyNAS 300 Series 4- Bay (4x 2TB Enterprise)
Message 7 of 21
StephenB
Guru

Re: Performance during scrub


@hmuessig wrote:

FWIW, My 314 has been doing a scrub now for almost 36 hours and is completely unresponsive to any attempt to log into it . . .   And using either finder (Mac) or Explorer (Windows 10) results in the NAS being unreachable.

 

OS is 6.10.1, 4 2TB Reds.

 


This sounds like one of the disks might be failing.  Do you have ssh access?

Message 8 of 21
hmuessig
Luminary

Re: Performance during scrub

Good call, StephenB! Looks like two drives are failing!

 

I do have SSH and used smartctl -x /dev/sdx where "x" is "a" through "d" for the four drives.

 

It would be nice if NetGear had a short tutorial on reading the smartctl report! I had to dig a bit to find which attributes are the critical ones and how to interpret the values.
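
(For anyone else digging through smartctl output, this is roughly where I ended up — not official Netgear guidance, just the counters that seem to matter; non-zero raw values for reallocated, pending, or offline-uncorrectable sectors are the usual red flags:)

for d in a b c d; do
  echo "=== /dev/sd$d ==="
  smartctl -A /dev/sd$d | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect|UDMA_CRC_Error_Count'
done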

 

Not a really big deal, as this is my test NAS. So the two failing drives will get replaced shortly.

 

Tx!

Model: RN31443E|ReadyNAS 300 Series 4- Bay (4x 3TB Enterprise)
Message 9 of 21
StephenB
Guru

Re: Performance during scrub


@hmuessig wrote:

 

It would be nice if NetGear had a short tutorial on reading the smartctl report! I had to dig a bit to find which attributes are the critical ones and how to interpret the values.


I agree. 

 

Though I don't think development and support/mods are on the same page (since support routinely suggests replacing disks that don't trigger email alerts).  Personally I'm with support - I think the alerts should trigger at lower pending and reallocated sector counts than they currently do.

Message 10 of 21
btaroli
Prodigy

Re: Performance during scrub

Topic creep! heh

 

As for email alerts on pending sectors and remaps, I tend to agree. But if you download the logs and look at the disk report, it's not quite as bad as the smartctl output. :-) That's partly what drove my replacement of a couple of disks recently. Also, I have observed that some disks will fail the "disk test" without throwing email notifications or having rapidly increasing remap/pending counts.

 

But back onto the topic: I found that without the baggage of the btrfs scrub, the md resync went very quickly and without impacting other processes. When I went to restart the btrfs scrub (and I was careful to set the ioprio class to 3 (idle)), the kworkers quickly sucked up all the CPU oxygen. So much so that processes on the NAS, and attempts to access it, stalled or timed out. Note that this is on a 528X, not ARM. Sure, it only has two hyperthreaded cores... but this isn't a slouch either. But if you park a large enough number of kworkers using 90-100% CPU, any machine will crumble.

 

I did actually stumble across an old thread from around the time of ROS 6.7 in which this topic got a not insignificant amount of discussion. If memory serves, the kworkers were expected to be handling the checksum calculations. But why so many of them monopolized the CPU, or what could be done to mitigate that, was never really made clear. Unfortunately, the larger the capacity you're running, the worse this seems to be... and certainly the impact lasts much longer.

Model: RN528X|ReadyNAS 528X - Premium Performance Business Data Storage - 8-Bay
Message 11 of 21
StephenB
Guru

Re: Performance during scrub


@btaroli wrote:

 

But back onto the topic: I found that without the baggage of the btrfs scrub, the md resync went very quickly and without impacting other processes. When I went to restart the btrfs scrub (and I was careful to set the ioprio class to 3 (idle)), the kworkers quickly sucked up all the CPU oxygen. So much so that processes on the NAS, and attempts to access it, stalled or timed out. Note that this is on a 528X, not ARM. Sure, it only has two hyperthreaded cores... but this isn't a slouch either. But if you park a large enough number of kworkers using 90-100% CPU, any machine will crumble.

 


I haven't seen this with my RN526x.  The maintenance scrub ran last month, and the performance didn't drop so much that it interfered with our usage.

 

Though I agree that running the btrfs scrub and the mdadm scrub in parallel sounds like a bad idea.

 

 

Message 13 of 21
btaroli
Prodigy

Re: Performance during scrub

Mmm.. Indeed.

 

So I noticed the reference to reduced CPU overhead with a btrfs scrub (not ROS resync+scrub) after a proper defrag. Now I do have the balance and defrag volume jobs scheduled, and they run regularly. I notice that the defrag tends not to take that long. I have millions of files on my NAS though (across several shares), and it always surprises me how fast that defrag finishes.

 

So I ran a defrag manually, as an -exec within find to target full depth folders in all my shares, app folders, etc. It took a more realistic amount of time, and I ran it repeatedly... noticing that subsequent runs did indeed go much faster.
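
(Something along these lines, for anyone curious — the paths and the target extent size are my own choices, not whatever the scheduled job uses:)

find /data -xdev -type d -exec btrfs filesystem defragment {} +          # directory metadata
find /data -xdev -type f -exec btrfs filesystem defragment -t 32M {} +   # file data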

 

After doing several rounds of that, I triggered a btrfs scrub with ioprio class 3 (idle). I did notice significant differences. In particular, kworker CPU usage was more in the range of 50-70%. Applications like znc, PLEX, DVBLogic/TVMosaic had no trouble transcoding. The btrfs scrub processes backed off much more nicely, which translated into the kworkers backing off as well (since btrfs wasn't trying to checksum as many blocks at once).

 

But I *did* notice that smb was noticeably laggy in starting new connections/operations. I waited maybe 30 seconds for a simple multi-GB file copy (single file) to start. Once it began it ran quite well. I did also notice that Time Machine didn't proceed very well (like a couple hundred KB per half hour), but it also didn't fail and tag the backup archive corrupt.

 

So... better... but it does leave me wondering why the volume defrag job doesn't seem to do a complete job, or perhaps it's timing out and being canceled? Hmm.

 

Bottom line, unfortunately, is that I can't trust that scrub job to do the right thing. I'm also wondering if the defrag job is doing a complete job  of defragging as well.

Model: RN528X|ReadyNAS 528X - Premium Performance Business Data Storage - 8-Bay
Message 14 of 21
StephenB
Guru

Re: Performance during scrub


@btaroli wrote:

I'm also wondering if the defrag job is doing a complete job  of defragging as well.


There are a couple of parameters (-l and -t).  I don't have any idea how Netgear sets them though.
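
On the command line they look something like this (values purely illustrative, and the path is made up):

btrfs filesystem defragment -r -t 32M -l 1G /data/Documents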

Message 15 of 21
ronaldvr2132
Apprentice

Re: Performance during scrub

There are no recent posts to this thread, but will the issue that a disk scrub runs for days and occupies so many resources that the NAS becomes unusable be resolved? In April 2017 a disk scrub ran for 1 day, in April 2018 it became 2 days, and since October 2018 it varies between 2 and 5 days. The actual data usage on the NAS did not vary that much. I don't care how long it takes, to be honest, but the issue is that the NAS is not usable while a disk scrub runs. I almost sent back my PC, which makes use of an iSCSI disk on my NAS, because it was no longer working, only to realize that it was my NAS that apparently did not have enough resources available. I have one of Netgear's higher-performance NASes (the RN628X) and am using it as a single user with only one iSCSI LUN and no other apps whatsoever installed. I have a quarterly disk scrub scheduled, which I will disable once this disk scrub is finished, as it is unacceptable to me that my NAS can't be used for as long as it takes a disk scrub to finish. Hope this can be resolved in a future OS release.

Model: RN628X|ReadyNAS 628X - Ultimate Performance Business Data Storage - 8-Bay
Message 16 of 21
Sandshark
Sensei

Re: Performance during scrub

At least part of the problem is that a scrub scheduled or manually invoked by the GUI performs a BTRFS scrub and an MDADM re-sync simultaneously, creating lots of drive I/O and thrashing.  The situation is far worse on an EDA500 due to the eSATA interface adding an additional bottleneck.  Scrubs alone initiated via SSH definitely do not have as bad an impact (though they do have some).
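
(For reference, the two pieces can be kicked off individually over SSH along these lines, with md127 being the data volume on a typical single-volume system:)

echo check > /sys/block/md127/md/sync_action    # MDADM consistency check only
btrfs scrub start -c 3 /data                    # BTRFS scrub only, at idle I/O class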

 

While I can understand the need for both, I do not understand how anyone thinks that performing these two drive-intensive operations simultaneously is a reasonable approach.  I do not know if the re-sync was added at some point, which would account for the difference in earlier OS versions.

 

My best guess as to why they are simultaneous:  so that the OS need only look at the scrub progress provided by BTRFS (which typically ends last) for the displayed progress bar.  That's not a good reason when the results are so bad.

Message 17 of 21
ronaldvr2132
Apprentice

Re: Performance during scrub

I thought I disabled the scrub on both of my NAS devices, but apparently not :( My second NAS, on the latest OS 6.10.2, has now been scrubbing for 11 days, and getting from 86% to 88% already took more than 3 days. I pressed the cross beside the progress bar to stop the scrub. Is that OK to do, or could I have created corruption by cancelling the scrub? The log shows that the scrub process was stopped, so I guess all is fine. I just want to check to be sure, and I do hope Netgear solves this as soon as possible.

Message 18 of 21
btaroli
Prodigy

Re: Performance during scrub

Yes, that is perfectly fine. Some people never report problems with this, but I've never had a scrub go quickly... or normally.

 

The underlying issue seems to be that a "scrub" actually triggers a Btrfs scrub *and* an md sync at the same time. Awful. A scrub should be a scrub, period.

 

So, with ssh login I've done btrfs scrubs w/o issue, but those can't be scheduled. There may well be a reason they run them this way, but it's horribad for performance in at least some use cases.
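
(In principle a cron entry could stand in for the GUI schedule — an untested sketch, with a made-up file name, and a firmware update may well overwrite it:)

# /etc/cron.d/manual-scrub
PATH=/sbin:/usr/sbin:/bin:/usr/bin
0 2 1 1,4,7,10 * root btrfs scrub start -c 3 /data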

Message 19 of 21
btaroli
Prodigy

Re: Performance during scrub

I can't say I'm hopeful. It's come up many times over the years and there has been no movement on it.

 

What I've taken to doing is running defrags, scrubs, and balances via SSH. In particular, I launch the scrub with

btrfs scrub start -c 3 /data

and check status with

btrfs scrub status /data

Should you need or wish to, you may stop the operation safely with

btrfs scrub cancel /data
Model: RN528X|ReadyNAS 528X - Premium Performance Business Data Storage - 8-Bay
Message 20 of 21
ronaldvr2132
Apprentice

Re: Performance during scrub

Thank you @btaroli, it is a shame that this is not resolved when it has apparently been known to Netgear for years. I have had a lot of issues with my RN628X's, and when I replace them I will reconsider whether I stick with the ReadyNAS series. The issues I have had are:

- I have had a bricked RN628X during a firmware update;

- I have had an RN628X that all of a sudden, without any notice or root cause, got a read-only volume;

- The RN628X I use as my main device is a lot of the time extremely slow without me understanding why (no apps installed and only one active user via a 1 Gigabit connection not doing anything in particular). This slowness is so bad I had to abandon the use of my only iSCSI LUN, as the disk was constantly dropping off;

- This main RN628X is now performing the regular disk check, and whereas this normally took at most 1 day, it has now been running for 5 days and I can't see a percentage of completion, so I have no clue whether it has stalled or something. I will wait a couple of days more and then perform a reboot to see if that helps;

- This main RN628X has backup jobs to my second RN628X, and all jobs run fine except for one. Also here, I would not have a clue why this is.

 

All in all, my trust in the RN628X is not what I want it to be. I do hope Netgear is reading the community messages and improving on this, as I see a lot of room for improvement! I will use your steps to perform a scrub test. Do you by chance also have the commands I can use in an SSH session for defrag, balance, and disk test as well?

Message 21 of 21