Forum Discussion
btaroli
Jun 09, 2019Prodigy
Performance during scrub
First off, yes I’ve definitely commented on this in the past. But this time I have context I didn’t have before. In particular, in the past week or so I have replaced three drives (due to increasing r...
StephenB
Jun 09, 2019Guru - Experienced User
btaroli wrote:
I’m also wondering why a scrub isn’t a btrfs scrub, but that’s a bit of an aside...
I believe that the maintenance task does both an mdadm scrub and a btrfs scrub.
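If you want to confirm that yourself over SSH, a quick check shows both passes (md127 and /data are the usual ReadyNAS OS6 data array and volume names, and match the journal output further down; adjust if yours differ):

    # mdadm-level pass: shows a progress bar and current speed while a resync/check runs
    cat /proc/mdstat
    # btrfs-level pass: shows bytes scrubbed and any errors found so far
    btrfs scrub status /data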
btaroli wrote:
So why is a resync triggered by a scrub resulting in such different performance to one triggered by drive replacements? I don’t get that.
No explanation (and on my main NAS, scrubs don't disrupt our normal usage the way they do on your system). Can you see the arguments being passed into mdadm?
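Something along these lines over SSH might show what was actually kicked off, if anything is visible from userspace (md127 assumed as the data array; the resync may also have been triggered through sysfs rather than an mdadm command line):

    # command lines of any md/mdadm-related processes
    ps -eo pid,args | grep -E '[m]dadm|[m]d127'
    # which kind of md pass is in flight: check, resync, repair, or idle
    cat /sys/block/md127/md/sync_action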
- btaroliJun 09, 2019Prodigy
No, I don't see where the mdadm options were shown, but I do see this in the system journal.
Jun 08 11:53:07 bigbird kernel: md: requested-resync of RAID array md127
Jun 08 11:53:07 bigbird kernel: md: minimum _guaranteed_ speed: 30000 KB/sec/disk.
Jun 08 11:53:07 bigbird kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
Jun 08 11:53:07 bigbird kernel: md: using 128k window, over a total of 3902166784k.
Jun 08 11:53:07 bigbird readynasd[5577]: scrub started on /data, fsid a00de7ae-c40a-4b0e-8eee-96bbebcf08ca (pid=10930)
Jun 08 11:53:07 bigbird readynasd[5577]: Scrub started for volume data.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sda3 also saved as /var/backups/md/tmp/ata-ST4000DM000-1F2168_Z301Y1GN-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdb3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1ELCK8-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdd3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EGR7BDX-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sde3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EGS8SHX-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdf3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1EMQSK-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdg3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EK31A0X-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdh3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1ES9J9-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdi3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1EVFHK-part3.
Jun 08 11:53:10 bigbird msmtpq[10967]: mail for [ -C /etc/msmtprc nasalerts@billsden.org --timeout=60 ] : send was successful
Jun 08 11:53:14 bigbird mdadm[4919]: 10926 (process ID) old priority 0, new priority 19
Jun 08 11:53:14 bigbird mdadm[4919]: Backing up RAID configs...
Jun 08 11:53:14 bigbird mdadm[4919]: found 33 raid config backups
Jun 08 11:53:14 bigbird mdadm[4919]: pruning raid config /var/backups/md/raid_config_data-0_2017_07_05_222327.tar.xz
Jun 08 11:53:14 bigbird mdadm[4919]: RebuildStarted event detected on md device /dev/md127, component device resync
I hadn't looked before but I DO see that there is a btrfs scrub running on /data simultaneous with mdadm's resync... which really doesn't seem like such a great idea, given that they are both heavy I/O tasks. I've canceled the btrfs scrub directly, and the machine is MUCH more responsive now. Perhaps I'll kick it off manually once the mdadm resync completes.
I can already see the data rate of the resync has increased from the minimum up to 70MB/sec, and in the past I've seen it get upwards of 130MB/sec. Plex playback (of the same program I was struggling with last night) is completely solid. So now I just need to know whether these two tasks are expected to run simultaneously and, if not, why the heck they are behaving that way for me.
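For anyone following along, the commands in question were roughly these (/data and md127 as in the journal snippet above):

    # stop the in-flight filesystem scrub; it can be picked up again later
    # with 'btrfs scrub resume /data' or restarted with 'btrfs scrub start /data'
    btrfs scrub cancel /data
    # watch the resync speed recover in the md progress line
    watch -n 5 cat /proc/mdstat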
- SandsharkJun 09, 2019Sensei - Experienced User
I have also observed this, and mentioned it here, apparently to deaf ears at Netgear. It's most egregious on an EDA500 and pretty bad on a 100-series NAS. Pretty clearly, the problem is that the scrub initiated by the GUI, whether scheduled or manual, triggers the BTRFS scrub and an MDADM re-sync at the same time. If I command a scrub or re-sync individually via SSH (see the sketch at the end of this post), there is no issue. I suspect the re-sync and scrub start having read/write collision issues and, eventually, that drives the readynasd process to 100% usage. I can only guess here that readynasd also hits a read/write collision but is not forgiving of it and just begins trying to assert access to the drives with more and more unsuccessful attempts. Once that happens, the NAS is effectively offline. I have only been able to determine this by having one or more SSH sessions already open before starting the scrub, as even SSH login becomes impossible at that point.
The scrub then slows even more, so it takes a long time to clear up. But if you have days to wait, it does eventually seem to work its way through it.
The re-sync barely makes any progress while the scrub is ongoing, so I fail to see the point of starting them simultaneously. The negative consequences, on the other hand, are abundantly clear.
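To spell out what I mean by commanding them individually (md127 and /data assumed, as elsewhere in this thread; substitute your own array and volume):

    # read-only parity check of the md array; watch progress in /proc/mdstat
    echo check > /sys/block/md127/md/sync_action
    # then, once /proc/mdstat shows the check has finished, run the
    # filesystem-level scrub on its own
    btrfs scrub start /data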
- btaroliJun 09, 2019Prodigy
My observation is that the CPU usage isn't in readynasd but rather in the kworkers. Given how the md resync goes, that's likely coming from the btrfs scrub, and of course when I stopped the scrub the kworkers went quiet completely. But apart from the poor interaction between mdadm and btrfs/the kernel, there is the not-so-insignificant matter of the actual file services and applications. Indeed, I recall seeing lots of SQLite timeouts in the journal as well, suggesting that ROS/frontview was struggling just to run itself.
It's entirely possible for code to monitor the actual progress of both the resync and the scrub, so why these would be triggered simultaneously is baffling. I'm definitely disabling the scheduled scrub on all the boxes in my purview until this is addressed. What's concerning is that this isn't a new issue... I can recall seeing various complaints of poor scrub performance for many years (including a few of my own posts).
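Just to illustrate how trivial the sequencing would be, here's an untested sketch (md127 and /data as in my journal output above):

    #!/bin/sh
    # wait for the mdadm resync/check to go idle before touching btrfs
    while [ "$(cat /sys/block/md127/md/sync_action)" != "idle" ]; do
        sleep 60
    done
    # only now run the filesystem scrub, in the foreground
    btrfs scrub start -B /data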
I hesitate to log this as an idea, because it's more a bug than an enhancement. Do we have a vehicle to raise bugs? This definitely is one.