
Forum Discussion

btaroli
Prodigy
Jun 09, 2019

Performance during scrub

First off, yes, I’ve definitely commented on this in the past. But this time I have context I didn’t have before. In particular, in the past week or so I have replaced three drives (due to increasing remapped or pending blocks) and increased capacity in a fourth. So I’ve had the opportunity to use the system while a stripe was resyncing or reshaping (two drives went from 4TB to 8TB).

What I found was that during the drive swaps I experienced almost no performance issues, even with Plex, DVBLogic, or TVMosaic doing transcoding. But of course I wanted to run the final result through the paces of rebalance, defrag, and scrub.

What I have found is that the scrub is causing real performance headaches. CPUs are being monopolized by kernel threads, and this results in transcodes pausing so often as to be completely useless. It’s bad.

But when I look at /proc/mdstat it’s just a resync! So why is a resync triggered by a scrub resulting in such different performance to one triggered by drive replacements? I don’t get that.

I’m also wondering why a scrub isn’t a btrfs scrub, but that’s a bit of an aside...

20 Replies

Replies have been turned off for this discussion
  • StephenB
    Guru - Experienced User

    btaroli wrote:

    I’m also wondering why a scrub isn’t a btrfs scrub, but that’s a bit of an aside...

    I believe that the maintenance task does both an mdadm scrub and a btrfs scrub.

     


    btaroli wrote:
    So why is a resync triggered by a scrub resulting in such different performance to one triggered by drive replacements? I don’t get that.


    No explanation (and on my main NAS, scrubs don't disrupt our normal usage the way they do on your system).  Can you see the arguments being passed into mdadm?
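    One read-only way to see what the md layer was asked to do, without catching the mdadm invocation itself, is sysfs: each array exposes a `sync_action` file. This is a generic Linux sketch, not a ReadyNAS-specific tool; `md127` is the array name from the logs below, and device names may differ on your system.

    ```shell
    #!/bin/sh
    # Read-only inspection of what each md array is currently doing.
    # sync_action is one of: idle, check, repair, resync, recover.
    for f in /sys/block/md*/md/sync_action; do
        [ -e "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done

    # Progress, speed, and the resync state also show up here:
    cat /proc/mdstat 2>/dev/null || echo "no md arrays on this system"
    ```

    A "check" means a read-only parity scrub; "resync" means parity is actually being rewritten.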

    • btaroli
      Prodigy

      No, I don't see where the mdadm options were shown, but I do see this in the system journal.

       

      Jun 08 11:53:07 bigbird kernel: md: requested-resync of RAID array md127
      Jun 08 11:53:07 bigbird kernel: md: minimum _guaranteed_ speed: 30000 KB/sec/disk.
      Jun 08 11:53:07 bigbird kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
      Jun 08 11:53:07 bigbird kernel: md: using 128k window, over a total of 3902166784k.
      Jun 08 11:53:07 bigbird readynasd[5577]: scrub started on /data, fsid a00de7ae-c40a-4b0e-8eee-96bbebcf08ca (pid=10930)
      Jun 08 11:53:07 bigbird readynasd[5577]: Scrub started for volume data.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sda3 also saved as /var/backups/md/tmp/ata-ST4000DM000-1F2168_Z301Y1GN-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdb3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1ELCK8-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdd3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EGR7BDX-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sde3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EGS8SHX-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdf3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1EMQSK-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdg3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EK31A0X-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdh3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1ES9J9-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdi3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1EVFHK-part3.
      Jun 08 11:53:10 bigbird msmtpq[10967]: mail for [ -C /etc/msmtprc nasalerts@billsden.org --timeout=60 ] : send was successful
      Jun 08 11:53:14 bigbird mdadm[4919]: 10926 (process ID) old priority 0, new priority 19
      Jun 08 11:53:14 bigbird mdadm[4919]: Backing up RAID configs...
      Jun 08 11:53:14 bigbird mdadm[4919]: found 33 raid config backups
      Jun 08 11:53:14 bigbird mdadm[4919]: pruning raid config /var/backups/md/raid_config_data-0_2017_07_05_222327.tar.xz
      Jun 08 11:53:14 bigbird mdadm[4919]: RebuildStarted event detected on md device /dev/md127, component device resync

       

      I hadn't looked before but I DO see that there is a btrfs scrub running on /data simultaneous with mdadm's resync... which really doesn't seem like such a great idea, given that they are both heavy I/O tasks. I've canceled the btrfs scrub directly, and the machine is MUCH more responsive now. Perhaps I'll kick it off manually once the mdadm resync completes.

       

      I can already see the data rate of the resync has increased from the minimum up to 70MB/sec, and in the past I've seen it get upwards of 130MB/sec. Plex playback (of the same program I was struggling with last night) is completely solid. So now I just need to know whether these two tasks are expected to be running simultaneously and, if not, why the heck they are behaving that way for me.
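      For anyone wanting to do the same over SSH, the cancel-and-rerun-later step looks roughly like this with btrfs-progs. This is a sketch, not the GUI's own mechanism; it assumes root, btrfs-progs installed, and `/data` as the volume name (the name shown in the journal above).

      ```shell
      #!/bin/sh
      # Sketch: inspect and cancel a running btrfs scrub, as described above.
      # Assumes btrfs-progs and a mounted /data volume; otherwise it just reports.
      if command -v btrfs >/dev/null 2>&1 && [ -d /data ]; then
          btrfs scrub status /data          # progress/state of any running scrub
          btrfs scrub cancel /data || true  # stop it (fails harmlessly if none running)
          # After the mdadm resync goes idle, rerun it in the foreground:
          # btrfs scrub start -B /data
      else
          echo "btrfs or /data not available on this machine"
      fi
      ```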

      • Sandshark
        Sensei - Experienced User

        I have also observed this, and mentioned it here, to apparently deaf ears at Netgear.  It's most egregious on an EDA500 and pretty bad on a 100-series NAS.  Pretty clearly, the problem is that the scrub initiated by the GUI, whether scheduled or manual, triggers the BTRFS scrub and an MDADM re-sync at the same time.  If I command a scrub or re-sync individually via SSH, there is no issue.  I suspect the re-sync and scrub start having read/write collision issues and, eventually, that drives the readynasd process to 100% usage.  I can only guess here that readynasd also has a read/write collision but is not forgiving of it and just begins trying to assert access to the drives with more and more unsuccessful attempts.  Once that happens, the NAS is effectively offline.  I have only been able to determine this by having one or more SSH sessions already open before starting the scrub, as even SSH login becomes impossible at that point.

         

        The scrub then slows even more, so it takes a long time to clear up.  But if you have days to wait, it does eventually seem to work its way through it.

         

        The re-sync barely makes any progress while the scrub is ongoing, so I fail to see the point of starting them simultaneously.  The negative consequences, on the other hand, are abundantly clear.
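        The run-them-individually approach described above can be sketched as a sequential script: kick off the md parity check via sysfs, wait for it to go idle, then run the btrfs scrub. This is a hypothetical workaround sketch, not a supported ReadyNAS procedure; it assumes root, that `md127` is the data array and `/data` the volume (the names from the logs earlier in this thread), and that a "check" is an acceptable stand-in for the GUI's re-sync.

        ```shell
        #!/bin/sh
        # Sketch: run the two halves of the GUI "scrub" sequentially
        # instead of simultaneously. Adjust md127 / /data for your NAS.
        SYNC=/sys/block/md127/md/sync_action

        if [ -w "$SYNC" ] && [ -d /data ]; then
            echo check > "$SYNC"              # 1. mdadm read-only parity check
            while [ "$(cat "$SYNC")" != "idle" ]; do
                sleep 60                      # poll until the check finishes
            done
            btrfs scrub start -B /data        # 2. then the btrfs scrub, in the foreground
        else
            echo "md127 or /data not present; nothing to do"
        fi
        ```

        Running the two serially doubles the wall-clock time but keeps the disks from thrashing between two competing full-volume readers.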
