
Forum Discussion

btaroli
Prodigy
Jun 09, 2019

Performance during scrub

First off, yes, I’ve definitely commented on this in the past. But this time I have context I didn’t have before. In particular, in the past week or so I have replaced three drives (due to increasing remapped or pending blocks) and increased capacity in a fourth. So I’ve had the opportunity to use the system while a stripe was resyncing or reshaping (two drives went from 4TB to 8TB).

What I found was that during the drive swaps I experienced almost no performance issues, even with Plex, DVBLogic, or TVMosaic doing transcoding. But of course I wanted to run the final result through the paces of rebalance, defrag, and scrub.

What I have found is that the scrub is causing real performance headaches. CPUs are being monopolized by kernel threads, and this results in transcodes pausing so often as to be completely useless. It’s bad.

But when I look at /proc/mdstat it’s just a resync! So why is a resync triggered by a scrub resulting in such different performance to one triggered by drive replacements? I don’t get that.

I’m also wondering why a scrub isn’t a btrfs scrub, but that’s a bit of an aside...

20 Replies

Replies have been turned off for this discussion
  • StephenB
    Guru - Experienced User

    btaroli wrote:

    I’m also wondering why a scrub isn’t a btrfs scrub, but that’s a bit of an aside...

    I believe that the maintenance task does both an mdadm scrub and a btrfs scrub.

     


    btaroli wrote:
    So why is a resync triggered by a scrub resulting in such different performance to one triggered by drive replacements? I don’t get that.


    No explanation (and on my main NAS, scrubs don't disrupt our normal usage the way they do on your system).  Can you see the arguments being passed into mdadm?
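    One read-only way to see what the md layer was asked to do, without catching the mdadm invocation itself, is sysfs: each array exposes a `sync_action` file. This is a generic Linux sketch, not a ReadyNAS-specific tool; `md127` is the array name from the logs below, and device names may differ on your system.

    ```shell
    #!/bin/sh
    # Read-only inspection of what each md array is currently doing.
    # sync_action is one of: idle, check, repair, resync, recover.
    for f in /sys/block/md*/md/sync_action; do
        [ -e "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done

    # Progress, speed, and the resync state also show up here:
    cat /proc/mdstat 2>/dev/null || echo "no md arrays on this system"
    ```

    A "check" means a read-only parity scrub; "resync" means parity is actually being rewritten.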

    • btaroli
      Prodigy

      No, I don't see where the mdadm options were shown, but I do see this in the system journal.

       

      Jun 08 11:53:07 bigbird kernel: md: requested-resync of RAID array md127
      Jun 08 11:53:07 bigbird kernel: md: minimum _guaranteed_ speed: 30000 KB/sec/disk.
      Jun 08 11:53:07 bigbird kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
      Jun 08 11:53:07 bigbird kernel: md: using 128k window, over a total of 3902166784k.
      Jun 08 11:53:07 bigbird readynasd[5577]: scrub started on /data, fsid a00de7ae-c40a-4b0e-8eee-96bbebcf08ca (pid=10930)
      Jun 08 11:53:07 bigbird readynasd[5577]: Scrub started for volume data.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sda3 also saved as /var/backups/md/tmp/ata-ST4000DM000-1F2168_Z301Y1GN-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdb3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1ELCK8-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdd3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EGR7BDX-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sde3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EGS8SHX-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdf3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1EMQSK-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdg3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EK31A0X-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdh3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1ES9J9-part3.
      Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdi3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1EVFHK-part3.
      Jun 08 11:53:10 bigbird msmtpq[10967]: mail for [ -C /etc/msmtprc nasalerts@billsden.org --timeout=60 ] : send was successful
      Jun 08 11:53:14 bigbird mdadm[4919]: 10926 (process ID) old priority 0, new priority 19
      Jun 08 11:53:14 bigbird mdadm[4919]: Backing up RAID configs...
      Jun 08 11:53:14 bigbird mdadm[4919]: found 33 raid config backups
      Jun 08 11:53:14 bigbird mdadm[4919]: pruning raid config /var/backups/md/raid_config_data-0_2017_07_05_222327.tar.xz
      Jun 08 11:53:14 bigbird mdadm[4919]: RebuildStarted event detected on md device /dev/md127, component device resync

       

      I hadn't looked before but I DO see that there is a btrfs scrub running on /data simultaneous with mdadm's resync... which really doesn't seem like such a great idea, given that they are both heavy I/O tasks. I've canceled the btrfs scrub directly, and the machine is MUCH more responsive now. Perhaps I'll kick it off manually once the mdadm resync completes.

       

      I can already see the data rate of the resync has increased from the minimum up to 70MB/sec, and in the past I've seen it get upwards of 130MB/sec. Plex playback (of the same program I was struggling with last night) is completely solid. So now I just need to know whether these two tasks are expected to be running simultaneously and, if not, why the heck they are behaving that way for me.
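      For anyone wanting to do the same over SSH, the cancel-and-rerun-later step looks roughly like this with btrfs-progs. This is a sketch, not the GUI's own mechanism; it assumes root, btrfs-progs installed, and `/data` as the volume name (the name shown in the journal above).

      ```shell
      #!/bin/sh
      # Sketch: inspect and cancel a running btrfs scrub, as described above.
      # Assumes btrfs-progs and a mounted /data volume; otherwise it just reports.
      if command -v btrfs >/dev/null 2>&1 && [ -d /data ]; then
          btrfs scrub status /data          # progress/state of any running scrub
          btrfs scrub cancel /data || true  # stop it (fails harmlessly if none running)
          # After the mdadm resync goes idle, rerun it in the foreground:
          # btrfs scrub start -B /data
      else
          echo "btrfs or /data not available on this machine"
      fi
      ```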

      • Sandshark
        Sensei - Experienced User

        I have also observed this, and mentioned it here, to apparently deaf ears at Netgear.  It's most egregious on an EDA500 and pretty bad on a 100-series NAS.  Pretty clearly, the problem is that the scrub initiated by the GUI, whether scheduled or manual, triggers the BTRFS scrub and an MDADM re-sync at the same time.  If I command a scrub or re-sync individually via SSH, there is no issue.  I suspect the re-sync and scrub start having read/write collision issues and, eventually, that drives the readynasd process to 100% usage.  I can only guess here that readynasd also has a read/write collision but is not forgiving of it and just begins trying to assert access to the drives with more and more unsuccessful attempts.  Once that happens, the NAS is effectively offline.  I have only been able to determine this by having one or more SSH sessions already open before starting the scrub, as even SSH login becomes impossible at that point.

         

        The scrub then slows even more, so it takes a long time to clear up.  But if you have days to wait, it does eventually seem to work its way through it.

         

        The re-sync barely makes any progress while the scrub is ongoing, so I fail to see the point of starting them simultaneously.  The negative consequences, on the other hand, are abundantly clear.
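        The run-them-individually approach described above can be sketched as a sequential script: kick off the md parity check via sysfs, wait for it to go idle, then run the btrfs scrub. This is a hypothetical workaround sketch, not a supported ReadyNAS procedure; it assumes root, that `md127` is the data array and `/data` the volume (the names from the logs earlier in this thread), and that a "check" is an acceptable stand-in for the GUI's re-sync.

        ```shell
        #!/bin/sh
        # Sketch: run the two halves of the GUI "scrub" sequentially
        # instead of simultaneously. Adjust md127 / /data for your NAS.
        SYNC=/sys/block/md127/md/sync_action

        if [ -w "$SYNC" ] && [ -d /data ]; then
            echo check > "$SYNC"              # 1. mdadm read-only parity check
            while [ "$(cat "$SYNC")" != "idle" ]; do
                sleep 60                      # poll until the check finishes
            done
            btrfs scrub start -B /data        # 2. then the btrfs scrub, in the foreground
        else
            echo "md127 or /data not present; nothing to do"
        fi
        ```

        Running the two serially doubles the wall-clock time but keeps the disks from thrashing between two competing full-volume readers.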
