Forum Discussion
btaroli
Jun 09, 2019Prodigy
Performance during scrub
First off, yes I’ve definitely commented on this in the past. But this time I have context I didn’t have before. In particular, in the past week or so I have replaced three drives (due to increasing r...
StephenB
Jun 09, 2019Guru - Experienced User
btaroli wrote:
I’m also wondering why a scrub isn’t a btrfs scrub, but that’s a bit of an aside...
I believe that the maintenance task does both an mdadm scrub and a btrfs scrub.
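If you want to confirm that yourself over SSH, a quick check shows both passes (md127 and /data are the usual ReadyNAS OS6 data array and volume names, and match the journal output further down; adjust if yours differ):

    # mdadm-level pass: shows a progress bar and current speed while a resync/check runs
    cat /proc/mdstat
    # btrfs-level pass: shows bytes scrubbed and any errors found so far
    btrfs scrub status /data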
btaroli wrote:
So why is a resync triggered by a scrub resulting in such different performance to one triggered by drive replacements? I don’t get that.
No explanation (and on my main NAS, scrubs don't disrupt our normal usage the way they do on your system). Can you see the arguments being passed into mdadm?
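Something along these lines over SSH might show what was actually kicked off, if anything is visible from userspace (md127 assumed as the data array; the resync may also have been triggered through sysfs rather than an mdadm command line):

    # command lines of any md/mdadm-related processes
    ps -eo pid,args | grep -E '[m]dadm|[m]d127'
    # which kind of md pass is in flight: check, resync, repair, or idle
    cat /sys/block/md127/md/sync_action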
- btaroliJun 09, 2019Prodigy
No, I don't see where the mdadm options were shown, but I do see this in the system journal.
Jun 08 11:53:07 bigbird kernel: md: requested-resync of RAID array md127
Jun 08 11:53:07 bigbird kernel: md: minimum _guaranteed_ speed: 30000 KB/sec/disk.
Jun 08 11:53:07 bigbird kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
Jun 08 11:53:07 bigbird kernel: md: using 128k window, over a total of 3902166784k.
Jun 08 11:53:07 bigbird readynasd[5577]: scrub started on /data, fsid a00de7ae-c40a-4b0e-8eee-96bbebcf08ca (pid=10930)
Jun 08 11:53:07 bigbird readynasd[5577]: Scrub started for volume data.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sda3 also saved as /var/backups/md/tmp/ata-ST4000DM000-1F2168_Z301Y1GN-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdb3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1ELCK8-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdd3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EGR7BDX-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sde3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EGS8SHX-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdf3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1EMQSK-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdg3 also saved as /var/backups/md/tmp/ata-HGST_HUH728080ALE604_2EK31A0X-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdh3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1ES9J9-part3.
Jun 08 11:53:08 bigbird mdadm[4919]: /dev/sdi3 also saved as /var/backups/md/tmp/ata-ST8000VN0022-2EL112_ZA1EVFHK-part3.
Jun 08 11:53:10 bigbird msmtpq[10967]: mail for [ -C /etc/msmtprc nasalerts@billsden.org --timeout=60 ] : send was successful
Jun 08 11:53:14 bigbird mdadm[4919]: 10926 (process ID) old priority 0, new priority 19
Jun 08 11:53:14 bigbird mdadm[4919]: Backing up RAID configs...
Jun 08 11:53:14 bigbird mdadm[4919]: found 33 raid config backups
Jun 08 11:53:14 bigbird mdadm[4919]: pruning raid config /var/backups/md/raid_config_data-0_2017_07_05_222327.tar.xz
Jun 08 11:53:14 bigbird mdadm[4919]: RebuildStarted event detected on md device /dev/md127, component device resync
I hadn't looked before but I DO see that there is a btrfs scrub running on /data simultaneous with mdadm's resync... which really doesn't seem like such a great idea, given that they are both heavy I/O tasks. I've canceled the btrfs scrub directly, and the machine is MUCH more responsive now. Perhaps I'll kick it off manually once the mdadm resync completes.
I can already see the data rate of the resync has increased from the minimum up to 70MB/sec, and in the past I've seen it get upwards of 130MB/sec. Plex playback (of the same program I was struggling with last night) is completely solid. So now I just need to know whether these two tasks are expected to run simultaneously and, if not, why the heck they are behaving that way for me.
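For anyone following along, the commands in question were roughly these (/data and md127 as in the journal snippet above):

    # stop the in-flight filesystem scrub; it can be picked up again later
    # with 'btrfs scrub resume /data' or restarted with 'btrfs scrub start /data'
    btrfs scrub cancel /data
    # watch the resync speed recover in the md progress line
    watch -n 5 cat /proc/mdstat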
- SandsharkJun 09, 2019Sensei - Experienced User
I have also observed this, and mentioned it here, apparently to deaf ears at Netgear. It's most egregious on an EDA500 and pretty bad on a 100-series NAS. Pretty clearly, the problem is that the scrub initiated by the GUI, whether scheduled or manual, triggers the BTRFS scrub and an MDADM re-sync at the same time. If I command a scrub or re-sync individually via SSH (see the sketch at the end of this post), there is no issue. I suspect the re-sync and scrub start having read/write collision issues and, eventually, that drives the readynasd process to 100% usage. I can only guess here that readynasd also hits a read/write collision but is not forgiving of it and just begins trying to assert access to the drives with more and more unsuccessful attempts. Once that happens, the NAS is effectively offline. I have only been able to determine this by having one or more SSH sessions already open before starting the scrub, as even SSH login becomes impossible at that point.
The scrub then slows even more, so it takes a long time to clear up. But if you have days to wait, it does eventually seem to work its way through it.
The re-sync barely makes any progress while the scrub is ongoing, so I fail to see the point of starting them simultaneously. The negative consequences, on the other hand, are abundantly clear.
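To spell out what I mean by commanding them individually (md127 and /data assumed, as elsewhere in this thread; substitute your own array and volume):

    # read-only parity check of the md array; watch progress in /proc/mdstat
    echo check > /sys/block/md127/md/sync_action
    # then, once /proc/mdstat shows the check has finished, run the
    # filesystem-level scrub on its own
    btrfs scrub start /data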
- btaroliJun 09, 2019Prodigy
My observation is that the CPU usage isn't in readynasd but rather in the kworkers. Given how the md resync goes, that's likely coming from the btrfs scrub, and of course when I stopped the scrub the kworkers went quiet completely. But apart from the poor interaction between mdadm and btrfs/the kernel, there is the not-so-insignificant matter of the actual file services and applications. Indeed, I recall seeing lots of SQLite timeouts in the journal as well, suggesting that ROS/frontview was struggling just to run itself.
It's entirely possible for code to monitor the actual progress of both the resync and the scrub, so why these would be triggered simultaneously is baffling. I'm definitely disabling the scheduled scrub on all the boxes in my purview until this is addressed. What's concerning is that this isn't a new issue... I can recall seeing various complaints of poor scrub performance for many years (including a few of my own posts).
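Just to illustrate how trivial the sequencing would be, here's an untested sketch (md127 and /data as in my journal output above):

    #!/bin/sh
    # wait for the mdadm resync/check to go idle before touching btrfs
    while [ "$(cat /sys/block/md127/md/sync_action)" != "idle" ]; do
        sleep 60
    done
    # only now run the filesystem scrub, in the foreground
    btrfs scrub start -B /data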
I hesitate to log this as an idea, because it's more a bug than an enhancement. Do we have a vehicle to raise bugs? This definitely is one.