Sandshark
Sep 04, 2017, Sensei - Experienced User
EDA500 on RN516 - Scrub very slow
This problem has been previously reported by another user on an earlier version of the OS: ReadyNAS-516-2x-EDA500-Scrub-on-EDA500-very-slow. It persists in OS6.7.5. As scheduled, my main data vo...
kohdee
Oct 27, 2017, NETGEAR Expert
A scrub kicks off both a btrfs scrub and an mdadm resync. Five disks sharing a single eSATA connection while performing heavy recalculations on two fronts (RAID and filesystem) is very intensive, and very slow. You could move your EDA500 disks to your head unit and let the operations continue, then move them back.
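For reference, those two operations can be seen (or kicked off by hand) from SSH. This is only a rough sketch of what the GUI scrub corresponds to; the /data mount point is an assumption, and md123 is the EDA500 array device that shows up in the top output later in this thread:

btrfs scrub start /data                        # filesystem-level checksum scrub (assumed mount point)
echo check > /sys/block/md123/md/sync_action   # md-level parity check / resync-style pass (assumed device)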
Sandshark
Oct 27, 2017, Sensei - Experienced User
Does it do them concurrently? Maybe that's the issue, but 54 days still seems like a very long time. Why would running two processes take less CPU time than my running just one via SSH? It didn't take anywhere near that long for the original sync, so why should a resync take that long? I'll have to kick off another and see what /proc/mdstat says about resync progress while this is going on. Maybe the two processes are fighting over access to the same area of the array and that slows them both down, but would that not also occur on the main array?
I have to admit I did not let it complete, but I did let it go more than two days to see if it was just the reported progress that was wrong. Maybe it would have sped up at some point. After two days, I Googled how to find the progress via SSH and found that the progress shown in SSH was identical to that shown in the GUI. At that point I trusted that the progress report, and my resulting time-to-completion estimate, were accurate, and aborted it.
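(For anyone else wanting to check the same thing from SSH, the usual commands are roughly the following; /data as the mount point and md123 as the array device are assumptions here, so adjust to your own volume:)

btrfs scrub status /data   # bytes scrubbed so far and the current rate
cat /proc/mdstat           # per-array resync percentage, speed, and ETA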
As far as moving the array to the main chassis for this, that's just not a real solution. I keep everything I need daily access to on the main array and computer backups and such on the EDA500 (actually, now two of them).
- mdgm-ntgr, Oct 29, 2017, NETGEAR Employee Retired
I believe it is concurrent.
The initial sync can go faster because there is no existing data to sync across. If you replace a disk in your EDA500, you'll find the rebuild sync takes longer than the initial sync did when the volume was created.
The larger the disk capacity, the longer things will take, as there's more to check. That said, 54 days for a scrub does still seem like a very long time, even in an EDA500.
If moving the disks to the main chassis is not practical, you may find that additional main units work better for you than EDA500 units.
I would think a volume in any of our current main units would significantly outperform one in the EDA500.
- Sandshark, Oct 29, 2017, Sensei - Experienced User
OK, so this is what top looks like with a GUI-initiated scrub:
top - 16:48:32 up 7 days, 22:33,  1 user,  load average: 4.71, 1.55, 0.64
Tasks: 314 total,   1 running, 313 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.1 sy,  0.0 ni, 98.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  16297764 total, 15755896 used,   541868 free,    11572 buffers
KiB Swap:  1569788 total,        0 used,  1569788 free. 14749048 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2006 root      20   0       0      0      0 S   4.0  0.0   0:05.13 md123_raid5
15024 root      20   0       0      0      0 D   1.0  0.0   0:00.57 md123_resync
 4226 root      20   0    6344   1728   1600 S   0.3  0.0  10:42.07 wsdd2
 4452 root      20   0  661376  14428   9676 S   0.3  0.1   2:00.16 zerotier-one
    1 root      20   0  136976   7264   5144 S   0.0  0.0   0:09.02 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.18 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:12.67 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      20   0       0      0      0 S   0.0  0.0   2:40.28 rcu_sched
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S   0.0  0.0   0:02.24 migration/0
   10 root      rt   0       0      0      0 S   0.0  0.0   0:01.88 watchdog/0
   11 root      rt   0       0      0      0 S   0.0  0.0   0:01.86 watchdog/1
   12 root      rt   0       0      0      0 S   0.0  0.0   0:01.69 migration/1
   13 root      20   0       0      0      0 S   0.0  0.0   0:09.43 ksoftirqd/1
   15 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H
And here it is with a scrub initiated via SSH:
top - 16:58:13 up 4 min,  1 user,  load average: 1.90, 0.98, 0.42
Tasks: 316 total,   1 running, 315 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 13.8 sy,  0.0 ni, 86.1 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:  16297764 total,   983956 used, 15313808 free,    11252 buffers
KiB Swap:  1569788 total,        0 used,  1569788 free.   565484 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   71 root      20   0       0      0      0 S  11.0  0.0   0:07.33 kworker/u8:3
 1035 root      20   0       0      0      0 S  10.0  0.0   0:07.31 kworker/u8:7
 1056 root      20   0       0      0      0 S   9.6  0.0   0:07.77 kworker/u8:10
   28 root      20   0       0      0      0 S   9.0  0.0   0:07.83 kworker/u8:1
   43 root      20   0       0      0      0 S   7.3  0.0   0:06.73 kworker/u8:2
 1054 root      20   0       0      0      0 S   6.7  0.0   0:05.52 kworker/u8:9
 5609 root      20   0   32168    204     16 S   3.0  0.0   0:03.04 btrfs
 1777 root       0 -20       0      0      0 S   1.3  0.0   0:01.54 kworker/2:1H
 4219 root      20   0    6344   1764   1628 S   0.7  0.0   0:00.90 wsdd2
 1745 root       0 -20       0      0      0 S   0.3  0.0   0:00.24 kworker/0:1H
 5673 root      20   0   28892   3068   2424 R   0.3  0.0   0:00.14 top
    1 root      20   0  136976   7136   5100 S   0.0  0.0   0:01.51 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ksoftirqd/0
    4 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    6 root      20   0       0      0      0 S   0.0  0.0   0:00.23 kworker/u8:0
Here is what it looks like if I start a scrub via the GUI and cancel it (but not the resync) via SSH:
top - 17:02:18 up 8 min,  1 user,  load average: 2.73, 1.85, 0.90
Tasks: 305 total,   1 running, 304 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  16297764 total,  1024016 used, 15273748 free,    11252 buffers
KiB Swap:  1569788 total,        0 used,  1569788 free.   579124 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2000 root      20   0       0      0      0 S   4.0  0.0   0:05.24 md123_raid5
 4219 root      20   0    6344   1764   1628 S   0.7  0.0   0:01.72 wsdd2
 6623 root      20   0       0      0      0 D   0.7  0.0   0:01.13 md123_resync
    7 root      20   0       0      0      0 S   0.3  0.0   0:00.20 rcu_sched
 4680 nut       20   0   17260   1508   1112 S   0.3  0.0   0:00.77 usbhid-ups
    1 root      20   0  136976   7136   5100 S   0.0  0.0   0:01.54 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ksoftirqd/0
    4 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 migration/0
   10 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/0
   11 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 watchdog/1
   12 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 migration/1
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ksoftirqd/1
So, yes, there is a resync in progress when the scrub is initiated via the GUI that is not there when I do it via SSH. When I cancel the scrub via SSH, very little changes. If I resume the scrub with the resync still ongoing, it looks the same as if I had never cancelled it. If I initiate just the scrub via SSH, all of those kworker tasks are busy doing the scrub, whereas they don't even appear in the top ten processes when the resync is also running. Clearly, something about having an ongoing resync is seriously affecting the scrub on the EDA500. It's not CPU availability -- the resync takes little CPU. So, it must be the I/O channel. My best guess is that the resync process is keeping the eSATA port multiplier "locked" to one drive, so the scrub process cannot access any of the others.
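(The cancel/resume experiment above was done with the standard btrfs and md interfaces; a rough sketch, again assuming /data as the mount point and md123 as the array device:)

btrfs scrub cancel /data                      # stop the filesystem scrub; the md resync keeps going
btrfs scrub resume /data                      # pick the scrub back up where it left off
echo idle > /sys/block/md123/md/sync_action   # or stop the md resync/check instead, if wanted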
BTW, here is what cat /proc/mdstat reports on the sync:
md123 : active raid5 sdm3[0] sdq3[4] sdp3[3] sdo3[5] sdn3[1]
      7794659328 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  resync =  3.5% (69154404/1948664832) finish=1209.6min speed=25895K/sec
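(That finish estimate is consistent with the other numbers mdstat reports -- a quick back-of-the-envelope check:)

1,948,664,832 - 69,154,404 = 1,879,510,428 KiB left to sync
1,879,510,428 KiB / 25,895 KiB/s ≈ 72,582 s ≈ 1,210 min, matching the reported finish=1209.6min (about 20 hours)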
So the re-sync in and of itself is also not the issue; it will complete within a reasonable time (this is for an array half the size of the other, but even double this is reasonable).
I don't know the solution -- maybe doing the processes sequentially instead of concurrently. But it is definitely an unacceptable situation that needs attention. Any excuse that "a second NAS is a better solution" is just that -- an excuse. Netgear sold the product, and the OS should play well with it. I could accept a task taking 3x or maybe even 4x longer. 25x or more is just insane.
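(If anyone wants to serialize the two operations by hand in the meantime, something along these lines should work from SSH -- a sketch only, with /data and md123 again assumed rather than taken from the GUI:)

mdadm --wait /dev/md123    # block until any resync/check on the array has finished
btrfs scrub start /data    # then run the filesystem scrub on its own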
- Sandshark, Oct 30, 2017, Sensei - Experienced User
OK, so I just let it keep going to see if the scrub process would speed up after the sync completes. Looking in about an hour before the resync was predicted to complete, I found something else had happened. Readynasd was now taking almost all of the available CPU time (it actually claimed 100.1%). It had a lower priority than the resync, so the resync time did not seem to be affected, and the resync did complete at the time that had been predicted before readynasd started misbehaving. The GUI was unavailable, but SMB access still worked. I did not test whether access speed was affected. There was a scheduled rsync backup of the shares in the main volume (a pull from another NAS) which seems to have taken the normal amount of time. This could be something independent of the scrub, since I don't know when it started, but it seems suspect since I have done nothing special during that period. Here is top at that point:
top - 13:58:44 up 21:05,  1 user,  load average: 6.80, 6.60, 6.49
Tasks: 502 total,   2 running, 500 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.7 us, 23.7 sy,  0.0 ni, 48.2 id, 24.2 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem:  16297764 total, 15686920 used,   610844 free,     4588 buffers
KiB Swap:  1569788 total,        0 used,  1569788 free. 14518584 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4446 root      19  -1 21.687g  93096  23968 R 100.1  0.6 760:36.66 readynasd
 2000 root      20   0       0      0      0 D   5.3  0.0  57:42.57 md123_raid5
 6623 root      39  19       0      0      0 D   1.3  0.0  18:58.47 md123_resync
    7 root      20   0       0      0      0 S   0.3  0.0   2:08.55 rcu_sched
 4219 root      20   0    6344   1764   1628 S   0.3  0.0   3:49.45 wsdd2
27931 root      20   0   29024   3336   2544 R   0.3  0.0   0:00.03 top
    1 root      20   0  136976   7136   5100 S   0.0  0.0   0:02.29 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.03 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:01.15 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S   0.0  0.0   0:00.16 migration/0
   10 root      rt   0       0      0      0 S   0.0  0.0   0:00.18 watchdog/0
   11 root      rt   0       0      0      0 S   0.0  0.0   0:00.25 watchdog/1
   12 root      rt   0       0      0      0 S   0.0  0.0   0:00.15 migration/1
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.75 ksoftirqd/1
   15 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H
Now that the resync is complete, all the kworker tasks are busy with the scrub, and the completion percentage has started to rise at a much faster pace. Readynasd has also started behaving itself and the GUI is available again, so that pretty much shows that its CPU use was a result of the resync process. My best guess is that it was in a tight loop trying to do something that the resync process kept it from doing (probably related to the eSATA port multiplier) -- something that should probably have a time-out (a quick way to check that next time is sketched after the output below). Here is top now:
top - 15:19:47 up 22:26,  1 user,  load average: 1.80, 2.89, 4.98
Tasks: 310 total,   2 running, 308 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 17.7 sy,  0.0 ni, 82.2 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:  16297764 total, 15566324 used,   731440 free,     4588 buffers
KiB Swap:  1569788 total,        0 used,  1569788 free. 14627468 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
31692 root      20   0       0      0      0 S  13.3  0.0   0:40.69 kworker/u8:0
29540 root      20   0       0      0      0 S  11.6  0.0   0:43.98 kworker/u8:2
32539 root      20   0       0      0      0 S  11.0  0.0   0:45.62 kworker/u8:7
31857 root      20   0       0      0      0 S  10.6  0.0   0:43.23 kworker/u8:6
  563 root      20   0       0      0      0 R  10.3  0.0   0:38.67 kworker/u8:8
31853 root      20   0       0      0      0 S  10.0  0.0   0:48.57 kworker/u8:5
 4446 root      19  -1 2460276  76200  23968 S   3.0  0.5 833:53.16 readynasd
 6888 root      20   0   32168    204     16 S   3.0  0.0   0:17.79 btrfs
 1777 root       0 -20       0      0      0 S   1.3  0.0   0:38.08 kworker/2:1H
 4219 root      20   0    6344   1764   1628 S   0.3  0.0   4:04.43 wsdd2
 4312 root      20   0  227556   7080   5240 S   0.3  0.0   0:36.02 nmbd
31548 root      20   0   29024   3260   2484 R   0.3  0.0   0:01.34 top
    1 root      20   0  136976   7136   5100 S   0.0  0.0   0:02.37 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.03 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:01.21 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      20   0       0      0      0 S   0.0  0.0   2:20.73 rcu_sched
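(On that tight-loop guess: if readynasd pegs a core again, a rough way to tell whether it is actually spinning or blocked on I/O is to look at its state in /proc. Nothing ReadyNAS-specific here, just standard Linux procfs, so treat it as a sketch:)

PID=$(pidof readynasd)
grep -i '^State' /proc/$PID/status   # R = running/spinning, D = blocked on I/O
cat /proc/$PID/wchan; echo           # kernel function it is waiting in, if it is sleeping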
So, it appears to me that running the resync concurrently with a scrub causes the scrub to be effectively held off until the resync completes. The multiplexing effect of the eSATA port multiplier is the factor that is likely different from the main volume. The result is an initially reported very slow scrub completion rate, which is going to make the average user think the scrub will take an eternity. So is doing a resync concurrent with the scrub (or, in practice, doing one first) really the best thing to be doing (at least on the EDA500)?
And what about the process locking out the GUI (it comes up as "unavailable")? I can see it being slow, but locking it up keeps one from seeing the scrub is still progressing (except via SSH, which your average user will not use) and may cause the user to think the NAS needs rebooting. There are several snapshots that took place immediately after the resync finished that were likely held off, but that seems reasonable. A scheduled balance on the main volume was also attempted at that point, but failed to start, presumably because of the scrub on the other volume.