Forum Discussion
btaroli
Jan 15, 2017 - Prodigy
6.6.1 Scrub Hammers CPU
And when I say hammers, I mean it runs at nice -1 and spawns 6-8 kernel worker threads, each vying to consume 100% of a CPU core. So rabid is this consumption that all other background processes, including one's third-party apps and Time Machine backups, just cease to function.
So with all the attention to being a good neighbor during resyncs and whatnot, why is scrub such a terrible neighbor? I'd love to enable it to run every month or quarter, but I can't stand to have my NAS more or less inoperable for my needs while it's running.
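For anyone who wants to reproduce the observation, here is a minimal sketch of how to snapshot the priorities involved (the grep pattern just matches the process names that show up in the top output further down):
ps -eo pid,ni,pri,stat,comm | grep -E 'btrfs|kworker|md'   # nice/priority of the scrub and the kworkers
top -b -n 1 | head -15                                     # one batch-mode snapshot of the biggest CPU consumers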
36 Replies
- mdgm-ntgrNETGEAR Employee Retired
Which model is this on?
Can you send in your logs (see the Sending Logs link in my sig)?
- btaroliProdigy
Sure, I'd be glad to. This is on a 528.
Just for the sake of completeness, here's top with just background running...
top - 03:34:19 up 3 days, 49 min,  2 users,  load average: 0.01, 0.03, 0.06
Tasks: 241 total,   2 running, 239 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.7 sy,  0.0 ni, 98.8 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  16303964 total, 15086028 used,  1217936 free,     2468 buffers
KiB Swap:  3139580 total,        0 used,  3139580 free. 13522476 cached Mem

  PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
11971 root      20   0 3584900 356660 27084 S   4.0  2.2 182:46.02 /apps/dvblink-tv-server/dvblink_server
10318 root      20   0  233308  31464  5660 S   1.0  0.2  40:27.23 /usr/bin/python /apps/dropboxmanager/web/manage.py run+
 5138 root      20   0  992516  12028  8796 S   0.3  0.1  18:19.58 /opt/p2p/bin/leafp2p -n
 5419 root      20   0   28788   3060  2468 R   0.3  0.0   7:33.54 top
22590 root      20   0       0      0     0 S   0.3  0.0   0:00.30 [kworker/2:6]
    1 root      20   0  202460   6504  4516 S   0.0  0.0   0:38.41 /sbin/init
    2 root      20   0       0      0     0 S   0.0  0.0   0:00.10 [kthreadd]
    3 root      20   0       0      0     0 S   0.0  0.0   0:06.33 [ksoftirqd/0]
    5 root       0 -20       0      0     0 S   0.0  0.0   0:00.00 [kworker/0:0H]
    7 root      20   0       0      0     0 R   0.0  0.0   0:57.85 [rcu_sched]
    8 root      20   0       0      0     0 S   0.0  0.0   0:00.00 [rcu_bh]
    9 root      rt   0       0      0     0 S   0.0  0.0   0:00.21 [migration/0]
   10 root      rt   0       0      0     0 S   0.0  0.0   0:00.83 [watchdog/0]
   11 root      rt   0       0      0     0 S   0.0  0.0   0:00.85 [watchdog/1]
   12 root      rt   0       0      0     0 S   0.0  0.0   0:00.24 [migration/1]
   13 root      20   0       0      0     0 S   0.0  0.0   0:03.26 [ksoftirqd/1]
   15 root       0 -20       0      0     0 S   0.0  0.0   0:00.00 [kworker/1:0H]
And here's what it looks like shortly after kicking off a scrub.
top - 03:39:04 up 3 days, 53 min,  2 users,  load average: 2.67, 0.65, 0.26
Tasks: 249 total,   7 running, 242 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us, 95.1 sy,  0.0 ni,  3.5 id,  0.0 wa,  0.0 hi,  0.9 si,  0.0 st
KiB Mem:  16303964 total, 15111680 used,  1192284 free,     2468 buffers
KiB Swap:  3139580 total,        0 used,  3139580 free. 13537484 cached Mem

  PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
14083 root      20   0       0      0     0 R  69.1  0.0   0:10.26 [kworker/u8:2]
18741 root      20   0       0      0     0 R  59.1  0.0   0:10.26 [kworker/u8:0]
22107 root      20   0       0      0     0 R  55.5  0.0   0:08.28 [kworker/u8:1]
23262 root      20   0       0      0     0 R  55.1  0.0   0:08.74 [kworker/u8:9]
23253 root      20   0       0      0     0 R  51.5  0.0   0:05.02 [kworker/u8:7]
 9426 root      20   0       0      0     0 R  50.1  0.0   0:09.49 [kworker/u8:5]
 2455 root      20   0       0      0     0 S  18.9  0.0   1:37.18 [md126_raid6]
23229 root      19  -1   40340    212    12 S  10.0  0.0   0:02.30 btrfs scrub start /data
18976 root      20   0       0      0     0 S   4.3  0.0   0:02.26 [kworker/u8:8]
11971 root      20   0 3584900 356660 27084 S   3.7  2.2 182:56.16 /apps/dvblink-tv-server/dvblink_server
 5335 root      19  -1 1589984  59196 12228 S   2.0  0.4   5:41.17 /usr/sbin/readynasd -v 3 -t
10318 root      20   0  233308  31464  5660 S   1.7  0.2  40:29.84 /usr/bin/python /apps/dropboxmanager/web/manage.py run+
23226 root      39  19       0      0     0 D   1.3  0.0   0:00.35 [md126_resync]
 2335 root       0 -20       0      0     0 S   0.7  0.0   0:01.65 [kworker/1:1H]
 5138 root      20   0  992516  12028  8796 S   0.7  0.1  18:20.82 /opt/p2p/bin/leafp2p -n
22590 root      20   0       0      0     0 S   0.7  0.0   0:01.27 [kworker/2:6]
 2340 root       0 -20       0      0     0 S   0.3  0.0   0:27.69 [kworker/0:1H]
- FramerVNETGEAR Employee Retired
Hi btaroli,
Have you sent your logs already?
I will send an inquiry to our subject matter expert about it if you have.
Regards,
- btaroliProdigy
I haven't had the AV service enabled for years now. I was hoping to do so after it was replaced recently, but the false-positive issue reversed that.
I gather 6.7.1 fixes that, but baby steps I shall take in re-enabling functions. :D
- ctechsApprentice
We experience the "bad neighbor" scrub behavior too - I don't dare schedule a scrub during business hours: the responsiveness of the ReadyNAS over SMB suffers pretty severely. CPU load average 2.5-3 during a scrub on a ReadyNAS 516. OS 6.7.1, have never used the antivirus feature.
Can this be tamed down?
- LaserbaitLuminary
Exactly! It'd be great to be able to throttle the scrub somehow. Maybe a selectable priority, or limit it to a single thread/core.
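For what it's worth, btrfs scrub does take I/O-priority flags, though they address disk contention rather than the CPU side; a sketch using the /data volume from the top output above (per the btrfs-scrub man page, the ioprio settings are reportedly only effective with the CFQ I/O scheduler):
btrfs scrub start -c 3 /data    # -c 3 = idle I/O class for the scrubbing thread
btrfs scrub status /data        # check progress without restarting anything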
- btaroliProdigy
Well, this is confirmed to still be happening in 6.7.1. I observe that overall I/O and wait time seem OK. Indeed, the journal entries that pop up as the job starts suggest it's throttling on I/O rate. However, the kernel worker processes still monopolize the CPU cores/threads. A certain amount of application CPU usage seems OK, but if you have anything like PLEX transcodes or AFP-based Time Machine, which cause a fair amount of CPU activity themselves, then these other processes get starved to the point of being almost unusable.
Some of this is just Btrfs behavior, which I can compare to similar operations I do on even newer kernels on other Linux machines. But when you have a server environment where there is an expectation of responsiveness from applications, it can be problematic. On this front, the only thing I would consider a standard OS issue is Time Machine backups. These are CPU intensive and will be seriously delayed if not fail outright -- based on previous painful experiences. In this run I'm not allowing TM to even trigger until the scrub finishes. I know for a fact I'll wind up having to trash my whole backup archive if I let it try and fail.
As for PLEX, I can work around the issue by enabling direct play and disabling transcode in the client config. But hopefully we get to a point where scrubs will gracefully butt out when other activity requires CPU attention.
- btaroliProdigy
Did you verify the scrub is still running? How much data is in your volume? If memory serves, it only scrubs data, not free space. I'll have to check the action of -A as well, because it seems you picked up kernel threads, and I wonder if it includes user-space stuff too? Not familiar with that option, and can't check until my toddler is down for his nap.
- LaserbaitLuminary
Yes, scrub is running, currently 4.85% completed.
- btaroliProdigy
https://manpages.debian.org/stretch/procps/ps.1.en.html
-A is a synonym for -e, selecting all processes
What you seem to be missing is an option for output format. I generally use -f, but there are others.
If you only see the name of the program being run, then ALL Btrfs commands show up as "btrfs". You need to see the command arguments in order to know which function is being called, such as scrub.
So try "ps -Af"
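Something like this will pick out just the scrub along with its arguments (the [b] in the pattern is a common trick to keep grep from matching its own process):
ps -Af | grep '[b]trfs'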
- btaroliProdigy
In the case of PLEX and DVBLink it's definitely the transcoding that was most affected. Recordings and other operations seemed OK.
Time Machine (really the AFP file service) would get so starved for cycles that my backup would corrupt and get flagged to be replaced (as TM is wont to do) whenever it attempted to run during a scrub.
I fully appreciate the description of the scrub processes you present here, but I can tell you that until I forced them into idle with ionice I was having exactly the problems described. And once I had done so, the behaviors stopped.
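What I ran was along these lines (a sketch; -c 3 is the idle scheduling class, and pgrep -f matches against the full command line):
for pid in $(pgrep -f 'btrfs scrub'); do
    ionice -c 3 -p "$pid"    # push the scrub into the idle I/O class
done
ionice -p $(pgrep -f 'btrfs scrub' | head -1)    # verify: should now print "idle"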
I was rather hoping to let the scrub finish, since this volume has never been scrubbed fully, and it's 5 days in now at about 55%. But I'm happy to restart the process to compare the ioprio on the threads as reported and detail/verify the outcome of the tweaks. I've been through a few iterations during this run, but I am in a state now where things are humming along nicely, even if the scrub is taking a while -- which is fine by me.
The effect on the other processes (whether it's CPU, IO, or a combination of both) is real and extremely annoying, however. So something does need adjustment here.
I did observe on Fedora that btrfs scrub behaves as described on the btrfs wiki and in your description. The base process is in ioprio B4, as is one of the threads; the thread doing the effective I/O is in idle. Of course, on this system effective CPU utilization is under 40% (four threads) and iowait is almost nil. Debian is of course quite different in how kernel threads operate, but this is what I see on Fedora.
I'll restart the scrub on the NAS later tonight and see what I see there to compare. Not totally apples to apples, but I think it's odd that btrfs only stopped getting in the way of other services once it was ionice'd directly.
I'll report back once I've had a chance to observe it again without manual adjustments.
- SkywalkerNETGEAR Expert
It's worth noting that the resources consumed by btrfs scrub can vary widely depending on the files it's scrubbing at the time. Very fragmented files will result in a lot more I/O wait time and more CPU consumption. Also, a ReadyNAS volume scrub includes an MD RAID scrub, which adds to the I/O load. MD is generally pretty good at yielding to other I/O consumers, but it still uses a minimum of 30MB/sec for that, plus the CPU usage for parity calculation.
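If the impact matters more than the duration, the MD portion can be bounded with the standard sysctls; a sketch (run as root; values are in KB/s, and speed_limit_min is the bandwidth MD reserves even under competing I/O):
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
echo 5000 > /proc/sys/dev/raid/speed_limit_min    # shrink MD's guaranteed scrub/resync bandwidth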
- Michael_OzLuminary
Still on 6.7.4.
Started scrub.
Three threads as described above: the main process & two threads (shown with the H command in htop)
 PPID   PID USER     IO  IORR IOWR  IO PRI  NI  VIRT  RES SHR S CPU% MEM%   TIME+ Command
    1 29003 root  41462 41462    0  B3  19  -1 32180  200  16 S  5.6  0.0 0:26.28 `- btrfs scrub start /N316AR6
    1 29005 root      0     0    0  B3  19  -1 32180  200  16 S  0.0  0.0 0:00.03 |  `- btrfs
    1 29004 root  41462 41462    0  id  19  -1 32180  200  16 D  5.6  0.0 0:26.26 |  `- btrfs
I'm reading that as the IO/IORR/IOWR & CPU usage is accounted for in the main process too, rather than both doing I/O & CPU.
Note the active thread is at idle iopriority, nice -1 (as most things spawned from readynasd seem to inherit -1).
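Those classes can be double-checked straight from the shell; for example, with the PIDs from the listing above:
ionice -p 29003    # main process: prints "best-effort: prio 3"
ionice -p 29004    # the thread doing the I/O: prints "idle"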
However, sorting by CPU (K command to show kernel threads), here is everything with CPU > 0.
 PPID   PID USER     IO  IORR IOWR  IO PRI  NI  VIRT  RES  SHR S CPU% MEM%    TIME+ Command
    2 29066 root      0     0    0  B4  20   0     0    0    0 R 39.9  0.0  3:52.37 kworker/u8:10
    2 29043 root      0     0    0  B4  20   0     0    0    0 R 38.2  0.0  4:47.15 kworker/u8:9
    2 28656 root      0     0    0  B4  20   0     0    0    0 R 37.0  0.0  4:37.22 kworker/u8:2
    2 28793 root      0     0    0  B4  20   0     0    0    0 R 37.0  0.0  4:42.18 kworker/u8:3
    2 30086 root      0     0    0  B4  20   0     0    0    0 R 35.8  0.0  1:05.82 kworker/u8:14
    2 28903 root   57.2  57.2    0  B4  20   0     0    0    0 S 33.4  0.0  4:37.18 kworker/u8:8
    2 28071 root      0     0    0  B4  20   0     0    0    0 R 33.4  0.0  4:27.62 kworker/u8:0
    2 29095 root   19.2  19.2    0  B4  20   0     0    0    0 R 31.6  0.0  1:44.06 kworker/u8:13
    2 28902 root      0     0    0  B4  20   0     0    0    0 R 31.0  0.0  4:56.74 kworker/u8:7
    2  1515 root      0     0    0  B4  17  -1     0    0    0 R 25.6  0.0 35h04:21 md127_raid6
    2 27507 root      0     0    0  B4  20   0     0    0    0 R 17.9  0.0  4:31.39 kworker/u8:1
    2 29067 root      0     0    0  B4  20   0     0    0    0 R 17.9  0.0  4:30.13 kworker/u8:11
    1 29003 root  45490 45490    0  B3  19  -1 32180  200   16 S  6.0  0.0  1:12.96 btrfs scrub start /N316AR6
    1 29004 root  45517 45517    0  id  19  -1 32180  200   16 D  6.0  0.0  1:12.87 btrfs
    2 29001 root      0     0    0  ??  39  19     0    0    0 R  3.0  0.0  2:22.40 md127_resync
28517 28522 root      0     0    0  B4  20   0 29460 3688 3004 R  3.0  0.2  0:56.68 htop
    2  1402 root      0     0    0  B0   0 -20     0    0    0 S  1.8  0.0  0:54.53 kworker/1:1H
    2  1401 root      0     0    0  B0   0 -20     0    0    0 S  1.2  0.0  0:50.00 kworker/0:1H
    2  1359 root      0     0    0  B0   0 -20     0    0    0 S  0.6  0.0  3:29.12 kworker/2:1H
    2  1389 root      0     0    0  B0   0 -20     0    0    0 S  0.6  0.0  3:34.20 kworker/3:1H
    2     7 root      0     0    0  B4  20   0     0    0    0 R  0.6  0.0  4:10.65 rcu_sched
    1  3703 nut       0     0    0  B4  20   0 17240 1296  932 S  0.6  0.1  1:31.42 /lib/nut/usbhid-ups -a UPS
Note all the CPU chewed up by kworker threads, nice 0, doing little IO. (kworker IO increased later: similar numbers as above, but with more threads doing IO.)
btaroli's media processes (inferring from the above iopriority B7/B6 - I don't know what they are doing) will have higher nice values (lower CPU priority) and be CPU constrained. Those at B4 will have nice 0, round-robin with all those kworkers, so ~1/12th of the CPU (?5/17th?).
How changing the iopriority of the other two threads can change this I can't fathom ATM.
Perhaps media threads with an 'interactive' workload should run at nice -1 for now.
Longer term there should be a gap between the current -1 processes (readynasd etc.) and the default worker thread priority (currently 0), so intermediate-priority things can fit in the middle?
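In the meantime, bumping one interactive service by hand is straightforward; a sketch using btaroli's dvblink server from the top output earlier in the thread (assuming that's the process being starved):
renice -n -1 -p $(pgrep -f dvblink_server)    # match readynasd's nice -1 so the nice-0 kworkers yield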
I'll repeat this on 6.7.5 when I get around to it, I'm currently juggling 10TB disks upgrading...nothing happens fast...