6.6.1 Scrub Hammers CPU
And when I say hammers, I mean it runs with -1 prio and causes 6-8 kernel worker threads each vying to consume 100% of CPU. So rabid is this consumption that all other background processes, including one's third party apps and Time Machine backups, just cease to function.
So with all the attention to being a good neighbor during resyncs and whatnot, why is scrub such a terrible neighbor? I'd love to enable it to run every month or quarter, but I can't stand to have my NAS more or less inoperable for my needs while it's running.
Re: 6.6.1 Scrub Hammers CPU
Which model is this on?
Can you send in your logs (see the Sending Logs link in my sig)?
Re: 6.6.1 Scrub Hammers CPU
Sure, I'd be glad to. This is on a 528.
Just for the sake of completeness, here's top with just background running...
top - 03:34:19 up 3 days, 49 min, 2 users, load average: 0.01, 0.03, 0.06
Tasks: 241 total, 2 running, 239 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 0.7 sy, 0.0 ni, 98.8 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16303964 total, 15086028 used, 1217936 free, 2468 buffers
KiB Swap: 3139580 total, 0 used, 3139580 free. 13522476 cached Mem

  PID USER  PR  NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
11971 root  20   0 3584900 356660 27084 S  4.0  2.2 182:46.02 /apps/dvblink-tv-server/dvblink_server
10318 root  20   0  233308  31464  5660 S  1.0  0.2  40:27.23 /usr/bin/python /apps/dropboxmanager/web/manage.py run+
 5138 root  20   0  992516  12028  8796 S  0.3  0.1  18:19.58 /opt/p2p/bin/leafp2p -n
 5419 root  20   0   28788   3060  2468 R  0.3  0.0   7:33.54 top
22590 root  20   0       0      0     0 S  0.3  0.0   0:00.30 [kworker/2:6]
    1 root  20   0  202460   6504  4516 S  0.0  0.0   0:38.41 /sbin/init
    2 root  20   0       0      0     0 S  0.0  0.0   0:00.10 [kthreadd]
    3 root  20   0       0      0     0 S  0.0  0.0   0:06.33 [ksoftirqd/0]
    5 root   0 -20       0      0     0 S  0.0  0.0   0:00.00 [kworker/0:0H]
    7 root  20   0       0      0     0 R  0.0  0.0   0:57.85 [rcu_sched]
    8 root  20   0       0      0     0 S  0.0  0.0   0:00.00 [rcu_bh]
    9 root  rt   0       0      0     0 S  0.0  0.0   0:00.21 [migration/0]
   10 root  rt   0       0      0     0 S  0.0  0.0   0:00.83 [watchdog/0]
   11 root  rt   0       0      0     0 S  0.0  0.0   0:00.85 [watchdog/1]
   12 root  rt   0       0      0     0 S  0.0  0.0   0:00.24 [migration/1]
   13 root  20   0       0      0     0 S  0.0  0.0   0:03.26 [ksoftirqd/1]
   15 root   0 -20       0      0     0 S  0.0  0.0   0:00.00 [kworker/1:0H]
And here's what it looks like shortly after kicking off a scrub.
top - 03:39:04 up 3 days, 53 min, 2 users, load average: 2.67, 0.65, 0.26
Tasks: 249 total, 7 running, 242 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 95.1 sy, 0.0 ni, 3.5 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
KiB Mem: 16303964 total, 15111680 used, 1192284 free, 2468 buffers
KiB Swap: 3139580 total, 0 used, 3139580 free. 13537484 cached Mem

  PID USER  PR  NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
14083 root  20   0       0      0     0 R 69.1  0.0   0:10.26 [kworker/u8:2]
18741 root  20   0       0      0     0 R 59.1  0.0   0:10.26 [kworker/u8:0]
22107 root  20   0       0      0     0 R 55.5  0.0   0:08.28 [kworker/u8:1]
23262 root  20   0       0      0     0 R 55.1  0.0   0:08.74 [kworker/u8:9]
23253 root  20   0       0      0     0 R 51.5  0.0   0:05.02 [kworker/u8:7]
 9426 root  20   0       0      0     0 R 50.1  0.0   0:09.49 [kworker/u8:5]
 2455 root  20   0       0      0     0 S 18.9  0.0   1:37.18 [md126_raid6]
23229 root  19  -1   40340    212    12 S 10.0  0.0   0:02.30 btrfs scrub start /data
18976 root  20   0       0      0     0 S  4.3  0.0   0:02.26 [kworker/u8:8]
11971 root  20   0 3584900 356660 27084 S  3.7  2.2 182:56.16 /apps/dvblink-tv-server/dvblink_server
 5335 root  19  -1 1589984  59196 12228 S  2.0  0.4   5:41.17 /usr/sbin/readynasd -v 3 -t
10318 root  20   0  233308  31464  5660 S  1.7  0.2  40:29.84 /usr/bin/python /apps/dropboxmanager/web/manage.py run+
23226 root  39  19       0      0     0 D  1.3  0.0   0:00.35 [md126_resync]
 2335 root   0 -20       0      0     0 S  0.7  0.0   0:01.65 [kworker/1:1H]
 5138 root  20   0  992516  12028  8796 S  0.7  0.1  18:20.82 /opt/p2p/bin/leafp2p -n
22590 root  20   0       0      0     0 S  0.7  0.0   0:01.27 [kworker/2:6]
 2340 root   0 -20       0      0     0 S  0.3  0.0   0:27.69 [kworker/0:1H]
Re: 6.6.1 Scrub Hammers CPU
Hi btaroli,
Have you sent your logs already?
If you have, I will send an inquiry to our subject matter expert about it.
Regards,
Re: 6.6.1 Scrub Hammers CPU
Sent them attn to mdgm the same time I posted my reply.
Re: 6.6.1 Scrub Hammers CPU
Hi btaroli,
Okay, I will give mdgm a heads up about your case.
Regards,
Re: 6.6.1 Scrub Hammers CPU
There have been some changes in 6.7.0, I think. Do you still have this problem on ReadyNASOS 6.7.0-T158 (Beta 1)?
Re: 6.6.1 Scrub Hammers CPU
I'd certainly be willing to try it. Given some pain in the last upgrade or two, I've been a little hesitant to rush into betas. I'll take a look at the release notes and forum posts on this one and then give it a try if it's looking fairly quiet on the problem front.
Re: 6.6.1 Scrub Hammers CPU
I'm seeing this same issue on my RN316 running 6.6.1 with 6x4TB drives in a RAID 5. All 4 cores are running at 95-100%. Did you update to 6.7.1 yet? If so, did that improve the scrub efficiency?
Re: 6.6.1 Scrub Hammers CPU
I'm going to try 6.7.1 tonight after reading through the forums to see if there have been any post-upgrade issues. I only learned about 6.7.1 after seeing mentions of it over at DVBLogic's user forums...
Re: 6.6.1 Scrub Hammers CPU
If you like, try running the scrub with the antivirus service disabled.
Re: 6.6.1 Scrub Hammers CPU
I gather 6.7.1 fixes that, but baby steps I shall take in re-enabling functions. 😄
Re: 6.6.1 Scrub Hammers CPU
@Retired_Member wrote: If you like, try running the scrub with the antivirus service disabled.
I have never used the antivirus feature on my ReadyNAS units.
Re: 6.6.1 Scrub Hammers CPU
We experience the "bad neighbor" scrub behavior too - I don't dare schedule a scrub during business hours: the responsiveness of the ReadyNAS over SMB suffers pretty severely. CPU load average 2.5-3 during a scrub on a ReadyNAS 516. OS 6.7.1, have never used the antivirus feature.
Can this be tamed down?
Re: 6.6.1 Scrub Hammers CPU
Exactly! It'd be great to be able to throttle the scrub somehow. Maybe a selectable priority, or limit it to a single thread/core.
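For anyone comfortable with SSH, the btrfs command itself does expose an I/O-priority knob when you start a scrub by hand. A rough sketch, assuming the data volume is mounted at /data (note this only governs disk scheduling, not how hard the kernel worker threads hit the CPU):

# Start the scrub in the idle I/O class so it yields to other disk activity
btrfs scrub start -c 3 /data

# Check progress at any time
btrfs scrub status /data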
Re: 6.6.1 Scrub Hammers CPU
Well, this is confirmed to still be happening in 6.7.1. I observe that overall I/O and wait time seem OK; indeed, the journal entries that pop up as the job starts suggest it's throttling on I/O rate. However, the kernel worker processes still monopolize the CPU cores/threads. A certain amount of application CPU usage seems OK, but if you have anything like Plex transcodes or AFP-based Time Machine, which cause a fair amount of CPU activity themselves, then those processes get starved to the point of being almost unusable.
Some of this is just Btrfs behavior, which I can compare to similar operations I do on even newer kernels on other Linux machines. But when you have a server environment where there is an expectation of responsiveness from applications, it can be problematic. On this front, the only thing I would consider a standard OS issue is Time Machine backups. These are CPU intensive and will be seriously delayed, if not fail outright, based on previous painful experiences. In this run I'm not allowing TM to even trigger until the scrub finishes; I know for a fact I'd wind up having to trash my whole backup archive if I let it try and fail.
As for Plex, I can work around the issue by enabling direct play and disabling transcoding in the client config. But hopefully we get to a point where scrubs will gracefully butt out when other activity requires CPU attention.
Re: 6.6.1 Scrub Hammers CPU
That's unfortunate to hear. I was hoping that it would be somewhat better. I think it's going on day four for my scrub at about 90% complete now.
Are you using compression on any of your volumes/shares?
Re: 6.6.1 Scrub Hammers CPU
I am running compression on one of my shares. I think I'll copy that data off to a new uncompressed share. I'll run another scrub to see what happens.
Re: 6.6.1 Scrub Hammers CPU
Hmm. I've read that I/O throttling changed in more recent releases. I started a scrub while transcoding a show using Plex. In the past, this would have immediately resulted in complaints that the server couldn't keep up with the transcode. In two hours I've had one such complaint, but overall it's been doing alright.
Re: 6.6.1 Scrub Hammers CPU
I just updated to 6.7.5 a few days ago, so I'm interested to see what happens with my next scrub. I will also be adding an EDA500 with 5x4TB drives in a few days, so this will be interesting.
Re: 6.6.1 Scrub Hammers CPU
@Laserbait wrote:
I just updated to 6.7.5 a few days ago, so I'm interested to see what happens with my next scrub.
I just ran one on my 526X. It took about 39 hours for 4x6TB RAID-5. The file indexing feature in 6.8.0 was also actively indexing.
Re: 6.6.1 Scrub Hammers CPU
@Laserbait wrote:
6.8.0? Is that out already?
It's at Beta 2.
Re: 6.6.1 Scrub Hammers CPU
From another thread... once you start the scrub, find the process running "btrfs scrub ..." and check it with ionice (ionice -p ####) to see if it's set to idle. If it's "none", then try "ionice -p #### -c 3" (which sets it to class 3/idle) and see how this affects performance. I noticed a significant improvement in CPU-heavy tasks (e.g. transcoding) while the scrub runs.
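A minimal sketch of that suggestion, assuming the scrub is already running, you're root over SSH, and <PID> is whatever the first command prints:

# Find the PID of the running scrub command
ps -ef | grep "[b]trfs scrub"

# Show its current I/O scheduling class
ionice -p <PID>

# If it reports "none", drop it to the idle class (class 3)
ionice -c 3 -p <PID>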
Maybe 6.8.0 fixes this, but I'm a bit leery of beta releases of late. It would be nice to know if this is addressed there though.
Re: 6.6.1 Scrub Hammers CPU
I'm currently on 6.7.5 and running a scrub. I do not see a process that shows btrfs scrub.
ps -A | grep -i btrfs
1356 ? 00:00:00 btrfs-worker
1358 ? 00:00:00 btrfs-worker-hi
1359 ? 00:00:00 btrfs-delalloc
1360 ? 00:00:00 btrfs-flush_del
1361 ? 00:00:00 btrfs-cache
1362 ? 00:00:00 btrfs-submit
1363 ? 00:00:00 btrfs-fixup
1364 ? 00:00:00 btrfs-endio
1365 ? 00:00:00 btrfs-endio-met
1366 ? 00:00:00 btrfs-endio-met
1367 ? 00:00:00 btrfs-endio-rai
1368 ? 00:00:00 btrfs-endio-rep
1369 ? 00:00:00 btrfs-rmw
1370 ? 00:00:00 btrfs-endio-wri
1371 ? 00:00:00 btrfs-freespace
1372 ? 00:00:00 btrfs-delayed-m
1373 ? 00:00:00 btrfs-readahead
1374 ? 00:00:00 btrfs-qgroup-re
1375 ? 00:00:00 btrfs-extent-re
1376 ? 00:00:00 btrfs-cleaner
1377 ? 00:00:26 btrfs-transacti
1487 ? 00:00:00 btrfs-worker
1488 ? 00:00:00 btrfs-worker-hi
1489 ? 00:00:00 btrfs-delalloc
1490 ? 00:00:00 btrfs-flush_del
1491 ? 00:00:00 btrfs-cache
1492 ? 00:00:00 btrfs-submit
1493 ? 00:00:00 btrfs-fixup
1494 ? 00:00:00 btrfs-endio
1495 ? 00:00:00 btrfs-endio-met
1496 ? 00:00:00 btrfs-endio-met
1497 ? 00:00:00 btrfs-endio-rai
1498 ? 00:00:00 btrfs-endio-rep
1499 ? 00:00:00 btrfs-rmw
1500 ? 00:00:00 btrfs-endio-wri
1501 ? 00:00:00 btrfs-freespace
1503 ? 00:00:00 btrfs-delayed-m
1504 ? 00:00:00 btrfs-readahead
1505 ? 00:00:00 btrfs-qgroup-re
1506 ? 00:00:00 btrfs-extent-re
2884 ? 00:02:07 btrfs-cleaner
2885 ? 09:55:28 btrfs-transacti
26453 ? 00:17:54 btrfs <defunct>
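One guess: ps -A only prints the short command name, so the userspace scrub process never matches the string "btrfs scrub" literally. Matching against full command lines, or asking btrfs itself, may be more reliable; a quick sketch, with the volume path /data assumed:

# Search full command lines instead of short names
ps -ef | grep "[s]crub"

# Or query the scrub state directly
btrfs scrub status /data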