The many, MANY days of scrubbing.

Question

RN516 - OS 6.10.9
Intel XEON E3-1265L V2 @ 2.50GHz
16GB RAM
6 x 10 TB HGST HUH721010ALE601 disks in a RAID 6

So, I decided to do a little preventative maintenance on my RN516, LOL:

It still has not finished, June 18th @ ~3pm Central time currently:

As you can see, the last time I ran a scrub, it finished in a couple of days (more or less). Looks like it was about 60% utilized at that time. Now I'm looking at 11 days (and counting). 😬

I looked at the SMART (smartctl -x /dev/sdX) data for the disks, and it was clean.

It's interesting though, because all 8 threads of the CPU are running near 100% utilization during this entire time, but the disk throughput is very low. There is nothing else running on this array, but it is about 80% full. Memory usage is squat.

It's primarily the backup target for my Veeam agents, and ReadyDR from my RN316. Veeam is pretty unhappy with it right now, as most of the backups are failing (presuming due to high latency from the maxed out CPU). But, ReadyDR seems fine with it. Hopefully it'll be done within the next day or two so I can turn my Veeam agents back on.

StephenB · Answer

I suggest looking for disk errors in the logs.

You can check the progress with ssh (keeping in mind that the system is doing both a RAID mdadm scrub and a BTRFS scrub).

How big is the volume?
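For the BTRFS side, one way to watch progress from the ssh shell is to pull single counters out of the raw status output. This is only a sketch: `scrub_field` is a made-up helper name, and `/dev/md127` is an assumption, so substitute your own data volume's device or mount point. The mdadm side can be read straight from `/proc/mdstat`.

```shell
# Print one counter from the raw output of `btrfs scrub status -R`.
# scrub_field is a hypothetical helper name, not part of btrfs-progs.
scrub_field() {
    # Raw lines look like "data_bytes_scrubbed: 12345"; match on the key
    # before the colon and print the value after it.
    awk -v key="$1" '{ k = $1; sub(/:$/, "", k); if (k == key) print $2 }'
}

# Typical use on the NAS, as root (the device path is an assumption):
#   cat /proc/mdstat        # mdadm resync progress
#   btrfs scrub status -R /dev/md127 | scrub_field data_bytes_scrubbed
```

Running the second command a few minutes apart and comparing the two values gives a rough idea of how fast the scrub is actually moving.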

Laserbait · Answer

Hey there! The volume is 36.4TB (6x 10 TB disks in a RAID 6). I didn't see any SMART errors. What log do I find disk errors in?

I checked dmesg, and the last messages that I see are from the resync of md127, and that completed in a pretty reasonable amount of time:

[Fri Jun 7 02:23:13 2024] md: requested-resync of RAID array md127
[Fri Jun 7 02:23:13 2024] md: minimum _guaranteed_ speed: 30000 KB/sec/disk.
[Fri Jun 7 02:23:13 2024] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
[Fri Jun 7 02:23:13 2024] md: using 128k window, over a total of 9761587136k.
[Mon Jun 10 20:04:40 2024] md: md127: requested-resync done.

There's nothing after that.

I checked the status of the btrfs scrub, and it's diligently ticking away, no major errors reported:

root@RN516:/home/admin# btrfs scrub status -R /dev/md127
scrub status for 3e0931f1-84c5-45a4-9db5-e1d7f61ce675
scrub started at Fri Jun 7 02:26:46 2024, running for 278:22:08
data_extents_scrubbed: 498981037
tree_extents_scrubbed: 2327562
data_bytes_scrubbed: 32294854987776
tree_bytes_scrubbed: 76269551616
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 1019
csum_discards: 0
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 32481801666560

StephenB · Answer

Laserbait wrote:
I checked dmesg, and the last messages that I see are from the resync of md127, and that completed in a pretty reasonable amount of time:

So it is just the BTRFS scrub that is glacially slow. Looking at "Data Bytes Scrubbed", it appears to be 80% done (completing 32 TB out of 40). At that rate it would have about 68 hours to go from the time you measured that status. So 2-3 more days to go.
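The arithmetic behind that kind of estimate can be sketched as below. It assumes the scrub rate stays constant, and it takes the 40 TB total from the estimate above, which may not match what the scrub actually has to read. `hms_to_hours` and `scrub_eta_hours` are hypothetical helper names.

```shell
# Convert an H:M:S duration such as "278:22:08" into decimal hours.
hms_to_hours() {
    echo "$1" | awk -F: '{ printf "%.2f\n", $1 + $2 / 60 + $3 / 3600 }'
}

# Estimate remaining hours as: elapsed * (total - scrubbed) / scrubbed.
# Arguments: bytes scrubbed, elapsed hours, total bytes to scrub (assumed).
scrub_eta_hours() {
    awk -v done_b="$1" -v elapsed="$2" -v total="$3" \
        'BEGIN { printf "%.1f\n", elapsed * (total - done_b) / done_b }'
}

# Values from the posted status output; the 40 TB total is an assumption.
elapsed=$(hms_to_hours "278:22:08")
scrub_eta_hours 32294854987776 "$elapsed" 40000000000000
```

With the exact byte count and running time from the status output, this comes out in the mid-60s of hours, close to the figure quoted above.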

Generally with BTRFS, the time it takes balance and scrub operations to complete depends on how much work the file system needs to do. So the next time you run it, it should go much quicker (likely completing before the mdadm sync). If you've never run a balance, then that could also take a long time the first time you run it.

It can be canceled from ssh if necessary. But if you can live with the performance hit a bit longer, it might be better to let it finish.

Laserbait wrote:
What log do I find disk errors in?

First, I doubt you'll find any. They would have shown up during the mdadm sync, and should have been in the SMART errors.

When you are using ssh, you can just use journalctl to look at the logs. I usually reverse the order with -r to display the newest entries first. You can add --no-pager and then pipe the output through grep to find specific info.
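Put together, that might look like the sketch below. The keyword pattern is only an illustrative assumption (a few strings that commonly show up in kernel disk and BTRFS error messages), not an exhaustive list, and `disk_errors` is a hypothetical helper name.

```shell
# Filter log lines on stdin for likely disk or filesystem errors.
# The pattern is an illustrative assumption; extend it as needed.
disk_errors() {
    grep -Ei 'i/o error|medium error|ata[0-9]+.*(error|failed)|btrfs.*error'
}

# Typical use on the NAS: newest entries first, no pager, first 50 hits.
#   journalctl -r --no-pager | disk_errors | head -n 50
```

Keeping the filter in a function makes it easy to tighten or widen the pattern without retyping the whole pipeline.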

When looking at the log zip, disk errors could be in dmesg.log, system.log, kernel.log, and systemd-journal.log.

Sandshark · Answer

I have found that there are a couple of things that can really slow down a scrub: highly fragmented files and very large files. I have a very large Veracrypt volume on my main NAS, and the scrub grinds to a very slow rate when it gets to that file. Like with yours, the kworker processes jump to close to 100% (where they typically run in the low 90's) and the BTRFS process drops to around 0.3% (typically around 5%). But once it gets past that file, the speed jumps back up again. So, hopefully, yours has also hit something that slows it down but will speed up again once it gets past it.

Laserbait · Answer

Yeah, this volume/array is all very large files with a lot of change data. You know, now that you mention it, the last scrub that I did, I had run a balance and defrag shortly before the scrub, and the scrub only took 2ish days.

So I might have to test that out!

And the scrub is almost done!
