Forum Discussion
Sandshark
Sep 08, 2021 - Sensei
Scrub slowed to a crawl, then locked up system.
My RN4200V2 backup system running OS6.9.6 used to complete a scrub in a bit over a day. It contained 12x3TB HGST enterprise drives in XRAID/RAID5 at that time. A couple months ago, I replaced 4 of ...
StephenB
Sep 30, 2021 - Guru - Experienced User
Sandshark wrote:
It did significantly reduce usage of snapshot space in that backup share, which was my intent, but is there some reason that type of data (snapshots only) might be affecting the scrub? After the initial failure, I disabled all backup jobs for the re-tries, so an active ReadyDR transfer wasn't a factor at least then.
I don't see why switching to ReadyDR would matter.
One thing you could try is using -r on the scrub (read-only mode), just to see if that gets past the 2.15 TB barrier more quickly.
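If you try that over SSH, something along these lines should do it -- just a sketch, and it assumes the data volume is mounted at /data as on a stock OS6 unit:
btrfs scrub start -r /data
btrfs scrub status /data
The -r flag makes the scrub verify checksums without attempting any repairs, so it's a harmless way to see whether the repair path is what's slowing things down.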
Sandshark
Sep 30, 2021 - Sensei
I'll put a read-only scrub on the list of things to try (I never realized it was an option), but I may not have to. Since my backup unit was down for a while for drive testing, I manually kicked off all backup jobs once all the drives were back in so they'd be up to date before I disabled them for another try at a scrub. When it got to the ReadyDR job, it immediately failed, and top showed the btrfs-transacti process taking 100% of a CPU. It took over 40 minutes for that to clear up, and now it's successfully running a ReadyDR job that has a lot of catching up to do.
I'm beginning to suspect that having a ReadyDR job kick off in mid-scrub is a bad thing, which did happen during the original, scheduled one (with a concurrent MDADM re-sync). Then I disabled it, so nothing wrote to that share and alerted BTRFS that something was amiss with it. Manually running it apparently did, and it looks like it had a positive effect. Now, to see if it fixed the scrub problem, too.
- Sandshark, Oct 04, 2021 - Sensei
OK, so all that btrfs-transacti activity had no effect. I started the read-only scrub. It still slowed down, but it did get through the roadblock at 2.15TB faster -- maybe 24 hours. I'm not sure, because it was at something like 20 when I looked and it was still at a crawl, then the next morning it had obviously broken through. But it hit another at 9.5TB. It's been slogging through that for over 36 hours and has made 1.1TB of progress. It's currently at 10.2TB, which the GUI says is 36%, with no errors.
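For reference, the progress and error counts can also be checked from SSH with something like this (assuming the volume is mounted at /data; -d breaks the statistics out per device):
btrfs scrub status -d /data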
The thing that's still noticeable is those multiple kworker threads all taking 100% of CPU. On my other unit during a scrub, and even with the MDADM re-sync in process, those ran under 40%, jumping to 100% only occasionally and briefly. The btrfs process rarely makes the top 20 in top on this one, where it jumped to #1 a lot on the other. So whatever is bogging down those kworker processes seems to be the problem. But what calls to the kernel are causing all this?
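One way to get a hint -- just a sketch, it needs root, and <PID> is whatever top shows for the busiest kworker -- is to dump that thread's kernel stack a few times and see where it is spending its time:
cat /proc/<PID>/stack
If the stack keeps showing btrfs functions, that at least narrows the bottleneck to the filesystem rather than the md layer.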
- StephenB, Oct 05, 2021 - Guru - Experienced User
I'm wondering if a balance would behave similarly???
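A low-impact way to check, assuming the volume is at /data, would be a filtered balance that only relocates nearly-empty data chunks, for example:
btrfs balance start -dusage=5 /data
btrfs balance status /data
That should finish quickly and would show whether a balance hits the same kind of wall.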
- Sandshark, Oct 05, 2021 - Sensei
The last balance was after this first became an issue, and it completed in under two hours.
The scrub got through the last roadblock and has now been stuck on the next, at 16.7TB. It's back to progressing around 0.1 GB per 8 hours, but still with no errors.
- rn_enthusiast, Oct 09, 2021 - Virtuoso
Scour the dmesg logs, either from the log files or with the command:
dmesg -T
Look for anything btrfs or disk related. Maybe there is some issue with the NAS talking to the disks: intermittent connection errors and that kind of thing.
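A quick way to narrow that output down (just a suggestion; adjust the pattern as needed) is to filter for filesystem and disk messages:
dmesg -T | grep -iE 'btrfs|ata|scsi|error'
Link resets, timeouts, or btrfs checksum complaints there would point at a disk or cabling problem rather than the scrub itself.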
- Sandshark, Oct 10, 2021 - Sensei
I have already looked at the logs and found nothing that seemed related. The read-only scrub did complete in about 9 days, reporting zero errors.