Scrub slows to a crawl -- is it SMR causing it?
I have a converted ReadyData5200 as my primary NAS. It's got a 2.67GHz quad-core (eight-thread) Xeon X3450 and 8GB of dual-channel RAM, which make it pretty snappy. A few days ago it started the first scrub since I replaced one 6TB WD Red; the replacement turned out to be one of those SMR drives, bought before it became common knowledge that WD had made the switch.

The scrub usually takes a couple of days. After 2 days it was at only 82%, and I told myself that was probably the penalty for SMR -- not really too bad -- and expected it to complete soon. But at the end of 3 days it was only at 87%. Oddly, SSH showed that the mdadm re-sync had completed, and it had never previously finished that far ahead of the BTRFS scrub (if ahead at all). Overall network access was clearly being affected far more than during past scrubs, and top showed a lot more kworker and BTRFS activity than I remembered.
I then looked at the main log and saw that snapshots seemed to be occurring properly, but only the first set of snapshot trims that usually accompany them had run in that 3-day span. The rsync backup jobs were also badly bogged down.
I issued a btrfs scrub cancel /data from SSH and saw a flurry of activity in top; all of the backed-up snapshot deletions then showed in the log, and the backups completed (very quickly, since they were for shares with little or no change since the last run). After the flurry died down (<10 minutes), I issued btrfs scrub resume /data, and the scrub is back to progressing at a normal rate. Network access is also back to typical speed (only slightly slowed by the ongoing scrub). In just over an hour the scrub has gone from 87% to 94% complete -- more progress than it made in a whole day before I interrupted it. It's that significant.
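For anyone else who hits this, here's a sketch of the cancel/drain/resume sequence I used (assumptions: data volume mounted at /data, run as root; the scrub_pct helper and the fixed 10-minute wait are mine, not anything official, and the exact status output format varies between btrfs-progs versions):

```shell
# Hypothetical helper: pull the percent-complete figure out of
# `btrfs scrub status` output (assumes a "(87.00%)" style field).
scrub_pct() {
    sed -n 's/.*(\([0-9.]*\)%).*/\1/p'
}

# On the NAS itself (as root), the recovery sequence was roughly:
#   btrfs scrub cancel /data               # stop the stalled scrub
#   sleep 600                              # let the queued snapshot deletions drain
#   btrfs scrub resume /data               # pick up where it left off
#   btrfs scrub status /data | scrub_pct   # watch progress
```

The key point is the pause between cancel and resume: that's what gave btrfs-cleaner room to flush the backlog before the scrub started competing for I/O again.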
Is this apparent "task backup" and its effect on a scrub something inherent in BTRFS and/or how ReadyNASOS uses it (all BTRFS tasks seem to have the same priority of 20 and niceness of 0), perhaps pushed into this situation by the slightly longer scrub due to the SMR drive (assuming that actually extended it)? Or is the SMR drive itself to blame, with its background data re-arranging being what backed up? Since one of the reasons I switched to rack-mount is that I had similar issues with scrubs and even balances on an EDA500, where access is bottlenecked by the eSATA interface, I suspect it's not just the SMR drive. But is there anything that can be done about it (by me, or in an OS update)? If my NAS has this issue with its CPU and RAM, those with lesser hardware are clearly going to see it worse.
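On the priority question: one thing that might be worth trying (my speculation, not a ReadyNASOS feature) is starting future scrubs at idle I/O priority. btrfs-progs' scrub start supports -c to set the ioprio class (3 = idle), though whether the btrfs-progs build on ReadyNASOS accepts it is an assumption on my part:

```shell
# Hypothetical wrapper: build the command to start a scrub at idle
# I/O priority, so it yields to normal NAS traffic (-c 3 = idle class).
scrub_idle_cmd() {
    printf 'btrfs scrub start -c 3 %s\n' "$1"
}

# Usage on the NAS (as root):
#   eval "$(scrub_idle_cmd /data)"
```

That wouldn't fix the snapshot-deletion backlog itself, but it might keep the scrub from starving everything else while the backlog clears.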
Yeah, I know the WD Reds aren't rated for a 12-drive NAS. But I moved them from an RN516 and they seemed to be operating just fine up until this. The one failure was not premature.
Re: Scrub slows to a crawl -- is it SMR causing it?
This sounds like it might be SMR. The straight mdadm resync should be handled fairly well, since it is sequential. But if you mix in other activity (like the btrfs scrub combined with snapshot deletions), the caching strategies don't work well.
But I have no way to test this, since I don't have an SMR internal drive. Not sure if you've seen this testing: https://www.servethehome.com/wd-red-smr-vs-cmr-tested-avoid-red-smr/
Re: Scrub slows to a crawl -- is it SMR causing it?
I had not seen that particular article, but I was aware of the basic conclusions.
It seems pretty clear that the SMR drive is at least part of the problem. But with many more users potentially running SMR drives because they are unaware of this issue, I just wonder whether Netgear can and will do something to help. As I said, this is very much in line with what I saw on an EDA500, where a drive I/O bottleneck starts to domino out of control as more and more processes with equal priority are spawned and try to get their "share of the pie". My NAS is powerful enough that readynasd didn't get locked out, but I have seen that happen on lesser NAS hardware, cutting off GUI access.
I think I'm going to replace the 6TB EFAX with a larger one that's not SMR and just keep the current one as an emergency backup.