
Forum Discussion

Sandshark
Sensei
Sep 08, 2021

Scrub slowed to a crawl, then locked up system.

My RN4200V2 backup system running OS 6.9.6 used to complete a scrub in a bit over a day.  It contained 12x3TB HGST enterprise drives in XRAID/RAID5 at that time.  A couple of months ago, I replaced 4 of the drives with 6TB HGST enterprise drives.  This weekend, it started its first scheduled scrub since the drive replacement, and it has not gone well.

 

A lot of stuff just locked up, but I still had SSH access (probably because I had a session already open).  The GUI was offline and readynasd was not shown in top.  The MDADM re-sync was still going on.  An rsync process was also running, though the scheduled backups should have completed long before I became aware of the problem.  Eight kworker tasks (one for each thread of the 4-core processor, I think) were shown as running at nearly 100%, like this:

 5243 root      20   0       0      0      0 R  97.6  0.0 130:47.01 kworker/u16:1
23148 root      20   0       0      0      0 R  97.6  0.0   5:22.06 kworker/u16:3
21806 root      20   0       0      0      0 R  96.9  0.0   5:26.25 kworker/u16:6
18709 root      20   0       0      0      0 R  95.6  0.0  28:37.88 kworker/u16:7
 2844 root      20   0       0      0      0 R  94.9  0.0 114:08.03 kworker/u16:4
 9524 root      20   0       0      0      0 R  94.9  0.0  93:10.28 kworker/u16:10
21102 root      20   0       0      0      0 R  94.9  0.0  21:58.70 kworker/u16:2
23149 root      20   0       0      0      0 R  90.9  0.0   5:23.49 kworker/u16:11

Note that this is not the actual output from that time; it's from later (which I'll describe below), but it was essentially the same except for the run times.
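
If it locks up like this again, it might be worth seeing what those kworker threads are actually busy with.  A rough sketch, assuming root SSH access and substituting a PID from the top output (perf only applies if it happens to be installed on OS6):

cat /proc/5243/stack     # kernel stack of one busy kworker; shows what it is stuck doing
perf top -g              # optional: sample where kernel CPU time is going system-wide

If the stacks all end up in BTRFS scrub or checksum code, that at least narrows down where the time is going.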

 

I waited a few hours and the MDADM re-sync completed, but the BTRFS scrub had hardly moved.  I did a btrfs scrub cancel, waited for the kworkers to quiet down, then tried to get in via the GUI -- still no-go.  I then did a btrfs scrub resume, and it basically went back to where it was process-wise, but I lost a few TB of progress according to btrfs scrub status.  I then tried to re-start readynasd, and the system crashed and re-booted.
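
For reference, the commands involved were roughly these (reconstructing from memory; restarting readynasd via systemctl is just the usual way I'd do it on OS6, so treat the service name as an assumption):

btrfs scrub cancel /data        # took minutes to return while the kworkers wound down
btrfs scrub resume /data        # picked up from the last checkpoint, minus a few TB of reported progress
systemctl restart readynasd     # assuming the standard systemd service name; this is when it crashed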

 

After the re-boot, I re-started the scrub from SSH (so no MDADM re-sync).  It started out at a good pace, but quickly slowed to a crawl again:

root@RN4200B:~# date
Tue Sep  7 16:44:55 EDT 2021
root@RN4200B:~# btrfs scrub status /data
scrub status for a5dd5822-f567-4a74-825f-37cda442940d
        scrub started at Tue Sep  7 16:40:45 2021, running for 00:04:11
        total bytes scrubbed: 160.95GiB with 0 errors
root@RN4200B:~# date
Tue Sep  7 17:42:49 EDT 2021
root@RN4200B:~# btrfs scrub status /data
scrub status for a5dd5822-f567-4a74-825f-37cda442940d
        scrub started at Tue Sep  7 16:40:45 2021, running for 01:02:06
        total bytes scrubbed: 1.42TiB with 0 errors
root@RN4200B:~# date
Tue Sep  7 19:25:29 EDT 2021
root@RN4200B:~# btrfs scrub status /data
scrub status for a5dd5822-f567-4a74-825f-37cda442940d
        scrub started at Tue Sep  7 16:40:45 2021, running for 02:44:46
        total bytes scrubbed: 1.67TiB with 0 errors
root@RN4200B:~# date
Wed Sep  8 13:36:27 EDT 2021
root@RN4200B:~# btrfs scrub status /data
scrub status for a5dd5822-f567-4a74-825f-37cda442940d
        scrub started at Tue Sep  7 16:40:45 2021, running for 20:55:39
        total bytes scrubbed: 2.13TiB with 0 errors
root@RN4200B:~# date
Wed Sep  8 13:55:31 EDT 2021
root@RN4200B:~# btrfs scrub status /data
scrub status for a5dd5822-f567-4a74-825f-37cda442940d
        scrub started at Tue Sep  7 16:40:45 2021, running for 21:14:45
        total bytes scrubbed: 2.13TiB with 0 errors
root@RN4200B:~# btrfs scrub cancel /data
scrub cancelled
root@RN4200B:~# btrfs scrub status /data
scrub status for a5dd5822-f567-4a74-825f-37cda442940d
        scrub started at Tue Sep  7 16:40:45 2021 and was aborted after 21:18:02
        total bytes scrubbed: 2.13TiB with 0 errors
root@RN4200B:~# btrfs scrub resume /data
scrub resumed on /data, fsid a5dd5822-f567-4a74-825f-37cda442940d (pid=15870)
root@RN4200B:~# date
Wed Sep  8 14:00:10 EDT 2021
root@RN4200B:~# btrfs scrub status /data
scrub status for a5dd5822-f567-4a74-825f-37cda442940d
        scrub resumed at Wed Sep  8 13:59:47 2021, running for 21:18:27
        total bytes scrubbed: 2.13TiB with 0 errors
root@RN4200B:~# date
Wed Sep  8 14:03:23 EDT 2021
root@RN4200B:~# btrfs scrub status /data
scrub status for a5dd5822-f567-4a74-825f-37cda442940d
        scrub resumed at Wed Sep  8 13:59:47 2021, running for 21:21:37
        total bytes scrubbed: 2.13TiB with 0 errors
root@RN4200B:~# date
Wed Sep  8 15:01:15 EDT 2021
root@RN4200B:~# btrfs scrub status /data
scrub status for a5dd5822-f567-4a74-825f-37cda442940d
        scrub resumed at Wed Sep  8 13:59:47 2021, running for 22:19:26
        total bytes scrubbed: 2.15TiB with 0 errors

As you can see, I cancelled and re-started the scrub again, and it had no impact.  The cancel takes a very long time (minutes), if that's any clue.  The current top display has the same eight 90%-plus kworker tasks as before.  This is far sooner than it happened before, as the previous run had gotten to over 13TB of completed scrub with 0 errors.
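
I suppose something like this would show whether a single member disk is the bottleneck while it crawls (per-device scrub stats, plus iostat if the sysstat package happens to be installed):

btrfs scrub status -d /data     # per-device scrub progress; a lagging disk would stand out
iostat -x 5                     # per-disk utilization and wait times, refreshed every 5 seconds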

 

I didn't lose the GUI this time, but I also didn't have any backup tasks that tried to run (which may or may not have anything to do with it).  So I downloaded the logs, and I see nothing of interest.  In fact, I see nothing at all in some of the time frames when things weren't working well, so I wonder if logging also locked up.  Some of the drives in question have some ATA errors that happened a long time ago in another unit and have never grown since.  Other than that, I see nothing to indicate why it would slow down like this.
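
If it helps anyone reproduce this, checking the kernel log directly for disk or BTRFS complaints would look something like this (journalctl only applies if journald is actually keeping kernel messages on OS6):

dmesg -T | grep -iE 'ata[0-9]+|btrfs|blocked for more than'
journalctl -k --since "2021-09-07" | grep -iE 'error|fail|timeout'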

 

I certainly can't let this keep going at something like 0.02TB per hour.  I'm going to stop it and re-boot again, but I don't have a lot of faith that'll make a difference.  So, anyone have any ideas at all as to what I can do?

16 Replies

  • StephenB
    Guru - Experienced User

    smartctl -x might give you some info on whether any new errors are happening with the disks.  

     

    You might also consider pausing the scrub again and running some disk tests with smartctl.
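
    Something along these lines, run against each member disk in turn (sda is just a placeholder, and a long test can take several hours per drive):

    smartctl -x /dev/sda              # full SMART attributes plus the drive's error and self-test logs
    smartctl -t long /dev/sda         # kick off an extended self-test
    smartctl -l selftest /dev/sda     # check the self-test result once it finishes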

    • Sandshark
      Sensei

      Well, it did the same thing again at the same point, as I was afraid it would, so I aborted and will have to start running some tests on the drives.

      • Sandshark
        Sensei

        OK, so initiating a drive self-test via the GUI found no errors (as also verified via smartctl -x output).

         

        I did do some "poking around" in the BTRFS structure and noticed this:

         

        root@RN4200B# btrfs filesystem df /data
        Data, single: total=27.45TiB, used=26.11TiB
        System, RAID1: total=32.00MiB, used=3.84MiB
        Metadata, RAID1: total=17.00GiB, used=14.18GiB
        Metadata, DUP: total=1.00GiB, used=549.25MiB
        GlobalReserve, single: total=512.00MiB, used=0.00B

        The metadata for the added MDADM layer is separate in DUP format, not RAID1 with the rest.  I'm not sure if that has anything to do with it, but I don't think it's good.  The BTRFS wiki gives this warning: "If the metadata is not converted from the single-device default, it remains as DUP, which does not guarantee that copies of blocks are on separate devices. If data is not converted it does not have any redundant copies at all."

         

         

        Now, since the "device" is an MDADM RAID, not a single drive, that's probably not quite so dire.  But it also doesn't seem right.  It's been a long time since I had a volume with unequal drive sizes, and I may not have even noticed this then, so I don't know if it's normal for a vertically expanded ReadyNAS.  I do have a RAID5 of 12 drives, done by creating it in FlexRAID (with all drives 3TB at that time) and then switching to XRAID, so maybe it would be different in another case.  XRAID likes a 12-drive unit to have RAID6.
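
        To see exactly how those DUP metadata chunks are spread across the md devices, something like this should break the allocation down per device (assuming the btrfs-progs shipped with 6.9.6 supports the usage subcommands):

        btrfs filesystem usage -T /data    # tabular view of data/metadata allocation per device
        btrfs device usage /data           # similar breakdown, grouped by device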

         

        I also looked back, and I did do a balance and scrub right after I added the last of the larger drives, and that scrub only took 26 hours.  I don't know for sure if both metadata sections contained data at that point, though the balance should have caused that to be the case.

         

        Any thoughts on whether a btrfs balance start -mconvert=raid1 would help, or how it might affect future expandability?  I did do that on my main NAS when I manually vertically expanded a volume that's in FlexRAID, but I don't have to worry there about maintaining XRAID expandability.  At the time I did that on the manual expansion, I didn't realize that the newly added layer's metadata being in DUP format was "normal".
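
        For what it's worth, what I have in mind is something like this; the soft modifier should make the balance skip chunks that are already RAID1, if this btrfs-progs version supports it:

        btrfs balance start -mconvert=raid1,soft /data    # convert only the remaining DUP metadata chunks
        btrfs balance status /data                        # check progress from another session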

         

        The next step would be to start pulling drives and testing them in a PC (a long time for my backup system to be down, with 8x3TB and 4x6TB drives to test), or to destroy the volume and re-create it, hoping whatever is causing the problem either goes away or shows up with a vengeance.

         

         
