
Laserbait
Luminary
Jun 18, 2024

The many, MANY days of scrubbing.

RN516 - OS 6.10.9
Intel XEON E3-1265L V2 @ 2.50GHz
16GB RAM

6 x 10 TB HGST HUH721010ALE601 disks in a RAID 6


So, I decided to do a little preventative maintenance on my RN516, LOL:

 

It still has not finished as of June 18th, ~3pm Central time:

As you can see, the last time I ran a scrub, it finished in a couple of days (more or less).  Looks like it was about 60% utilized at that time.  Now I'm looking at 11 days (and counting). 😬 


I looked at the SMART (smartctl -x /dev/sdX) data for the disks, and it was clean.   

 


It's interesting though, because all 8 threads of the CPU are running near 100% utilization during this entire time, but the disk throughput is very low.  There is nothing else running on this array, but it is about 80% full. Memory usage is squat.

 


It's primarily the backup target for my Veeam agents, and ReadyDR from my RN316.  Veeam is pretty unhappy with it right now, as most of the backups are failing (presumably due to high latency from the maxed-out CPU).  But ReadyDR seems fine with it.  Hopefully it'll be done within the next day or two so I can turn my Veeam agents back on.

10 Replies

  • StephenB
    Guru - Experienced User

    I suggest looking for disk errors in the logs.

     

    You can check the progress with ssh (keeping in mind that the system is doing both a RAID mdadm scrub and a BTRFS scrub).
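
A minimal sketch of checking both operations from an ssh session. It assumes the data array is /dev/md127 (the device that appears later in this thread); the function name is hypothetical, so adjust both to your system.

```shell
# Show progress of the mdadm resync and of the BTRFS scrub.
# Assumption: the data volume sits on /dev/md127; pass another device if yours differs.
show_scrub_progress() {
    dev="${1:-/dev/md127}"
    # mdadm side: an active check/resync shows a progress line in /proc/mdstat.
    if [ -r /proc/mdstat ]; then
        cat /proc/mdstat
    fi
    # BTRFS side: bytes scrubbed so far plus the error counters.
    btrfs scrub status "$dev"
}
# Usage (as root on the NAS): show_scrub_progress /dev/md127
```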

     

    How big is the volume?

    • Laserbait
      Luminary

      Hey there!   The volume is 36.4TB (6x 10 TB disks in a RAID 6).   I didn't see any SMART errors.  What log do I find disk errors in?

      I checked dmesg, and the last messages that I see are from the resync of md127, and that completed in a pretty reasonable amount of time:

      [Fri Jun 7 02:23:13 2024] md: requested-resync of RAID array md127
      [Fri Jun 7 02:23:13 2024] md: minimum _guaranteed_ speed: 30000 KB/sec/disk.
      [Fri Jun 7 02:23:13 2024] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
      [Fri Jun 7 02:23:13 2024] md: using 128k window, over a total of 9761587136k.
      [Mon Jun 10 20:04:40 2024] md: md127: requested-resync done.

      There's nothing after that.

       

       

       

      I checked the status of the btrfs scrub, and it's diligently ticking away, no major errors reported:

       

      root@RN516:/home/admin# btrfs scrub status -R /dev/md127
      scrub status for 3e0931f1-84c5-45a4-9db5-e1d7f61ce675
      scrub started at Fri Jun 7 02:26:46 2024, running for 278:22:08
      data_extents_scrubbed: 498981037
      tree_extents_scrubbed: 2327562
      data_bytes_scrubbed: 32294854987776
      tree_bytes_scrubbed: 76269551616
      read_errors: 0
      csum_errors: 0
      verify_errors: 0
      no_csum: 1019
      csum_discards: 0
      super_errors: 0
      malloc_errors: 0
      uncorrectable_errors: 0
      unverified_errors: 0
      corrected_errors: 0
      last_physical: 32481801666560

       

      • StephenB
        Guru - Experienced User

        Laserbait wrote:

        I checked dmesg, and the last messages that I see are from the resync of md127, and that completed in a pretty reasonable amount of time:


        So it is just the BTRFS scrub that is glacially slow.  Looking at data_bytes_scrubbed, it appears to be about 80% done (32 TB out of 40).  At that rate it would have about 68 hours to go from the time you captured that status.  So 2-3 more days to go.
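
The estimate above is a simple rate extrapolation, sketched here with awk. The scrubbed byte count and elapsed time come from the status output earlier in the thread. The 40 TB total is the round figure used in this reply and is an assumption, and the function name is hypothetical.

```shell
# Estimate the remaining scrub time in hours, assuming the average rate so far holds.
# $1 = bytes scrubbed so far (data_bytes_scrubbed)
# $2 = total bytes to scrub (assumed; substitute the volume's actual used space if known)
# $3 = elapsed time as H:M:S ("running for" in the status output)
scrub_eta_hours() {
    awk -v scrubbed="$1" -v total="$2" -v elapsed="$3" 'BEGIN {
        split(elapsed, t, ":")
        hours = t[1] + t[2] / 60 + t[3] / 3600
        # remaining time = elapsed time * (remaining bytes / scrubbed bytes)
        printf "%.1f\n", hours * (total - scrubbed) / scrubbed
    }'
}

scrub_eta_hours 32294854987776 40000000000000 278:22:08
```

With the exact byte count instead of the rounded 32 TB, this comes out to roughly 66 hours, in line with the 2-3 day figure.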

         

        Generally with BTRFS, the time it takes for balance and scrub operations to complete depends on how much work the file system needs to do.  So the next time you run it, it should go much quicker (likely completing before the mdadm sync).  If you've never run a balance, then that could also take a long time the first time you run it.

         

        It can be canceled from ssh if necessary.  But if you can live with the performance hit a bit longer, it might be better to let it finish.
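
A sketch of canceling from ssh, again assuming the /dev/md127 device from this thread; the wrapper names are hypothetical. `btrfs scrub cancel` saves the scrub position, so `btrfs scrub resume` can pick up from there rather than starting over.

```shell
# Stop a running BTRFS scrub; progress is saved so it can be resumed later.
# Assumption: the volume's device is /dev/md127.
pause_scrub() {
    btrfs scrub cancel "${1:-/dev/md127}"
}

# Continue a previously canceled scrub from its saved position.
resume_scrub() {
    btrfs scrub resume "${1:-/dev/md127}"
}
# Usage (as root on the NAS): pause_scrub /dev/md127   then later   resume_scrub /dev/md127
```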

         


        Laserbait wrote:

          What log do I find disk errors in?


        First, I doubt you'll find any.  They would have shown up during the mdadm sync, and should also have appeared in the SMART data.

         

        When you are using ssh, you can just use journalctl to look at the logs.  I usually reverse the order with -r to display the newest entries first.  You can add --no-pager and then pipe the output through grep to find specific info.
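
A sketch of that pipeline. The grep patterns are illustrative assumptions (common kernel disk-error signatures), not an exhaustive or official list, and the function name is hypothetical.

```shell
# Keep only log lines on stdin that look like disk errors.
# Assumption: these patterns are examples; extend them for your hardware.
filter_disk_errors() {
    grep -iE 'i/o error|medium error|ata[0-9]+.*(error|failed)|blk_update_request'
}
# Usage on the NAS, newest entries first:
#   journalctl -r --no-pager | filter_disk_errors | head -n 50
```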

         

        When looking at the log zip, disk errors could be in dmesg.log, system.log, kernel.log, and systemd-journal.log.
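
To see which of those files is worth opening first, a sketch that counts suspicious lines per file in an extracted log zip. The directory argument, the function name, and the patterns are assumptions; only the four file names come from the list above.

```shell
# Print a per-file count of lines that look like disk errors.
# $1 = directory where the log zip was extracted (hypothetical path)
count_disk_errors() {
    dir="$1"
    for f in dmesg.log system.log kernel.log systemd-journal.log; do
        if [ -f "$dir/$f" ]; then
            # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
            n=$(grep -ciE 'i/o error|medium error|ata[0-9]+.*(error|failed)' "$dir/$f" || true)
            echo "$f: $n"
        fi
    done
}
# Usage: count_disk_errors ./extracted-logs
```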
