
Planning Ahead for Capacity Upgrade

btaroli
Prodigy

Planning Ahead for Capacity Upgrade

Well, the ol' 4TB-based volume in my 516 is down below 3TB of free space. FWIW, this somehow got created as a RAID-5 under X-RAID2, but with 4TB disks that wasn't so bad. After some thought and looking at drive prices, I decided I just couldn't wait for 10TB (or bigger) drives to hit the market and drive the 8TB price down.

 

So now what? I've got a new unit. The cost of 8TB disks is such that I'm not going to buy 6 immediately. I had some other spare drives around for testing scenarios, but I decided to pull the trigger on 3 8TB disks for now (just to fill slots I didn't have other drives for initially). Well, ultimately, I learned 4 is really the minimum for a RAID-6 md stripe set, so I knew I'd need more drives... but it turns out (through trial and error) that HOW I install drives in the new NAS makes a big difference in the resulting capacity. And it's not really intuitive why it behaves the way it does without considering how md decides to create stripes, and when that happens.

 

I began my efforts with the 3x8TB disks I purchased and 2x4TB spares (really drives I'd removed to resolve SMART warnings). That's more than 4, so I should be fine. And in each cycle I chose to do a Factory Reset on the new unit, just to keep things "simple." Well, with these 5 drives, I wound up with a RAID-5 configuration. Before talking about capacity, why did this happen? A little fun with mdadm, logged in via ssh, helps establish what it did.

 

root@elmo:~# mdadm --misc --detail /dev/md126
/dev/md126:
        Version : 1.2
  Creation Time : Thu Dec 1 01:10:10 2016
     Raid Level : raid5
     Array Size : 7813757824 (7451.78 GiB 8001.29 GB)
  Used Dev Size : 3906878912 (3725.89 GiB 4000.64 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Thu Dec 1 01:10:10 2016
          State : active, resyncing (DELAYED)
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : 2fe75b42:data-1  (local to host 2fe75b42)
           UUID : e61cd8f4:c4d89150:febd0a00:6253b716
         Events : 1

    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       1       8       20        1      active sync   /dev/sdb4
       2       8       36        2      active sync   /dev/sdc4
root@elmo:~# mdadm --misc --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Thu Dec 1 01:09:18 2016
     Raid Level : raid5
     Array Size : 15608667136 (14885.58 GiB 15983.28 GB)
  Used Dev Size : 3902166784 (3721.40 GiB 3995.82 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Thu Dec 1 01:20:48 2016
          State : active, resyncing
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Resync Status : 1% complete

           Name : 2fe75b42:data-0  (local to host 2fe75b42)
           UUID : a8846149:7267ee9b:98343506:d8b7222e
         Events : 7

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3
       4       8       67        4      active sync   /dev/sde3
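(A quicker way to see the same layout at a glance, for anyone following along at home -- just a sketch, since the md device names can differ between units:)

cat /proc/mdstat                       # one-line summary of every md array, its level, members and resync progress
mdadm --misc --detail /dev/md126       # the full per-array detail shown above
mdadm --misc --detail /dev/md127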

 

/dev/md126 and /dev/md127 are the two devices created when the disks were detected and X-RAID2 did its thing. Why two? Well, with two different disk sizes, two different stripes were created. One stripe, sized by the smaller disks (4TB), actually includes all 5 disks. This means that only half of each 8TB disk is consumed by this stripe. What happens to the rest? X-RAID2 determines there are three drives with 4TB still available and creates another 4TB stripe across those.

 

The tricky, and important, thing about this is that only 3 drives have that extra 4TB capacity. And THIS is why X-RAID2 decides that RAID-5 is the right setup here... RAID-6 requires a minimum of four drives, and it doesn't seem that md wants to mix RAID levels between different stripes in the array. OK.

 

This configuration yields 24TB (minus overhead).
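For anyone who wants to sanity-check that number, here's the back-of-the-envelope math (just a sketch; it ignores partition overhead and TB-vs-TiB rounding, so the GUI shows a bit less):

layer1=$(( (5 - 1) * 4 ))                 # 16TB: RAID-5 across all 5 disks at the smallest size (4TB)
layer2=$(( (3 - 1) * 4 ))                 #  8TB: RAID-5 across the leftover 4TB on the three 8TB disks
echo "total: $(( layer1 + layer2 ))TB"    # total: 24TB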

 

But I really want double parity, so I reduce the installed drives to 1, factory reset, turn off X-RAID2, install the other four disks, delete the default "data" volume (which is in JBOD mode with just the one disk), and then select all five disks to create a new volume under FlexRAID mode. It gives me the option of choosing RAID-6. Upon doing so, the volume that it establishes shows only 10TB available (12TB minus overhead). Counterintuitive! Why did this happen? Another peek at mdadm output:

 

root@elmo:~# mdadm --misc --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Thu Dec 1 01:32:50 2016
     Raid Level : raid6
     Array Size : 11706506496 (11164.19 GiB 11987.46 GB)
  Used Dev Size : 3902168832 (3721.40 GiB 3995.82 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Thu Dec 1 01:32:51 2016
          State : active, resyncing
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Resync Status : 0% complete

           Name : 2fe75b42:data-0  (local to host 2fe75b42)
           UUID : f97925ee:989f6849:d6e4bd28:d786bc10
         Events : 1

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       51        2      active sync   /dev/sdd3
       3       8       35        3      active sync   /dev/sdc3
       4       8       67        4      active sync   /dev/sde3

There's no /dev/md126 this time (not sure why), just /dev/md127... but the meaning is that there is ONE STRIPE, yielding 3x4TB before overhead... which is where we get the 10TB of free space. The output also reflects that the RAID level is 6. But what happened to the unused 4TB on the 8TB disks? Well, it seems that when you request RAID-6, it will only create stripes where it can. Since there are only 3 drives with an extra 4TB available, no second RAID-6 stripe is created. So that extra parity made a big difference in this case.
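Same rough math as before, except RAID-6 costs two disks per stripe, and a stripe is only created if at least four disks can join it:

echo $(( (5 - 2) * 4 ))TB    # 12TB raw (shown as ~10TB after overhead)
# the leftover 4TB exists on only 3 disks -- fewer than 4, so no second RAID-6 stripe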

 

But in addition to the mixture of disk sizes, which disks are available at the time the volume is created (versus added later, with X-RAID2) also matters. If, for example, I begin with my three 8TB disks -- we'll assume RAID-5 here -- then I get 16TB (minus overhead). If I then insert my two 4TB disks later, a tricky thing happens. md already created a stripe using the full 8TB of my first 3 disks. That means the newly inserted 4TB disks can only be mirrored, and the way md arranges this new stripe isn't quite the way it did it with the first batch. Reminds me of concatenating stripes, from the old block storage days. So overall I get 16+4=20TB (before overhead), and the order in which the disks were available changes what was 24TB (in the first example) to 20TB here. Hmm.
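And the rough math for this ordering:

first=$(( (3 - 1) * 8 ))                 # 16TB: RAID-5 across the 3x8TB disks created first
mirror=4                                 #  4TB: the 2x4TB disks added later can only mirror each other
echo "total: $(( first + mirror ))TB"    # total: 20TB, versus 24TB when all five disks start together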

 

I guess in the normal case for X-RAID2, we'd expect smaller disks to be installed first, so that when larger disks are added or swapped in, their additional capacity will cause the creation of a new stripe. But when you're Factory Resetting a NAS with different disk sizes, it's worth keeping this in mind. If the smaller disks aren't present at the time the reset is done, it really affects the outcome.

 

So of course, I have more 8TB disks coming now, and I'll need to wait and do a factory reset once they're here in order to execute my plan. Since my requirement is double parity, I must have at least 4 disks participating in each stripe. And since I expect to have both 4TB and 8TB disks present, I must ensure that I either start with only the 4TB disks installed, or have both sizes present at the start... so that multiple stripes are created (as the usual X-RAID2 growth pattern expects).

 

Given that all my working 4TB drives are in my 6-bay NAS, that means I need to use my older 4TB cast-aways to help initialize the new volume. After considering that starting with just the 4x8TB disks (RAID-6) would yield only 2x8TB of space (16TB), while my current volume is approx 20TB, I'm opting for a 4x8TB and 2x4TB initial mix instead, giving 4x4TB + 2x4TB (24TB). Additionally, it gives me the option of adding another 8TB later by upgrading the remaining 4TB drives as I wish.
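For the record, the rough math behind that choice (same caveats as before about overhead and rounding):

layer1=$(( (6 - 2) * 4 ))                      # 16TB: RAID-6 across all six disks at the smallest size (4TB)
layer2=$(( (4 - 2) * 4 ))                      #  8TB: RAID-6 across the leftover 4TB on the four 8TB disks
echo "now:   $(( layer1 + layer2 ))TB"         # 24TB
layer2_later=$(( (6 - 2) * 4 ))                # upgrading the two 4TB drives to 8TB grows the second stripe
echo "later: $(( layer1 + layer2_later ))TB"   # 32TB, i.e. another 8TB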

 

But what this little experience also points out to me is that upgrading capacity later may not just be a matter of adding one or two drives when 12 or 16TB drives come along. Why? Well, even if md lets me swap in a 12TB drive -- yes, I know they don't exist yet and the real capacity will probably be different, but we have 4TB stripes now so bear with me -- the new 4TB stripe would have just ONE drive in it... no redundancy. And as we already established, to have full RAID-6 redundancy you have to have at least four disks. So to play by the rules, if I were to swap 12TB drives in (yes, they'd need to be done one at a time when hot-swapping later), the goal really would be to do at least FOUR such drives before considering the upgrade complete. Otherwise, I may have a mix of RAID levels that could leave me vulnerable to data loss if two (or even one) drives were lost... until I have at least 4 drives in the new stripe.
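If you ever want to check where a box stands mid-upgrade, the per-stripe RAID level and member count are easy to pull over ssh (a sketch; the md device names will vary):

cat /proc/mdstat                                                        # every md stripe, its level and members
mdadm --detail /dev/md127 | grep -E 'Raid Level|Raid Devices|State'     # the key fields for one stripe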

 

Funny how complicated something so simple can be. I've read warnings elsewhere about the real redundancy level in capacity upgrade scenarios... and after learning a bit in this experience I thought it might be nice to share WHY you might experience reduced effective redundancy when upgrading.

 

Enjoy!

 

P.S. It would be really nice if someday we evolved this situation to something more like ZFS or Btrfs, which don't stripe disks at all... each disk just adds blocks to the pool and the filesystem figures out how many replicas of each block it needs for the desired redundancy level -- not a simple thing. And, if you make a big change in the number or size of drives, you can even ask the filesystem to go around and "rebalance" blocks according to the new space available -- though not every filesystem supports this directly.
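(For the curious: plain multi-device Btrfs already exposes that rebalance operation. This is just an illustration against a hypothetical pool at /mnt/pool -- not something to run on a ReadyNAS data volume, which layers Btrfs on top of md:)

btrfs device add /dev/sdx /mnt/pool     # add a disk to the pool
btrfs balance start /mnt/pool           # rewrite existing block groups across all devices
btrfs filesystem show /mnt/pool         # per-device allocation before/after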

Model: RN51600|ReadyNAS 516 6-Bay
Message 1 of 24


All Replies
mdgm-ntgr
NETGEAR Employee Retired

Re: Planning Ahead for Capacity Upgrade

The volume will only expand when redundant space can be added. So if you have just one 12TB disk, then after the RAID-6 volume is rebuilt you will still have dual-redundancy.

 

After you've created the RAID-6 volume you can re-enable X-RAID.

 

In fact, depending on what disks are installed, with a RAID-5 volume of three or more disks you could disable X-RAID and designate the next empty slot so that when it is filled, the disk is used to add parity (i.e. convert to RAID-6). This conversion does take a long time, though.

Message 2 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

I did notice the option for adding a parity disk along the way, in FlexRAID mode, but these drives are a bit too large to want to wait around for that to happen. I considered the possibility of migrating the disks over to the new NAS. I suppose if that worked (both are on 6.6), then I could add a second parity disk to the volume and then single-swap in 8TB disks...? Might save time overall, instead of syncing data and setup around...
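From what I can tell, under the hood that parity-add is an md reshape -- roughly the operation below (illustration only, assuming a 4-member RAID-5 at /dev/md127 and a freshly added fifth member; on a ReadyNAS the OS drives this itself, so I wouldn't run it by hand):

mdadm --grow /dev/md127 --level=raid6 --raid-devices=5 --backup-file=/root/md127-grow.bak
cat /proc/mdstat    # watch the (very long) reshape progress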

Message 3 of 24
StephenB
Guru

Re: Planning Ahead for Capacity Upgrade


@btaroli wrote:

 

Funny how complicated something so simple can be. 

 


It would be nice if Netgear had an upgrade planning tool app built into the NAS - showing you what would happen on your specific system if you upgrade or add disks.  I added that to the idea exchange here: https://community.netgear.com/t5/Idea-Exchange-for-ReadyNAS/Capacity-Planning-App/idi-p/1180068#M728


@btaroli wrote:

 

 

P.S. It would be really nice if someday we evolved this situation to something more like ZFS or Btrfs, which don't stripe disks at all... 


FWIW I don't think you are using the term "stripes" correctly above. Striping generally means the finer-grained organization of data+parity blocks. A RAID array of equal sized disks is still striped.  I usually use the term "layer" where you are using "stripe" - though there might be a better term for it.  That is, with XRAID2, mdadm is creating multiple raid layers (each with its own set of partitions) and is assembling them into a single volume.  Each layer has its own raid striping.
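You can actually see those layers directly on the NAS: each disk carries one data partition per layer, and each md device assembles one partition from every disk. Something like this over ssh (device names taken from the output above):

cat /proc/mdstat                        # one md device per RAID layer
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT      # shows the sdX3/sdX4 partitions feeding md126/md127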

 

Another aside - dual-redundancy could potentially be achieved if there are 3 disks of the largest size, but Netgear has chosen not to do it.  You'd use triple RAID-1 for the uppermost RAID layer.

 

I agree long term that there could be a lot of benefits to integrating parity blocks into the file system itself. From a performance perspective, RAID resync now syncs everything, even free space. Why protect the free space? Also, the newer schemes under development could potentially let you set different levels of protection for different files - allowing you to reduce or eliminate protection on data that doesn't need it, and increase protection on data that is more precious to you. Metadata could also be protected at a higher level, reducing the chance of file system corruption.

 

I think it will take some time before this approach will be production ready though.

 


 

Message 4 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

Yes, I was being a bit loose with "stripe" but hopefully it's enough to get the idea across anyway. 🙂 I'll check out that suggestion you posted. I think that would be awesome, and help avoid potentially nasty surprises for us users. 😄

Message 5 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

So I've got the disks over in the other box now. And I've begun the process of adding a parity disk. Given the fullness of the volume, it suggests it will take 10 days! We shall see... but I have no reason to doubt it. heh

Message 6 of 24
StephenB
Guru

Re: Planning Ahead for Capacity Upgrade


@btaroli wrote:

So I've got the disks over in the other box now. And I've begun the process of adding a parity disk. Given the fullness of the volume, it suggests it will take 10 days! We shall see... but I have no reason to doubt it. heh


The resync time doesn't depend on the volume fullness.  But of course it does take a while - every sector on every disk in the data volume needs to be either read or written at least once.

 

For other posters - OS-6 RAID doesn't actually use the "parity disk" idea.  Parity and data blocks are evenly distributed across all the disks.  That spreads the I/O across all drives, which has performance benefits.
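If you want to watch the sync (or see why it's pacing itself), the standard md interfaces are available over ssh:

cat /proc/mdstat                            # progress, speed and estimated finish time
cat /proc/sys/dev/raid/speed_limit_min      # KB/s per disk -- md throttles resync between
cat /proc/sys/dev/raid/speed_limit_max      # these two limits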

Message 7 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

Certainly md's idea of resync is different than Btrfs's 😉 I guess that's one of the interesting parts of this pairing.

Message 8 of 24
mdgm-ntgr
NETGEAR Employee Retired

Re: Planning Ahead for Capacity Upgrade

RAID-6 requires a minimum of four disks for each layer.

Message 9 of 24
StephenB
Guru

Re: Planning Ahead for Capacity Upgrade


@mdgm wrote:

RAID-6 requires a minimum of four disks for each layer.


Yes.  Triple RAID-1 could do it with three, but XRAID2 and flexraid don't use it.  So you need four.

Message 10 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

But why play block-level RAID games anymore? Time for ZFS or Btrfs to properly take over. 😉 I haven't been closely following the progress on making Btrfs "raid" protection more solid. But I know ZFS is already there. Between the two, I think I prefer Btrfs's approach because it makes changing protection levels a bit more fluid, as opposed to ZFS, which creates more rigid relationships between physical devices.

 

As an update, we're over a day into adding a second parity to this array and it's not quite 10%. Looking very much like it's going to take two weeks! Crazy.

Message 11 of 24
StephenB
Guru

Re: Planning Ahead for Capacity Upgrade


@btaroli wrote:

 I haven't been closely following the progress on making Btrfs "raid" protection more solid.

Not production ready yet. https://btrfs.wiki.kernel.org/index.php/RAID56
Message 12 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

Well, I've gotten into a bit of a pickle, it seems. I was working on the config for one of my installed apps, but its service got stuck (zombied), so I used FrontView to reboot the NAS. The resync for the volume is at about 20% (after a couple of days). The system goes through a normal reboot, and I see the disks fire back up and all looks normal... except that /data appears gone. 😞

 

So far, what I can tell from mdadm is that the array is there, in State "clean, degraded, reshaping", and Reshape Status is 22%. Strangely, I received the following messages from ROS:

 

02:01:27 Remove inactive volumes to use the disk. Disk ...

02:02:01 Remove inactive volumes to use the disk. Disk ...

02:02:20 Resyncing started for Volume data

02:06:00 Remove inactive volumes to use the disk. Disk ...

 

On the front panel, I see a periodic display of "Reshape data ...%", but of course when I bring up the volume display it says "data 0/0TB". In FrontView, the volume is red, but it also says "Resyncing in progress: ....% complete. Remaining time: 240+ hours". I'm paraphrasing some of the figures, just because they're changing.

 

So... it would certainly appear the DATA is there, but any thoughts on why it's reacting as if the /data volume isn't? What's the best course at this point? Do I simply have to expect /data to be unavailable until the reshape completes now? Or is there a way to get /data reconnected while that's in progress? And, if I do /nothing/, what happens when the reshape is completed? Does it just remount the volume and everyone's happy?

 

I suppose I could open a support ticket too, but since I've been tracking this here I thought I'd update.

Message 13 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

ticket 27750873 submitted

Message 14 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

Replaced by ticket 27750891, since I goofed and entered the wrong email in the first one. Fun.

Message 15 of 24
mdgm-ntgr
NETGEAR Employee Retired

Re: Planning Ahead for Capacity Upgrade

This appears to be a minor issue that in some rare cases may be run into when migrating disks from one chassis to another.

 

Edit: Your system looks fixed now.

Message 16 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

Yes, it seems so. 🙂 A bit of a scare! I hadn't seen this sort of thing before, as I've never moved disks between boxes. Appreciate your help!

Message 17 of 24
FramerV
NETGEAR Employee Retired

Re: Planning Ahead for Capacity Upgrade

Hi btaroli,

 

If your issue is now resolved, we encourage you to mark the appropriate reply as the “Accept as Solution” or post what resolved it and mark it as solution so others can be confident in benefiting from the solution. 
 
The Netgear community looks forward to hearing from you and being a helpful resource in the future!
 
Regards,

Message 18 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

Well, this thread already has a "solution," since it was meant to be informational anyway. 🙂 But in the case of the volume that was resyncing but not mounting, the issue was that the disks had been moved between NAS'es. Apparently, the configuration managed by ROS behind the scenes applies a label to the /data volume made up of the hostid (not the host name; the output of the hostid shell command) and the volume name, usually "data" (for X-RAID), so "hostid:data". ROS uses that volume label to find and mount the /data volume.

 

But when disks are moved between NAS'es, that label won't match. Apparently, there is logic to detect this, relabel the volume, and adjust the configuration so that things "just work." Only in this case that didn't happen for some reason, and eventually my ROS install decided that it couldn't mount the /data volume. But mdadm *did* see the device array and was busy resyncing it as before. So this gave the very odd presentation of a volume that wasn't available and yet was. 🙂

 

The solution was to manually update the volume label and adjust associated ROS configuration so that the expected label was present and mounted at boot time. Of course, this was conducted by Netgear support over an enabled Support Access shell. 🙂 But just sharing the basic details here for completeness, since you asked.
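For anyone curious what that looks like from the shell, the read-only checks are simple (a sketch; md127 is the data device on my unit, yours may differ):

hostid                               # the 8-hex-digit host id ROS builds the label from
btrfs filesystem label /dev/md127    # prints the current label on the data volume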

 

 

Oh, and for the record, the rebuild (for second parity stripe) is now at 38%. Holy cow... never ever run these big drives with less than RAID6, folks. I can't imagine what rebuild times are going to be like once I get enough 8TB disks in here to make the second 4TB stripe in the array active. LOL

 

Question... does the mdadm rebuild benefit at all from additional RAM? I know I've got this particular NAS under some memory pressure since it arrived with only 4GB and my installed addons, plus normal usage, keep used memory (w/o cache) at just over 2GB.

Message 19 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

Nearly 70% today... and then a bad turn. Someone turns off the power strip with the NAS on it. I find it later and it boots up with the volume mounted, but the resync doesn't seem to be running despite showing 69.4% completed. I try a clean restart from the web interface and it comes up with no data volume. It shows an "offline" volume with all the drives in it, but I'm not sure how to proceed.

 

Submitted SR 27783352 and waiting to hear back. Logs and diagnostic access included.

Message 20 of 24
FramerV
NETGEAR Employee Retired

Re: Planning Ahead for Capacity Upgrade

Hi btaroli.

 

Given the event that occurred and its result, it was a good idea to get support involved already. I would have advised doing so in any case.

 

They should provide updates once the checking on the back-end is done.

 

 

Regards,

Message 21 of 24
FramerV
NETGEAR Employee Retired

Re: Planning Ahead for Capacity Upgrade

Hi btaroli.


We’d greatly appreciate hearing your feedback letting us know if the information we provided has helped resolve your issue or if you need further assistance. 

If your issue is now resolved, we encourage you to mark the appropriate reply as the “Accept as Solution” or post what resolved it and mark it as solution so others can be confident in benefiting from the solution. 
 
The Netgear community looks forward to hearing from you and being a helpful resource in the future!
 
Regards,

Message 22 of 24
btaroli
Prodigy

Re: Planning Ahead for Capacity Upgrade

Well, sadly, it was not to be: the RAID-5 to RAID-6 conversion could not be safely resumed, and it's not entirely clear why it failed to recover after the apparent power event. But the good news is that the support heroes did get the volume into read-only mode, and we were able to spare 12TB of data from a ghastly fate. I've destroyed and re-created the "data" volume on the affected device, and it's already gotten through 63% of the initial resync /while/ 12TB and over 3M files were copied back to it in about a day.

 

So if I'm ever again faced with the choice between migrating existing disks and converting to RAID-6, versus building fresh and copying the data over when enough storage and a spare NAS head are available, I know which route I'LL take. 😉 It's all a learning process, after all. 😄

Message 23 of 24
FramerV
NETGEAR Employee Retired

Re: Planning Ahead for Capacity Upgrade

Hi btaroli,

 

Thank you for updating us, and it's good to hear that you were able to recover some of the files. Feel free to mark any of the posts as an accepted solution so others may be guided on what to do, just in case.

 

 

Regards,

Message 24 of 24