Forum Discussion
Sandshark
Mar 30, 2016 · Sensei - Experienced User
Inaccurate reporting of used/free volume space
I have been moving all of my legacy equipment from OS 4.2.x to 6.4.2, so I have to factory default each of them with the new OS. As I do so, I've been making sure I have a new backup of the data before destroying one of the existing two copies. That means I've been moving a lot of data on and off one system for temporary storage, and have recently filled it beyond the recommended limit. While access slowed as it got very full, that was the only performance issue I saw until now. I just deleted a very large amount of data, but did not delete the share that contained it nor any snapshots.
When the volume was almost full (a bit more than 2% still available, but that's still over 350GB reported free), the first issue appeared. Trying to add more files, either via Windows drag-and-drop or a ReadyNAS backup job, failed due to being "out of space". Well, how can it be out of space and still report 350GB available on both the Shares and Volume pages? If it's not really available, it should not be reported as available.
Then, after I deleted that significant mass of files, the available space barely changed. I rather expected that because of the snapshots. But instead of the Shares page indicating most of it was used by snapshots, it indicated it was active storage. The total of the shares' "Consumed" space was far less than the total "Data" space shown on the Shares and Volume pages, and Windows shows those consumed-space numbers to be accurate. The available space reported in Frontview agrees with a df command from SSH, but I don't know how to differentiate snapshot vs. active space in SSH. I suppose this could be a BTRFS issue, not a ReadyNAS-unique one. Unfortunately, I didn't think to go to SSH when I was "out of space".
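(From later reading, it looks like per-subvolume accounting is possible from SSH if BTRFS quotas are enabled. This is just my guess at a generic BTRFS approach, not something I've tried on the ReadyNAS, and enabling quotas can take a while on a large volume:)
# Shares and snapshots both show up as subvolumes
btrfs subvolume list /data
# Per-subvolume accounting needs quotas; the initial rescan can be slow
btrfs quota enable /data
btrfs qgroup show /data    # the "excl" column is space held only by that subvolume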
So, I could have deleted the snapshots and seen what happened, but I decided to start another backup instead. Now comes the real test -- will the system delete snapshots and make the space available, as it should, or will it still say I'm out of space? Currently, the available space continues to drop, and it has dropped below where it "ran out" before without failing. Stay tuned for reports on what happens next, and any suggestions as to what to try are welcome. If it's a bug, what can I do to help pin it down? If this is simply "the nature of the beast", it would be nice to characterize that nature so as not to get bitten too often.
While this is happening on a legacy Pro6 system with 6.4.2, I don't see why it would be unique to that.
OK, so now that all of the back-and-forth transferring of files is done, I did some more looking into things. It turns out I still had about 1TB of "missing" space that I mistakenly thought was accounted for in private shares. (See my message in Idea Exchange about adding private share usage to the Shares page.)
Well, it turns out that this really is the nature of the beast. And the name of that beast is BTRFS. Once I learned that the standard Linux tools for checking drive space won't tell the whole story on BTRFS, and picked up a bit about B-trees, metadata, and such, it turned out that OS6 is reporting exactly what BTRFS is telling it. As I went on experimenting, I used a mix of btrfs commands in SSH and Frontview to check space, and they always agreed. Frontview just doesn't give you the whole story either -- the unbalanced metadata that's still taking up space.
It seems that metadata can use up allocated space while disconnected (my term) from any real or snapshot data. That results in an unbalanced system -- for which a balance is the cure. But it also seems like the balance available in the Frontview UI is not a complete balance; it even says it reclaims only chunks with 50% or less use. It also seems that not all space reclaimed by a balance is immediately visible, but that either time or a reboot is required (I rebooted, so I don't know if time alone would have done the same thing).
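(A rough way to gauge how unbalanced things are from SSH, assuming the data volume is mounted at /data as it is on OS6, is to compare allocated versus used space per chunk type; these are standard btrfs-progs commands:)
# "total" is space allocated to chunks, "used" is what's actually in them;
# a big gap between the two is what a balance can reclaim
btrfs filesystem df /data
# per-device view of how much raw space has been allocated
btrfs filesystem show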
So, what is the best way to end up in this pickle with an unbalanced system? Do exactly what I did -- delete a large group of files and then immediately start refilling that space before the system can balance itself out. Note to self: don't do that! It is unclear to me whether, or how much, this would have been a problem if snapshots were not enabled, but I don't think they have a lot of effect.
So, here is what I did to get back my lost space:
First, I ran the balance available in Frontview. That freed up some space, but not a lot. I wish I had done a reboot then, as it may have freed more than I realized. Nothing I found on BTRFS indicated a reboot or unmount/mount should be necessary, but it certainly had a big effect when I finally did one.
Having not freed as much as I expected, I then went to SSH and started successively running a balance with larger and larger dusage options, such as:
btrfs balance start -dusage=55 /data
Until I got past 50, it didn't free anything, so I think that's what is done in Frontview. Plus, that matches the comment shown when you run it. The tradeoff is time, so I can see why a complete balance would not be the standard.
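(If you want to creep up on it the same way rather than jump straight to a full balance, a small shell loop like this should do it -- I actually typed the commands by hand, so this is just a sketch:)
# relocate only data chunks that are no more than the given percentage full,
# stepping the threshold up each pass
for pct in 55 60 70 80 90 100; do
    btrfs balance start -dusage=$pct /data
done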
But I went all the way to dusage=100 and still wasn't seeing all my space. Finally, I saw somewhere that the following does a complete balance:
btrfs balance start /data
And, sure enough, that said it relocated the most yet (12 out of 12 chunks), though it took a good long while to do it. But there was still a lot of space missing. Here is where I decided to reboot. And, lo and behold, all my precious space was back once I did. At what point a reboot would have shown most of it already reclaimed, I unfortunately do not know.
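(Since a full balance runs for quite a while, it's handy to know you can check on it from a second SSH session; these are standard btrfs-progs commands, though I can't say which ones ship on every OS6 build:)
# show how many chunks the running balance has considered/relocated so far
btrfs balance status /data
# a running balance can also be paused or cancelled if it gets in the way
btrfs balance pause /data
btrfs balance cancel /data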
So, for those who have "lost" space, try a rebalance and a reboot. But it might be nice to have options in Frontview as to the depth of the balance and an easy way to assess the imbalance without going to SSH. My legacy systems were out of warranty long before I moved them to OS6, so dropping down to SSH is not an issue for me. But for many, it would be. Also, I knew if all else failed or I made things worse, I could factory default this system.
For those interested, I got most of my newfound BTRFS knowledge here: https://btrfs.wiki.kernel.org/index.php/Main_Page
7 Replies
- mdgm-ntgr · NETGEAR Employee Retired
Sandshark wrote:
That means I've been moving a lot of data on and off one system for temporary storage, and have recently filled it beyond the recommended limit. While access slowed as it got very full, that was the only performance issue I saw until now. I just deleted a very large amount of data, but did not delete the share that contained it nor any snapshots.
If the data was there when a snapshot that currently exists was taken then deleting the data will not free up space. You need to delete the data, and all the snapshots that contain the data to free up space.
Sandshark wrote:
When the volume was almost full (a bit more than 2% still available, but that's still over 350GB reported free), the first issue appeared. Trying to add more files, either via Windows drag-and-drop or a ReadyNAS backup job, failed due to being "out of space". Well, how can it be out of space and still report 350GB available on both the Shares and Volume pages? If it's not really available, it should not be reported as available.
That is extremely full. You should schedule regular volume maintenance, e.g. defrag and balance. A balance should resolve this particular problem; however, you should get volume usage back down below 80% before running a balance. If you run a balance with a very full volume, the balance may fail.
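(For anyone who would rather poke at this from SSH, rough command-line equivalents would look something like the lines below. The scheduled maintenance in the admin UI is the supported route, /data/someshare is just a placeholder share name, and note that a recursive defrag un-shares extents with snapshots, so it can temporarily increase space usage:)
# balance, reclaiming only chunks that are half full or less
btrfs balance start -dusage=50 /data
# recursive defragment of one share; with snapshots present this breaks
# the sharing of extents and can consume more space, so use with care
btrfs filesystem defragment -r /data/someshare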
Sandshark wrote:
Then, after I deleted that significant mass of files, the available space barely changed. I rather expected that because of the snapshots. But instead of the Shares page indicating most of it was used by snapshots, it indicated it was active storage. The total of the shares' "Consumed" space was far less than the total "Data" space shown on the Shares and Volume pages, and Windows shows those consumed-space numbers to be accurate. The available space reported in Frontview agrees with a df command from SSH, but I don't know how to differentiate snapshot vs. active space in SSH. I suppose this could be a BTRFS issue, not a ReadyNAS-unique one. Unfortunately, I didn't think to go to SSH when I was "out of space".
When you upgraded to OS6 did you upgrade straight to 6.4.x or did you put older firmware on first then update to newer OS6 firmware over time?
Have you tried refreshing the page in the UI to get more up-to-date numbers?
Sandshark wrote:
So, I could have deleted the snapshots and seen what happened, but I decided to start another backup instead. Now comes the real test -- will the system delete snapshots and make the space available, as it should, or will it still say I'm out of space?
When volume usage exceeds 95% the oldest smart snapshots will be automatically deleted till volume usage falls back below 95%. It can take a while for the space to be freed up especially if you have a lot of snapshots. So you could put data on a lot quicker than it can be freed up.
Have you tried refreshing the page in the UI to get more up-to-date numbers?
Sandshark wrote:
If it's a bug, what can I do to help pin it down? If this is simply "the nature of the beast", it would be nice to characterize that nature so as not to get bitten too often.
Doesn't sound like a bug.
Sandshark wrote:
While this is happening on a legacy Pro6 system with 6.4.2, I don't see why it would be unique to that.
This is not something specific to running OS6 on legacy devices.
Can you send in your logs (see the Sending Logs link in my sig)?
- Sandshark · Sensei - Experienced User
mdgm wrote:
Sandshark wrote:
That means I've been moving a lot of data on and off one system for temporary storage, and have recently filled it beyond the recommended limit. While access slowed as it got very full, that was the only performance issue I saw until now. I just deleted a very large amount of data, but did not delete the share that contained it nor any snapshots.
If the data was there when a snapshot that currently exists was taken then deleting the data will not free up space. You need to delete the data, and all the snapshots that contain the data to free up space.
Yes, but it should show the correct snapshot to live data ratio, which it did not.
Sandshark wrote:
When the volume was almost full (a bit more than 2% still available, but that's still over 350GB reported free), the first issue appeared. Trying to add more files, either via Windows drag-and-drop or a ReadyNAS backup job, failed due to being "out of space". Well, how can it be out of space and still report 350GB available on both the Shares and Volume pages? If it's not really available, it should not be reported as available.
That is extremely full. You should schedule regular volume maintenance, e.g. defrag and balance. A balance should resolve this particular problem; however, you should get volume usage back down below 80% before running a balance. If you run a balance with a very full volume, the balance may fail.
As I said, I am using this as temporary storage as I upgrade other NASes. Once I'm done, I'm going to factory default it. This is somewhat of a "torture test", though not initially intended to be so.
Sandshark wrote:
Then, after I deleted that significant mass of files, the available space barely changed. I rather expected that because of the snapshots. But instead of the Shares page indicating most of it was used by snapshots, it indicated it was active storage. The total of the shares' "Consumed" space was far less than the total "Data" space shown on the Shares and Volume pages, and Windows shows those consumed-space numbers to be accurate. The available space reported in Frontview agrees with a df command from SSH, but I don't know how to differentiate snapshot vs. active space in SSH. I suppose this could be a BTRFS issue, not a ReadyNAS-unique one. Unfortunately, I didn't think to go to SSH when I was "out of space".
When you upgraded to OS6 did you upgrade straight to 6.4.x or did you put older firmware on first then update to newer OS6 firmware over time?
On this one only, I've been "climbing the ladder" starting at 6.1.something. I held out on the others till now.
Have you tried refreshing the page in the UI to get more up-to-date numbers?
Yes, several times.
Sandshark wrote:
So, I could have deleted the snapshots and see what happened as I did, but I decided to start another backup instead. Now will come the real test -- will the system delete snapshots and make the space available, as it should, or will it still say I'm out of space?
When volume usage exceeds 95% the oldest smart snapshots will be automatically deleted till volume usage falls back below 95%. It can take a while for the space to be freed up especially if you have a lot of snapshots. So you could put data on a lot quicker than it can be freed up.
Well, yes and no. Doing a backup job does not do that -- it stops and says the drive is full. I tried both the RSYNC and "Windows Timestamp" methods. Drag-and-drop from Windows did it the second time, and the totals matched after that. Not sure why it didn't the first time. Does it do that when the data with snapshots exceeds 95%, or without? Mine definitely exceeded 95% with snapshots but not without. Why the difference the second time? Maybe the size of the files being acted on; the first time was with large backup files and the second with small music files.
Have you tried refreshing the page in the UI to get more up-to-date numbers?
All numbers are from refreshed data.
Sandshark wrote:
If it's a bug, what can I do to help pin it down? If this is simply "the nature of the beast", it would be nice to characterize that nature so as not to get bitten too often.
Doesn't sound like a bug.
Something is amiss. I noticed with df -h in SSH that the RAID is being reported as /dev/md127 instead of /dev/md/data-0. Googling md127 tells me something is amiss in the way the RAID is assembled. I probably should have turned off snapshots when doing all of these back-and-forth transfers. They aren't really best for that environment (nor needed, since everything was temporary backups). I have no idea when that happened -- it could have been quite some time before I started all of this torture test. It's good that my plan is to ultimately do a factory default, and I'm almost there.
Can you send in your logs (see the Sending Logs link in my sig)?
I've cleared them several times because of all the stuff I'm doing. I've seen so many cases of full OS partitions here on the forum that I'm extra careful when doing unusual stuff. At this point, I think the misreporting is likely due to whatever caused the md127 mount. The only other thing that doesn't seem to work 100% is freeing space occupied by snapshots, especially when a backup job needs some of the snapshot space.
When I get done with all of these transfers and again have a NAS free for experimentation, I'll try some more of this under more controlled conditions.
- mdgm-ntgr · NETGEAR Employee Retired
Sandshark wrote:
On this one only, I've been "climbing the ladder" starting at 6.1.something. I held out on the others till now.
Sounds like you ran into a one-off snapshot upgrade issue when you updated to 6.2.x (or later). I can easily confirm this by looking at the logs.
Sandshark wrote:
Well, yes and no. Doing a backup job does not do that -- it stops and says the drive is full. I tried both the RSYNC and "Windows Timestamp" methods. Drag-and-drop from Windows did it the second time, and the totals matched after that. Not sure why it didn't the first time. Does it do that when the data with snapshots exceeds 95%, or without? Mine definitely exceeded 95% with snapshots but not without. Why the difference the second time? Maybe the size of the files being acted on; the first time was with large backup files and the second with small music files.
I did say "you could put data on a lot quicker than it can be freed up." If you have a lot of snapshots, data could be added much quicker than deleting snapshots frees it up, so you could run into the out-of-space problem due to needing to allocate more space to metadata or data but having none available, because the space is already fully allocated.
Sandshark wrote:
Something is amiss. I noticed with df -h in SSH that the RAID is being reported as /dev/md127 instead of /dev/md/data-0. Googling md127 tells me something is amiss in the way the RAID is assembled. I probably should have turned off snapshots when doing all of these back-and-forth transfers. They aren't really best for that environment (nor needed, since everything was temporary backups). I have no idea when that happened -- it could have been quite some time before I started all of this torture test. It's good that my plan is to ultimately do a factory default, and I'm almost there.
The RAID showing as /dev/md127 is not the problem.
Sandshark wrote:
I've cleared them several times because of all the stuff I'm doing.
That clearing only clears some of the logs. In any case, some of the logs included in the download are generated on demand, including the one I would look at to confirm you ran into the issue I suspect you have run into.
Sandshark wrote:
I've seen so many cases of full OS partitions here on the forum that I'm extra careful when doing unusual stuff.
Full OS partitions can be caused by doing unusual things via SSH or by misconfigured apps. With the way we manage the logs now, if you keep an eye on things the root partition usage shouldn't be a problem.
Sandshark wrote:
At this point, I think the misreporting is likely due to whatever caused the md127 mount.
It has nothing to do with that. We use md raid with BTRFS on top.
Sandshark wrote:
The only other thing that doesn't seem to work 100% is freeing space occupied by snapshots, especially due to a backup job needing some of the snapshot space.
That could be related to the snapshot upgrade issue, but again, space from deleting snapshots may take some time to free up. We allow a little time after deleting a snapshot for some space to be freed up before moving on to the next one, so if a lot of snapshots need to be deleted to free up space it could take quite some time. When a snapshot is deleted, all the newer ones need to be updated recursively, which may take quite a while if you have a huge number of them.