Forum Discussion
pwptech
Feb 23, 2021 · Tutor
ReadyNas 628x Mysterious Storage Consumption
I have an RN628x with 8x 2TB drives in RAID6. In total I have about 10.89TB of provisioned storage. Out of the main storage I have 3 main shares. I also have 1 thick provisioned LUN. I do not utilize...
pwptech
Feb 24, 2021 · Tutor
Thanks for the suggestions, StephenB
I have disabled then re-enabled quota. I also did Disk Balance and it completed successfully. The snapshots have been removed now but I'm still facing the same issue where about 2TB seems to be missing. I am showing 10.09TB for Data in the frontend.
I will also perform a scrub and defrag. If those don't work, I'm not sure what else could be done except rebuilding the array, which I really don't want to do.
rn_enthusiast
Feb 25, 2021 · Virtuoso
Hi pwptech
Would you mind grabbing the NAS log-set for me? Then I will have a look at it. On the web admin page, go to "System" > "Logs" > click "Download logs".
This should download a zip file containing all the logs. You can then upload that zip file to Google Drive, Dropbox or similar and make a link which I can use to download it. PM me this link - don't post it publicly here.
Thanks for the ping StephenB :)
Cheers
- rn_enthusiast · Feb 26, 2021 · Virtuoso
The volume is 10.89 TiB and the used space is 10.22TiB (93% full).
Label: '0a4357ec:data'  uuid: 3e5e6709-18d6-4dc0-b238-ba3e4cbaf8de
    Total devices 1 FS bytes used 10.22TiB
I don't see a reason not to trust this report from the filesystem. By default, snapshots are stored in an unlistable directory, so a plain "du" probably isn't going to be accurate - as you can see, it only showed 7.4TB. I can see you have 2 snapshots, but one of them is odd: it lives in /data/._share, which is a config directory.
ID 42109 gen 51022532 top level 260 path ._share/Backups/.snapshot/b_1614215108_6056
Path would be /data/._share/Backups/.snapshot/b_1614215108_6056
I wonder how this came about... Can you run these commands and show the output:
btrfs subv show /data/._share/Backups/.snapshot/b_1614215108_6056
btrfs subv list -s /data
btrfs qgroup show /data
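On the du point above: if you want per-share numbers that account for data shared with snapshots, something like the following is worth a try. This is just a sketch - it assumes the firmware's btrfs-progs is new enough to include "filesystem du" (added around btrfs-progs 4.6, if I remember right):
btrfs filesystem du -s /data/Backups /data/Software /data/VeeamBackups
It splits each share's usage into exclusive data and data shared with snapshots (the "Set shared" column), which a plain du of the share directories can't do.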
Side note: you have a disk that failed your disk test and needs to be replaced - disk 1. The kernel logs are also complaining about this disk. Replace it ASAP.
[21/02/24 09:19:14 MST] warning:volume:LOGMSG_DISKTEST_RESULT_FAIL_DISK Disk test failed on disk in channel 1, model WDC_WD20EFRX-68EUZN0, serial WD-WCC4M0KH0HDA.
Device: sda
Model: WDC WD20EFRX-68EUZN0
Serial: WD-WCC4M0KH0HDA
Firmware: 82.00A82W
Class: SATA
RPM: 5400
Sectors: 3907029168
Pool: data
PoolType: RAID 6
PoolState: 1
PoolHostId: a4357ec
Health data:
ATA Error Count: 0
Reallocated Sectors: 0
Reallocation Events: 0
Spin Retry Count: 0
Current Pending Sector Count: 126
Uncorrectable Sector Count: 0
Temperature: 32
Start/Stop Count: 20
Power-On Hours: 24275
Power Cycle Count: 20
Load Cycle Count: 1034
There are many of these errors in the kernel logs. It is not good to keep this disk in the NAS.
[Thu Feb 25 06:59:36 2021] sd 0:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 25 06:59:36 2021] sd 0:0:0:0: [sda] tag#6 Sense Key : Medium Error [current] [descriptor]
[Thu Feb 25 06:59:36 2021] sd 0:0:0:0: [sda] tag#6 Add. Sense: Unrecovered read error - auto reallocate failed
[Thu Feb 25 06:59:36 2021] sd 0:0:0:0: [sda] tag#6 CDB: Read(10) 28 00 e3 18 6c d0 00 02 28 00
[Thu Feb 25 06:59:36 2021] blk_update_request: I/O error, dev sda, sector 3810028752
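If you want to keep an eye on it from the shell while you arrange a replacement, something along these lines should work (a sketch - I'm assuming smartctl is available on the NAS, which I believe ReadyNAS OS ships as part of smartmontools):
smartctl -A /dev/sda | grep -Ei 'pending|realloc|uncorrect'
A Current Pending Sector count of 126, as in the disk info above, is exactly the kind of thing that produces those unrecovered read errors.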
- rn_enthusiast · Mar 01, 2021 · Virtuoso
Update on this one. I worked with pwptech in PM.
We can see the 3 shares/LUNs of interest that are taking up most of the space on the NAS:
ID 0/281 = 3.47TiB = /data/Backups
ID 0/5548 = 90.09GiB = /data/Software
ID 0/31827 = 6.65TiB = /data/VeeamBackups
"VeeamBackups" is the LUN. This LUN is configured to 4.5TB (thick) as confirmed via screenshots of the config and we can see this too, from the command-line:
root@NAS-01:~# ls -lahR /data/VeeamBackups/
/data/VeeamBackups/:
total 32K
drwxr-xr-x 1 root root  12 Mar 27  2020 .
drwxr-xr-x 1 root root 142 Feb 22 10:35 ..
drwxr-xr-x 1 root root  74 Mar 27  2020 .iscsi

/data/VeeamBackups/.iscsi:
total 4.6T
drwxr-xr-x 1 root root   74 Mar 27  2020 .
drwxr-xr-x 1 root root   12 Mar 27  2020 ..
-rw-r--r-- 1 root root 4.5T Feb 28 18:29 iscsi_lun_backing_store   <<<=====
-rw-r--r-- 1 root root   36 Mar 27  2020 .serial_number
Yet the filesystem is allocating about 2TB more than expected to house this LUN. That lines up with what pwptech reported in the first place: that 2TB went missing "out of the blue". I looked at the status history on the NAS and found that it was stable at around 20% space left, leading up to the episode:
[21/01/16 11:50:44 MST] warning:volume:LOGMSG_VOLUME_USAGE_WARNING Less than 20% of volume data's capacity is free. Performance on volume data will degrade if additional capacity is consumed. NETGEAR recommends that you add capacity to avoid performance degradation.
[21/01/31 11:06:42 MST] warning:volume:LOGMSG_VOLUME_USAGE_WARNING Less than 20% of volume data's capacity is free. Performance on volume data will degrade if additional capacity is consumed. NETGEAR recommends that you add capacity to avoid performance degradation.
[21/02/02 09:32:22 MST] warning:volume:LOGMSG_VOLUME_USAGE_WARNING Less than 20% of volume data's capacity is free. Performance on volume data will degrade if additional capacity is consumed. NETGEAR recommends that you add capacity to avoid performance degradation.
[21/02/12 13:05:30 MST] warning:volume:LOGMSG_VOLUME_USAGE_WARNING Less than 20% of volume data's capacity is free. Performance on volume data will degrade if additional capacity is consumed. NETGEAR recommends that you add capacity to avoid performance degradation.
Then suddenly it dropped to less than 5% (essentially, the NAS filled up):
[21/02/20 16:47:25 MST] warning:volume:LOGMSG_VOLUME_USAGE_CRITICAL Less than 5% of volume data's capacity is free. data's performance is degraded and you risk running out of usable space. To improve performance and stability, you must add capacity or make free space.
What preceded this was a defragmentation on the NAS - which would also defrag the LUN file (iscsi_lun_backing_store):
[21/02/13 05:11:04 MST] notice:volume:LOGMSG_DEFRAGSTART_VOLUME Defragmentation started for volume data.
[21/02/13 13:16:14 MST] notice:volume:LOGMSG_DEFRAGEND_VOLUME Defragmentation complete for volume data.
This is the only time a defrag was ever run on the NAS, and a few days later the NAS reported an out-of-space condition. Keep in mind that these space warnings are not logged continuously, so it is quite possible the space issue actually hit right when the LUN was defragged, or very shortly thereafter.

So, the filesystem is using 6.65TiB to house a thick LUN of 4.5TB. This reminds me of an issue other ReadyNAS users reported, where defragging a LUN would balloon the space utilization on the NAS. mdgm might remember more about this, but we saw several reports of it. It has something to do with the extents backing the LUN increasing/breaking after the defrag - presumably because btrfs keeps an old extent fully allocated for as long as any part of it is still referenced, so a LUN file that gets random overwrites on top of large defragged extents can pin a lot of dead space. It is likely more of a BTRFS issue than a NAS issue per se. I very much suspect exactly the same thing happened here.

The only possible remedy I can think of is to defrag the LUN again, using a smaller target extent size:
btrfs fi defragment -t 8192 -v /data/VeeamBackups/.iscsi/iscsi_lun_backing_store
Or an even smaller 4K (4096) size could help more. It is not guaranteed to work, but it is probably worth a shot, I think. mdgm could have some insight here too, if he remembers these issues - but it has been a couple of years now :)
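If you do try it, you can gauge what the defrag does to the backing store by comparing its extent count and the volume usage before and after. A rough sketch (assuming filefrag from e2fsprogs is present on the NAS, and noting it can take a while on a 4.5TB file):
filefrag /data/VeeamBackups/.iscsi/iscsi_lun_backing_store
It prints a single "N extents found" line; comparing that number and the frontend's used-space figure before and after the re-defrag gives some visibility into whether the smaller target extent size is helping.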
Cheers
- mdgm · Mar 01, 2021 · Virtuoso
I don't think I remember that issue though it has been a while.
Another thing worth noting is that defragmentation breaks the CoW link between snapshots. If there were ever snapshots of the LUN, that could be related to the problem. Edit: oops - if there were snapshots, they have been deleted, so this is probably not relevant, I guess.
The snapshot with a name that started with a b would have been from a backup job. I think it was probably meant to be deleted when the backup job finished but for some reason it wasn’t. Perhaps there was a power failure at some point in the middle of a backup job or something like that.
Further edit: 1614215108 is epoch time for 25 February at about 1am. It could just be that the logs were downloaded whilst the relevant backup job was running.
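For anyone who wants to check that conversion themselves, GNU date does it in one line (this should be available on the NAS shell as well):
date -u -d @1614215108
which prints Thu Feb 25 01:05:08 UTC 2021 - and since the NAS log timestamps above are in MST, that corresponds to the evening of 24 February local time.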