Trying to understand how to identify which raid volume a file is on within ReadyNAS btrfs.
Hi,
I'm trying to understand a corruption issue I've run into before I take any measures to fix.
I'm hitting the standard "BTRFS error (device md125): parent transid verify failed" error, which in this forum and elsewhere usually draws the answer "just restore from backup"...
Well, before I restore, I want to understand whether there is any chance of recovering one of the corrupted files. I created a file recently, and before I had a chance to back it up this error seems to have corrupted its data.
Some background: this error occurred after the ReadyNAS had been up for 128 days without a reboot. There was no power loss; it's on a UPS. But curiously, only about a minute after a defrag completed, the volume went read-only with the error message:
warning:volume:LOGMSG_VOLUME_READONLY The volume data encountered an error and was made read-only. It is recommended to backup your data.
Here is what I'm trying to understand.
The corrupted file is called "MoveNewMusic.pl"; a stat of that file shows:
# stat MoveNewMusic.pl
  File: 'MoveNewMusic.pl'
  Size: 7610        Blocks: 16         IO Block: 4096   regular file
Device: ebh/235d    Inode: 1751        Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-11-11 10:06:26.312009653 -0700
Modify: 2020-11-11 22:45:56.855610574 -0700
Change: 2020-11-11 22:45:56.955601949 -0700
 Birth: -

# stat -f MoveNewMusic.pl
  File: "MoveNewMusic.pl"
    ID: 9b15acc4df0dcccb Namelen: 255     Type: btrfs
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 12689899104 Free: 4142678422 Available: 4140347222
Inodes: Total: 0           Free: 0
I'm not sure how the Device field, "eb" in this case, relates to any of the /dev/md* devices I have, whether I read it as a major device number of e (14) and a minor device number of b (11) or some other way.
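Decoding it in the shell suggests the split is by byte rather than by hex digit (a quick sketch, assuming GNU stat and the classic encoding with the major number in the high byte and the minor in the low byte):

# Decode the hex device number reported by stat
dev=$((16#$(stat -c '%D' MoveNewMusic.pl)))   # "eb" -> 235
echo "major=$((dev >> 8)) minor=$((dev & 0xff))"
# prints: major=0 minor=235

A major number of 0 is an anonymous device; btrfs assigns one per subvolume, which would explain why it matches none of the /dev/md* nodes.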
A df of that file shows a Filesystem of "-", which I'm not sure is normal, since a df of a good file shows a filesystem of /dev/md127. See below:
# df MoveNewMusic.pl
Filesystem     1K-blocks        Used   Available Use% Mounted on
-            50759596416 34188882728 16561388888  68% /run/nfs4/home/cthierman

# df version.txt
Filesystem     1K-blocks        Used   Available Use% Mounted on
/dev/md127   50759596416 34188882728 16561388888  68% /run/nfs4/home
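Two commands that may say more than df here, which prints "-" when a path's device doesn't match any entry in the mount table (a sketch, assuming util-linux's findmnt and btrfs-progs are on the box):

# Ask the mount table which filesystem the path belongs to
findmnt -T /run/nfs4/home/cthierman

# If the path is inside a btrfs subvolume, show its details
btrfs subvolume show /run/nfs4/home/cthierman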
dmesg is showing that I have errors on /dev/md125, but how can I confirm, from the above info, that /dev/md125 is where this file is (or was) sitting? Is it possible that I have corruption on one of the other two devices, /dev/md127 and /dev/md126, for which no errors or warnings appear in dmesg?
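One way I can think of to check this (a sketch, assuming filefrag from e2fsprogs and a reasonably recent btrfs-progs): filefrag reports a btrfs file's extents as offsets in the filesystem-wide logical address space, and the chunk tree records which member device (devid) each logical range maps to.

# 1. List the file's extents; on btrfs the "physical" offsets are
#    addresses in the filesystem-wide logical address space
filefrag -v MoveNewMusic.pl

# 2. Dump the chunk tree from any member device and find the
#    CHUNK_ITEM whose logical range covers those offsets; its
#    stripe entries name a devid
btrfs inspect-internal dump-tree -t chunk /dev/md127 | less

# 3. Map that devid back to a member device
btrfs filesystem show /data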
Here is a sample of the dmesg -T output:
[Mon Dec 7 01:51:28 2020] BTRFS error (device md125): parent transid verify failed on 1087635456 wanted 3045699 found 2651197
[Mon Dec 7 01:51:28 2020] BTRFS error (device md125): parent transid verify failed on 1087635456 wanted 3045699 found 2651197
[Mon Dec 7 01:51:29 2020] BTRFS error (device md125): parent transid verify failed on 1087635456 wanted 3045699 found 2651197
[Mon Dec 7 01:51:29 2020] __btrfs_lookup_bio_sums: 3955 callbacks suppressed
[Mon Dec 7 01:51:29 2020] BTRFS info (device md125): no csum found for inode 8866 start 14287339520
[Mon Dec 7 01:51:29 2020] BTRFS info (device md125): no csum found for inode 8866 start 14287343616
And here is a cat of /proc/mdstat, and yes, I'm presently running a scrub to see if that will fix anything.
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md125 : active raid1 sdd5[0] sdf5[1]
      1952316416 blocks super 1.2 [2/2] [UU]
      [==================>..]  resync = 92.2% (1801338496/1952316416) finish=82.2min speed=30574K/sec

md126 : active raid5 sda4[0] sdf4[5] sde4[4] sdd4[3] sdc4[2] sdb4[1]
      34180206080 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]

md127 : active raid5 sdc3[8] sde3[10] sdf3[11] sdd3[9] sdb3[7] sda3[6]
      14627073920 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]

md1 : active raid10 sda2[0] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1]
      1566720 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]

md0 : active raid1 sda1[6] sde1[10] sdf1[11] sdd1[9] sdc1[8] sdb1[7]
      4190208 blocks super 1.2 [6/6] [UUUUUU]
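Worth noting: the resync shown for md125 above happens at the md layer and is separate from a btrfs scrub; md knows nothing about btrfs checksums. If a btrfs-level scrub is running, its progress should show with:

# btrfs-level scrub progress (distinct from the md resync above)
btrfs scrub status /data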
My hope is that, if I can understand how the device relates to a btrfs filesystem/RAID volume, maybe I can figure out whether I have a chance to recover this file using some of the methods suggested in other posts. I.e., zero the log on (heaven forbid) /dev/md125, run a btrfs check... or maybe just reboot and cross my fingers...
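For reference, my understanding of the safe ordering for those steps (a sketch, not a recommendation; btrfs check wants the filesystem unmounted, and its default mode is read-only):

# Read-only consistency check; needs the filesystem unmounted.
# Any member device addresses the whole multi-device filesystem.
btrfs check /dev/md127

# Destructive last resorts -- only after everything is copied off:
# btrfs rescue zero-log /dev/md127
# btrfs check --repair /dev/md127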
Hoping someone well versed in the ins and outs of the ReadyNAS can fill me in on what I'm missing.
Thanks
Re: Trying to understand how to identify which raid volume a file is on within ReadyNAS btrfs.
Maybe start with something simpler:
# btrfs device stats /data
(substituting your actual volume name if it isn't data)
Re: Trying to understand how to identify which raid volume a file is on within ReadyNAS btrfs.
Sorry, I should have mentioned I tried that... and this was the output. I'm still hoping there is a way to find where that file ended up and whether I have a chance of recovering it.
# btrfs device stats /data
[/dev/md127].write_io_errs    0
[/dev/md127].read_io_errs     0
[/dev/md127].flush_io_errs    0
[/dev/md127].corruption_errs  0
[/dev/md127].generation_errs  0
[/dev/md126].write_io_errs    0
[/dev/md126].read_io_errs     0
[/dev/md126].flush_io_errs    0
[/dev/md126].corruption_errs  0
[/dev/md126].generation_errs  0
[/dev/md125].write_io_errs    0
[/dev/md125].read_io_errs     0
[/dev/md125].flush_io_errs    0
[/dev/md125].corruption_errs  0
[/dev/md125].generation_errs  0
Re: Trying to understand how to identify which raid volume a file is on within ReadyNAS btrfs.
I should probably also add that this Perl script I wrote, which is normally all text, is now just a file full of hex 01s, as seen here:
# od -x MoveNewMusic.pl
0000000 0101 0101 0101 0101 0101 0101 0101 0101
*
0016660 0101 0101 0101 0101 0101
0016672
That is not something you wish for from your NAS. So by understanding the problem, I'm hoping to learn to what extent this has affected the rest of my data...
P.S. The scrub did nothing to fix the problem... I'm reluctant to reboot until I have all the data moved to my new Synology NAS, where I've gone with ext4 instead of btrfs.
Sadly, without understanding what caused this, I have lost all confidence in the ReadyNAS platform.
Re: Trying to understand how to identify which raid volume a file is on within ReadyNAS btrfs.
Were checksums turned on for the volume?
Maybe after the data is copied, you can try btrfs check.
Re: Trying to understand how to identify which raid volume a file is on within ReadyNAS btrfs.
@StephenB wrote:
Were checksums turned on for the volume?
Maybe after the data is copied, you can try btrfs check.
Now that is an interesting question. For /data, yes. But the /home directory (volume) was created by ReadyNAS when it was imaged, so I assume the answer there would also be yes, as /home appears to sit on /dev/md127, the same device as /data. However, the subdirectory my script sits in shows up oddly, with a "-" for a Filesystem, when you do a df on it.
# df /home
Filesystem     1K-blocks        Used   Available Use% Mounted on
/dev/md127   50759596416 34188882728 16561388888  68% /home

# df /home/cthierman
Filesystem     1K-blocks        Used   Available Use% Mounted on
-            50759596416 34188882728 16561388888  68% /home/cthierman
So, I'm not really sure what the answer is to your question. Other than, I hope so....
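One thing I can check per file (a sketch, assuming lsattr is available; as far as I know the ReadyNAS bitrot-protection toggle maps to the per-share COW setting, and NOCOW files carry no data checksums):

# A 'C' in the attribute column means NOCOW, i.e. no data checksums
lsattr MoveNewMusic.pl

# Check whether the volume itself is mounted without checksumming
grep -E 'nodatacow|nodatasum' /proc/mounts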
As to the btrfs check: yes, that is the plan, after I get all the data moved... Sadly, it's many terabytes' worth... It will probably take some time.
Re: Trying to understand how to identify which raid volume a file is on within ReadyNAS btrfs.
@thierman wrote:
Sadly, many Terabytes worth.... Probably take some time.
I saw about 320 GiB an hour when I copied about 4.5 TiB from my main NAS to a Pro-6. That was using rsync (a backup job running on the destination NAS).
Re: Trying to understand how to identify which raid volume a file is on within ReadyNAS btrfs.
Yeah, interestingly, I have two copies running. One I started on Dec 6th at 11:53am my time, and it is presently Dec 8th 10:16am.
So that one has been running for two days, to a hard drive directly attached over USB to the ReadyNAS 316.
That copy has moved 992GB of data in that time.
Meanwhile, I started an rsync less than 24 hours ago across a 1Gb/s network to the Synology, and it has already transferred 1.4TB of data.
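For the record, the network copy is just a plain rsync over SSH, along these lines (the hostname and paths here are placeholders):

# -a preserves permissions/times, -H keeps hard links,
# --info=progress2 (rsync 3.1+) prints overall transfer progress
rsync -aH --info=progress2 /data/ admin@synology:/volume1/backup/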
I'm beginning to think that USB port on the ReadyNAS 316 is USB 1.0, or the drive I have is stupidly slow.
I also have a ReadyNAS Pro Pioneer, and I managed to rsync data to it faster than over USB... which is saying something, because the Pro isn't the fastest NAS...
Re: Trying to understand how to identify which raid volume a file is on within ReadyNAS btrfs.
@thierman wrote:
I'm beginning to think that USB port on the ReadyNas 316 is a version 1.0 USB.
The front port is USB 2.0. The rear ports are USB 3.0.
I am wondering if you are writing to an SMR drive. That can be very slow for sustained writes.
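If smartmontools is on the NAS, you can grab the drive's model number and check it against SMR drive lists (a sketch; /dev/sdX is a placeholder for however the USB drive enumerates):

# Print the drive's model/serial (some USB bridges need '-d sat')
smartctl -i /dev/sdX

# Or, without smartmontools:
lsblk -o NAME,MODEL,SIZE,TRAN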
Re: Trying to understand how to identify which raid volume a file is on within ReadyNAS btrfs.
It's a 3TB Hitachi drive... I suspect you're right. I'm not certain, but it looks like Hitachi was using SMR around the time I bought it, though from what I can see mainly on 4TB and larger drives. Still, I'm going to assume my 3TB drive is SMR as well.
An update on what I've found out...
I put together a table of the various settings I was using for each volume, as I found some volumes were unaffected while others had lots of corrupted files. Having no idea what caused the corruption, I wrote a script to find and identify the corrupt files. They all share the same signature: every byte has been replaced with hex 0x01 (all bits zero except the right-most). So, sampling the first 100 bytes, the script flags any file where every one of those bytes is 0x01.
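In outline, the check amounts to this (a bash sketch assuming GNU head, od, and find):

#!/bin/bash
# Flag any file whose first 100 bytes are all 0x01 as corrupt.
pattern=$(printf '01%.0s' $(seq 1 100))   # "0101...01", 100 bytes in hex
find /data -type f -size +99c -print0 |
while IFS= read -r -d '' f; do
    sig=$(head -c 100 -- "$f" | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "$pattern" ] && printf 'CORRUPT: %s\n' "$f"
done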
Here is the table:
What I first noticed is that the only volumes with corrupt files are the volumes where bitrot protection is turned on. This is an interesting find: for all that btrfs builds on COW as a solution to bitrot, in my case at least it appears to have been the single biggest contributing factor to a complete loss of data. I would advise anyone running btrfs to turn off copy-on-write (a.k.a. bitrot protection) ASAP. I ran for many years without an issue; then one night at 4am, wham! The volume went read-only, with huge swaths of corrupted files that I suspect can only be recovered from a backup.
I would love to know how to see where the inodes for these files sit, to check whether there is any correlation to a particular drive.
Right now, all I know is that a defrag had finished not much more than a minute before the ReadyNAS reported problems...
Perhaps a warning for others... And a question for those who have also seen their ReadyNAS corrupted: was bitrot protection on for the corrupted volumes?