

alaeth
Aspirant

Missing volume after hard reboot

 

After losing access (no web, ssh, mounts, anything) to my ReadyNAS Pro 6 running 6.7.4 firmware, I decided to hard-reboot it by holding down the power button until it powered off. After booting back up, I was confronted with no data and two volumes (zero bytes each).

 

A quick search turned up this post:

https://community.netgear.com/t5/Using-your-ReadyNAS/Missing-volume-after-power-failure/td-p/1246361

 

However, the troubleshooting steps in it do not match my experience (my /mnt/data/ is empty)...

 

Any help would be greatly appreciated.

 

I've uploaded the logs from today here:

https://drive.google.com/drive/folders/0BwD0UWci-5MxOU5Sc0lNS3IzWWM?usp=sharing

 

 

Model: RNDP6000-200 | ReadyNAS® Pro 6 | EOL
Message 1 of 9


All Replies
StephenB
Guru

Re: Missing volume after hard reboot

It looks like there's some file corruption. Per kernel.log:

Jul 12 20:27:09 readynas01 kernel: BTRFS: device label 33ea999f:data devid 1 transid 1431232 /dev/md127
Jul 12 20:27:09 readynas01 kernel: BTRFS info (device md127): has skinny extents
Jul 12 20:27:10 readynas01 kernel: BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): failed to read block groups: -5
Jul 12 20:27:10 readynas01 kernel: BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): failed to read block groups: -5
Jul 12 20:27:10 readynas01 kernel: BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): failed to read block groups: -5
Jul 12 20:27:10 readynas01 kernel: BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): failed to read block groups: -5
Jul 12 20:27:10 readynas01 kernel: BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): failed to read block groups: -5

Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): open_ctree failed

 

@jak0lantash might have some suggestions on next steps.

Message 2 of 9
alaeth
Aspirant

Re: Missing volume after hard reboot

That seems... not good.  😞

 

I bought a QNAP and am trying to get data restored manually for now. Luckily I have online backups of the critical stuff (10+ years of photography shoots and proofs).

 

Regardless of the outcome, I think I'll roll back the firmware to the "officially" supported 4.x line. Ever since upgrading to the unsupported 6.x, I've noticed it locks up fairly often - I even set up a cron job to reboot it nightly to try to reduce the impact.
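For reference, the nightly reboot is just a root crontab entry along these lines (the exact time on mine may differ):

# root crontab (crontab -e) - reboot every night at 03:00
0 3 * * * /sbin/shutdown -r now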

 

I know any OS dislikes hard resets... and it's been restarted this way a few times in the past couple of months.

 

My wife thinks it's because she gave it the middle finger before powering it off... 

Message 3 of 9
jak0lantash
Mentor

Re: Missing volume after hard reboot

I'm taking a look at the logs now, but you should remove them from Google Drive - your serial number is in there.

Message 4 of 9
jak0lantash
Mentor

Re: Missing volume after hard reboot

You have 5 good drives (WD30EFRX); the 6th is a Seagate Desktop drive (ST3000DM001), which is a shame.

It looks like this drive has shown a failure rate higher than normal:

https://www.backblaze.com/blog/3tb-hard-drive-failure/

https://superuser.com/questions/1037228/if-someone-purchased-a-seagate-st3000dm001-3tb-hard-drive-an...

https://www.extremetech.com/extreme/222267-seagate-faces-lawsuit-over-3tb-hard-drive-failure-rates

If it were my NAS, I would replace it immediately. It doesn't show any errors, but it has 25,000 hours on it and a terrible reputation.
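If you do keep it in service for a while, it's worth watching its SMART attributes - something like this (the /dev/sdf below is only an example; check which device node the Seagate actually is first):

# example device node only - confirm which /dev/sdX is the ST3000DM001 first
smartctl -A /dev/sdf | grep -E 'Power_On_Hours|Reallocated_Sector|Current_Pending|Offline_Uncorrectable'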

 

Based on the logs:

You have daily snapshots on 8 shares (applications, backup, Documents, istat, Music, Pictures, Transmission, Videos).

No balance or defrag has been run since the volume was created on 2017/01/04.

I cannot see the metadata allocation because the data volume isn't mounted.

The logs don't give any details about the shares' configuration regarding Bit Rot Protection (which implies Copy-on-Write) and Compression - can you tell us?

My guess is that the metadata allocation is high and the data fragmented, which may lead to crashes. But the issue could be completely unrelated - it's not something I can explain from these logs. Unfortunately, the dmesg logs are flushed after a reboot. It's possible to set up something like netconsole to push the kernel logs to another machine (much like remote syslog), so you'd have the messages from just before the crash.
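A rough netconsole sketch, in case it helps (the IPs, interface name and MAC below are placeholders - substitute your own):

# on the NAS - stream kernel messages over UDP to another machine
# format: netconsole=<src-port>@<src-ip>/<interface>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@192.168.1.5/eth0,6666@192.168.1.10/00:11:22:33:44:55

# on the receiving machine (192.168.1.10 here) - listen on UDP 6666 and keep a copy
# (some netcat builds want "nc -u -l -p 6666" instead)
nc -u -l 6666 | tee -a nas-kernel.log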

 

Not sure why, but you have a cron job that captures the last status of processes and the load every 5 minutes. Maybe an attempt to debug the lock-ups?

The LAN interface seems to be connected to a Fast Ethernet network (100Mbps), which is odd.

It would be interesting to read the loadavg.log file, but I don't know how to. (@mdgm maybe?)

 

The NAS was power-cycled twice recently:

- It seems that the NAS was power-cycled on Jun 18 15:01 (the logs start there, so I can't tell when it hung).

- It seems that the NAS crashed shortly after Jun 18 18:25 and was power-cycled on Jun 19 20:28. There didn't seem to be much going on before the crash, in terms of processes anyway.

Power-cycles clearly don't help, but I understand you may not have had a choice.

The NAS then remained shut down for nearly a month. When it last booted, the volume was unmountable.

 

That's when it failed to mount the data volume.

Jul 12 20:27:10 readynas01 kernel: BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): failed to read block groups: -5
Jul 12 20:27:10 readynas01 kernel: BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): failed to read block groups: -5
Jul 12 20:27:10 readynas01 kernel: BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): failed to read block groups: -5
Jul 12 20:27:10 readynas01 kernel: BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): failed to read block groups: -5
Jul 12 20:27:10 readynas01 kernel: BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): failed to read block groups: -5
Jul 12 20:27:10 readynas01 kernel: BTRFS error (device md127): open_ctree failed
Jul 12 20:27:10 readynas01 mount[1465]: mount: wrong fs type, bad option, bad superblock on /dev/md127,
Jul 12 20:27:10 readynas01 mount[1465]:        missing codepage or helper program, or other error
Jul 12 20:27:10 readynas01 mount[1465]:        In some cases useful info is found in syslog - try
Jul 12 20:27:10 readynas01 mount[1465]:        dmesg | tail or so.
Jul 12 20:27:10 readynas01 systemd[1]: data.mount: Mount process exited, code=exited status=32
Jul 12 20:27:10 readynas01 systemd[1]: Failed to mount /data.

 

That's when you usually see many reboot attempts in the logs as panic grows, but it doesn't look like you tried to reboot the NAS at all.

 

I would change the fstab from:

LABEL=33ea999f:data /data btrfs defaults 0 0

to:

LABEL=33ea999f:data /data btrfs defaults,ro,recovery 0 0

and try to reboot the NAS gracefully, either from the GUI or with rn_shutdown -r
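In practice that's just something like this (back up fstab first and double-check the line before rebooting):

cp /etc/fstab /etc/fstab.bak      # keep a copy of the original
vi /etc/fstab                     # change the /data line as shown above
grep /data /etc/fstab             # verify it now reads defaults,ro,recovery
rn_shutdown -r                    # graceful reboot (or reboot from the GUI)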

 

After reboot, if the data volume mounts OK, update your backups immediately.

Please then give the output of this command:

btrfs fi us /data

After your backups are complete (AND inspected!!!), you'll have to recreate the volume and reimport the data.

Message 5 of 9
alaeth
Aspirant

Re: Missing volume after hard reboot

Thanks for the write-up.  I'll post details from the NAS once I'm home and have tried your suggestions.

 

Good point on the logs - I've disabled sharing.

 

Agreed on the Seagate... it was my first 3TB purchase.

 

I think copy-on-write and compression are enabled...? Not 100% sure if those are the defaults with 6.x.

 

You are correct, the 5-minute cron job was an attempt to narrow down the cause of the crashes. Good news is I have Splunk Universal Forwarder installed and configured. Everything from /var/log/ _should_ be captured on my Windows desktop Splunk server (shameless plug: Splunk is 100% free if your data volume is less than 500MB/day - super awesome for troubleshooting faults like this, as you can correlate timestamped events across multiple files).
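The forwarder side is just a monitor stanza roughly like this (the index name is whatever you've created on the Splunk server - mine may differ):

# $SPLUNK_HOME/etc/system/local/inputs.conf on the NAS
[monitor:///var/log]
disabled = false
index = main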

 

100Mbit LAN is correct - I moved the NAS from the basement to my office upstairs, and the switch there is only 100Mbit.

 

Once I realized the data volume was gone (about a month ago), I did some simple troubleshooting, then decided to power it down (using the web interface) until I could spend more time with it. After discussing it with my spouse, we decided the age of it and the ongoing instability warranted a full NAS replacement. It remained off until the new one (a QNAP 671) arrived and I could mount drives in it.

 

I'll post again tonight once I've tried the fstab settings, along with the output from the btrfs command. Any other logs you'd like to see? I'll check if they're indexed in Splunk.

Message 6 of 9
alaeth
Aspirant

Re: Missing volume after hard reboot

Edits to fstab don't seem to have done anything... /data is still not mounted. Here's the contents:

$cat /etc/fstab
LABEL=33ea999f:data /data btrfs defaults,ro,recovery 0 0

 

Running btrfs fi us /data results in:

Overall:
    Device size:                   4.00GiB
    Device allocated:              3.98GiB
    Device unallocated:           12.00MiB
    Device missing:                  0.00B
    Used:                          1.42GiB
    Free (estimated):              2.20GiB      (min: 2.20GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:               16.00MiB      (used: 0.00B)

Data,single: Size:3.57GiB, Used:1.37GiB
   /dev/md0        3.57GiB

Metadata,DUP: Size:204.56MiB, Used:26.52MiB
   /dev/md0      409.12MiB

System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/md0       16.00MiB

Unallocated:
   /dev/md0       12.00MiB

 

Message 7 of 9
alaeth
Aspirant

Re: Missing volume after hard reboot

Tried mounting from the command line (at this point I'm a bit like a bull in a china shop... barely understanding the commands I'm Googling)...

 

sudo mount -r --source LABEL=33ea999f:data /data
mount: wrong fs type, bad option, bad superblock on /dev/md127,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

 

Tailing the dmesg...

dmesg | tail
[ 1591.911995] BTRFS error (device md127): failed to read block groups: -5
[ 1591.912462] BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
[ 1591.912478] BTRFS error (device md127): failed to read block groups: -5
[ 1591.914185] BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
[ 1591.914200] BTRFS error (device md127): failed to read block groups: -5
[ 1591.916306] BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
[ 1591.916320] BTRFS error (device md127): failed to read block groups: -5
[ 1591.939468] BTRFS critical (device md127): corrupt leaf, slot offset bad: block=2291416563712, root=1, slot=77
[ 1591.939484] BTRFS error (device md127): failed to read block groups: -5
[ 1591.954033] BTRFS error (device md127): open_ctree failed
Message 8 of 9
alaeth
Aspirant

Re: Missing volume after hard reboot

I'm going to consider the data lost, and rebuild/factory reset back down to 4.2 for stability.

I think this is the answer without getting into costly data recovery services, or massive time investment.

Message 9 of 9