Forum Discussion

Aspirant

May 04, 2021

rndp6000 inactive volume data

Hello there today I suddenly lost access to files over the network, and the NAS was totally freezed. so i force rebooted the system and opened admin console. there was message that volume data is in...

Feature Request & Feedback

Archdivan

Aspirant

May 06, 2021

Stephen, great thanks for replying!

at the moment I have tried two ways to check disks: using smartctl command via ssh and disk chesk from the boot menu. SMART looks normal with no obvious critical errors. If it necessary, I can send a report. Also i found some errors in dmesg.log

[Thu May 6 03:47:20 2021] RAID conf printout:
[Thu May 6 03:47:20 2021] --- level:5 rd:5 wd:5
[Thu May 6 03:47:20 2021] disk 0, o:1, dev:sda3
[Thu May 6 03:47:20 2021] disk 1, o:1, dev:sdb3
[Thu May 6 03:47:20 2021] disk 2, o:1, dev:sdc3
[Thu May 6 03:47:20 2021] disk 3, o:1, dev:sdd3
[Thu May 6 03:47:20 2021] disk 4, o:1, dev:sde3
[Thu May 6 03:47:20 2021] md126: detected capacity change from 0 to 1980566863872
[Thu May 6 03:47:20 2021] BTRFS: device label 33ea3ec3:data devid 1 transid 1021063 /dev/md126
[Thu May 6 03:47:21 2021] BTRFS info (device md126): has skinny extents
[Thu May 6 03:47:39 2021] BTRFS critical (device md126): corrupt leaf, slot offset bad: block=2977215774720, root=1, slot=131
[Thu May 6 03:47:39 2021] BTRFS critical (device md126): corrupt leaf, slot offset bad: block=2977215774720, root=1, slot=131
[Thu May 6 03:47:41 2021] BTRFS critical (device md126): corrupt leaf, slot offset bad: block=2977215774720, root=1, slot=131
[Thu May 6 03:47:41 2021] BTRFS critical (device md126): corrupt leaf, slot offset bad: block=2977215774720, root=1, slot=131
[Thu May 6 03:47:41 2021] BTRFS warning (device md126): Skipping commit of aborted transaction.
[Thu May 6 03:47:41 2021] BTRFS: error (device md126) in cleanup_transaction:1864: errno=-5 IO failure
[Thu May 6 03:47:41 2021] BTRFS info (device md126): delayed_refs has NO entry
[Thu May 6 03:47:41 2021] BTRFS: error (device md126) in btrfs_replay_log:2436: errno=-5 IO failure (Failed to recover log tree)
[Thu May 6 03:47:41 2021] BTRFS error (device md126): cleaner transaction attach returned -30
[Thu May 6 03:47:41 2021] BTRFS error (device md126): open_ctree failed
[Thu May 6 03:47:42 2021] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[Thu May 6 03:47:42 2021] NFSD: starting 90-second grace period (net ffffffff88d70240)
[Thu May 6 03:48:01 2021] nfsd: last server has exited, flushing export cache
[Thu May 6 03:48:01 2021] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[Thu May 6 03:48:01 2021] NFSD: starting 90-second grace period (net ffffffff88d70240)
[Thu May 6 05:38:06 2021] RPC: fragment too large: 50331695
[Thu May 6 08:13:35 2021] RPC: fragment too large: 50331695

It looks like corruption in filesystem.

Some googling around tells me that it can be repaired with btrfs but i exactly don't understand how to do this.

StephenB

Guru - Experienced User

May 06, 2021

Archdivan wrote:

[Thu May 6 03:47:41 2021] BTRFS: error (device md126) in cleanup_transaction:1864: errno=-5 IO failure

[Thu May 6 03:47:41 2021] BTRFS: error (device md126) in btrfs_replay_log:2436: errno=-5 IO failure (Failed to recover log tree)

It looks like corruption in filesystem.

It does look like corruption, but I suggest first looking in system.log and kernel.log for more info on these two particular errors. They look disk-related, and those two logs might give you more info on what disk caused them. Once you know the disk, you could attempt to boot the system up w/o it, and see if the volume mounts.

BTW, if it does mount, then the next step would be to off-load data (at least the most critical stuff).

Archdivan wrote:

Some googling around tells me that it can be repaired with btrfs but i exactly don't understand how to do this.

Though I've done some reading here, it's not something I've needed to do myself. rn_enthusiast has sometimes offered to review logs and offer tailored advice on this. So maybe wait and see if he will chime in.

rn_enthusiast
Virtuoso
May 06, 2021
It looks like a failure to read the filesystem journal.
[Thu May 6 03:47:41 2021] BTRFS: error (device md126) in btrfs_replay_log:2436: errno=-5 IO failure (Failed to recover log tree)
[Thu May 6 03:47:41 2021] BTRFS error (device md126): cleaner transaction attach returned -30
[Thu May 6 03:47:41 2021] BTRFS error (device md126): open_ctree failed
Given the symptoms of the NAS freezing then seeing the journal (tree-log) preventing the mount on next boot, I would expect that a btrfs zero-log will likely fix it.
btrfs rescue zero-log /dev/md126

Then reboot the NAS

But the thing is, it is an estimation of what the problem is. It LOOKS like a bad journal more than anything but it could potentially be other things too. Clearing the journal is only good in cases where the journal is actually the issue :). It does look like, that the journal is the problem here but there are people out there who knows a lot more about BTRFS than me. If you care about the data or don't have a backup, you might seek advise from the BTRFS community first. You can use the BTRFS mailing list:
https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list

They would expect you know some Linux at least and please look at the section about which info to get for them (in the above link).

If you just want to "try something", then zero-log would probably be the way to go, though would be at your own descresion.

Cheers
- StephenB
  Guru - Experienced User
  May 06, 2021
  rn_enthusiast wrote:
  
  Given the symptoms of the NAS freezing then seeing the journal (tree-log) preventing the mount on next boot, I would expect that a btrfs zero-log will likely fix it.
  
  I am wondering if the I/O error was simply due to a bad pointer in the journal, or if there was a disk error when reading the data structure.
  - rn_enthusiast
    Virtuoso
    May 07, 2021
    Exactly, I think it is a bad pointer in the tree log yea. Though you are right, in typical circumstances that can be associated with bad hardware. It would just be some coincidence to have bad HW + a bad journal :) My feeling is that this is tree log issues here. Had a look around the web as well, I see others in exact same situation and it was journal problems for them too. For example here:
    https://unix.stackexchange.com/questions/369520/btrfs-super-recover-says-all-superblocks-are-good-but-mount-disagrees
    
    I doesn't seem uncommon to experience the bad/corrupt leaf error at the same time.
- Archdivan
  Aspirant
  May 08, 2021
  Wow. It works. Now i have mounted read-only volume. At least i can make a backup of my data before next experiments. I'm not a linux user, and your advice really saved me. So next sten i should find hdd causes an error, remove it and totally rebuild a volume?
  At the kernel.log still "corrupt leaf" error at md126:
  ay 06 23:09:57 ArchiBox kernel: RAID conf printout:
  May 06 23:09:57 ArchiBox kernel: --- level:5 rd:5 wd:5
  May 06 23:09:57 ArchiBox kernel: disk 0, o:1, dev:sda3
  May 06 23:09:57 ArchiBox kernel: disk 1, o:1, dev:sdb3
  May 06 23:09:57 ArchiBox kernel: disk 2, o:1, dev:sdc3
  May 06 23:09:57 ArchiBox kernel: disk 3, o:1, dev:sdd3
  May 06 23:09:57 ArchiBox kernel: disk 4, o:1, dev:sde3
  May 06 23:09:57 ArchiBox kernel: md126: detected capacity change from 0 to 1980566863872
  May 06 23:09:57 ArchiBox kernel: BTRFS: device label 33ea3ec3:data devid 1 transid 1021064 /dev/md126
  May 06 23:09:57 ArchiBox kernel: BTRFS info (device md126): has skinny extents
  May 06 23:10:14 ArchiBox kernel: BTRFS error (device md126): qgroup generation mismatch, marked as inconsistent
  May 06 23:10:14 ArchiBox kernel: BTRFS info (device md126): checking UUID tree
  May 06 23:10:15 ArchiBox kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
  May 06 23:10:15 ArchiBox kernel: NFSD: starting 90-second grace period (net ffffffff88d70240)
  May 06 23:10:18 ArchiBox kernel: BTRFS critical (device md126): corrupt leaf, slot offset bad: block=2977215774720, root=1, slot=131
  May 06 23:10:19 ArchiBox kernel: BTRFS critical (device md126): corrupt leaf, slot offset bad: block=2977215774720, root=1, slot=131
  May 06 23:10:39 ArchiBox kernel: nfsd: last server has exited, flushing export cache
  May 06 23:10:39 ArchiBox kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
  May 06 23:10:39 ArchiBox kernel: NFSD: starting 90-second grace period (net ffffffff88d70240)
  May 06 23:11:07 ArchiBox kernel: BTRFS critical (device md126): corrupt leaf, slot offset bad: block=2977215774720, root=1, slot=131
  May 06 23:11:07 ArchiBox kernel: BTRFS critical (device md126): corrupt leaf, slot offset bad: block=2977215774720, root=1, slot=131
  May 06 23:11:07 ArchiBox kernel: BTRFS: error (device md126) in btrfs_drop_snapshot:9412: errno=-5 IO failure
  May 06 23:11:07 ArchiBox kernel: BTRFS info (device md126): forced readonly
  May 06 23:11:07 ArchiBox kernel: BTRFS info (device md126): delayed_refs has NO entry
  
  At the system.log i found a lot of such type of errors:
  
  May 07 00:00:09 ArchiBox dbus[2865]: [system] Activating service name='org.opensuse.Snapper' (using servicehelper)
  May 07 00:00:09 ArchiBox dbus[2865]: [system] Successfully activated service 'org.opensuse.Snapper'
  May 07 00:00:09 ArchiBox snapperd[11094]: loading 305 failed
  May 07 00:00:09 ArchiBox snapperd[11094]: loading 408 failed
  May 07 00:00:09 ArchiBox snapperd[11094]: loading 465 failed
  May 07 00:00:09 ArchiBox snapperd[11094]: loading 528 failed
  May 07 00:00:09 ArchiBox snapperd[11094]: THROW: mkdir failed errno:30 (Read-only file system)
  May 07 00:00:09 ArchiBox snapperd[11094]: CAUGHT: mkdir failed errno:30 (Read-only file system)
  May 07 00:00:09 ArchiBox snapperd[11094]: loading 379 failed
  May 07 00:00:09 ArchiBox snapperd[11094]: loading 482 failed
  May 07 00:00:09 ArchiBox snapperd[11094]: loading 601 failed
  May 07 00:00:09 ArchiBox snapperd[11094]: loading 602 failed
  May 07 00:00:09 ArchiBox snapperd[11094]: loading 603 failed
  May 07 00:00:09 ArchiBox snapperd[11094]: loading 604 failed
  
  May 07 00:10:56 ArchiBox snapperd[12595]: delete snapshot failed, ioctl(BTRFS_IOC_SNAP_DESTROY) failed, errno:30 (Read-only file system)
  May 07 00:10:56 ArchiBox snapperd[12595]: THROW: delete snapshot failed
  May 07 00:10:56 ArchiBox snapperd[12595]: CAUGHT: delete snapshot failed
  May 07 00:10:57 ArchiBox snapperd[12595]: delete snapshot failed, ioctl(BTRFS_IOC_SNAP_DESTROY) failed, errno:30 (Read-only file system)
  May 07 00:10:57 ArchiBox snapperd[12595]: THROW: delete snapshot failed
  May 07 00:10:57 ArchiBox snapperd[12595]: CAUGHT: delete snapshot failed
  May 07 00:10:58 ArchiBox snapperd[12595]: delete snapshot failed, ioctl(BTRFS_IOC_SNAP_DESTROY) failed, errno:30 (Read-only file system)
  May 07 00:10:58 ArchiBox snapperd[12595]: THROW: delete snapshot failed
  May 07 00:10:58 ArchiBox snapperd[12595]: CAUGHT: delete snapshot failed
  
  Something wrong with a snapshots. may be i should delete them all..
  Anyway great thanks to you!
  - StephenB
    Guru - Experienced User
    May 08, 2021
    Archdivan wrote:
    
    Wow. It works. Now i have mounted read-only volume. At least i can make a backup of my data before next experiments.
    
    So next step i should find hdd causes an error, remove it and totally rebuild a volume?
    
    That makes sense to me. Though the disks could be fine, they should be tested. Rebuilding the volume (not just resyncing) is the the safest way to make sure all these errors are cleaned up.
    
    After the backup, I suggest running the disk test from the volume settings wheel. That will run the extended SMART test on all drives in the volume - that won't pick up all failures (I have had bad drives pass that test), but it still is pretty good.
    
    Then you could try either destroying/recreating the volume, or doing a factory default (rebuilding the NAS from scratch). They are about the same amount of work (either way you need to reinstall any apps, recreate all the shares, and restore all data from backup), so I'd go with the factory default myself.
    
    Another path is to connect the disks to a Windows PC after the disk test (either with SATA or a USB adapter/dock) after the disk test. Then run the write zeros / erase disk test in the vendor diagnostics (Seatools for Seagate, the Digital Dashboard for Western Digital). That will pick up any bad sectors that the non-destructive SMART test misses. After that completes, you put the disks back in the NAS and power up. The NAS will do a fresh factory install - and as above, you'd reconfigure the NAS and restore data from backup.