bspachman
Dec 10, 2024 · Aspirant
Removing inactive volumes to use disk (with no warnings)
Whew--it's been a minute since I've been on these forums. Things have moved around a bit! Still running my battle-tested ReadyNAS Pro. Worked fine on Saturday. Seemed to accomplish its weekly backup tasks overnight Saturday->Sunday. When I went to access the NAS Sunday night from various clients, it was not accessible.
StephenB
Dec 10, 2024 · Guru - Experienced User
bspachman wrote:
Still running my battle-tested ReadyNAS Pro. Worked fine on Saturday. Seemed to accomplish its weekly backup tasks overnight Saturday->Sunday. When I went to access the NAS Sunday night from various clients, it was not accessible.
There are multiple causes of this problem. Some can be repaired using tech support mode, some require connecting the disks to a Windows PC and purchasing RAID recovery software.
If you'd like, I can take a look at the log zip. You'd need to upload it to cloud storage, and PM me a link (using the envelope icon in the upper right of the forum page). Make sure the link permissions allow anyone with the link to download.
- bspachman · Dec 10, 2024 · Aspirant
Thanks for the offer and the subsequent pointers. I'll summarize here for future searching:
- boot_info.log, os_version.log, and disk_info.log show details about the hardware
- dmesg.log shows many things, including the arrays (md[number here]) and which disks and partitions make up each array.
In my case, dmesg.log shows evidence of filesystem corruption:
"BTRFS error"
Looking at kernel.log, there are read and write errors on one of the ATA channels. To quote my log:
Dec 07 23:17:44 kernel: ata4.00: irq_stat 0x40000008
Dec 07 23:17:44 kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Dec 07 23:17:44 kernel: ata4.00: cmd 61/08:80:48:a1:32/00:00:25:00:00/40 tag 16 ncq 4096 out
                        res 41/10:00:48:a1:32/00:00:25:00:00/40 Emask 0x481 (invalid argument) <F>
Dec 07 23:17:44 kernel: ata4.00: status: { DRDY ERR }
Dec 07 23:17:44 kernel: ata4.00: error: { IDNF }
Dec 07 23:17:44 kernel: ata4.00: configured for UDMA/133
Dec 07 23:17:44 kernel: ata4: EH complete
...
Dec 07 23:20:33 kernel: ata4: failed to read log page 10h (errno=-5)
Dec 07 23:20:33 kernel: ata4.00: exception Emask 0x1 SAct 0x2bfa14a0 SErr 0x0 action 0x6 frozen
Dec 07 23:20:33 kernel: ata4.00: irq_stat 0x40000008
Dec 07 23:20:33 kernel: ata4.00: failed command: READ FPDMA QUEUED
Dec 07 23:20:33 kernel: ata4.00: cmd 60/80:28:30:9b:34/00:00:2f:01:00/40 tag 5 ncq 65536 in
                        res 40/00:24:40:20:02/00:00:00:00:00/40 Emask 0x1 (device error)

In my case, it's disk 4 (ata4).
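For anyone trying to confirm which physical disk an ataN channel maps to, this is roughly what can be done over ssh on the live NAS. The /dev/sdX letters and by-path naming below are assumptions (they vary by platform), and smartctl is only there if smartmontools is installed:

dmesg | grep "ata4"                        # confirm which channel is throwing errors
ls -l /dev/disk/by-path/ | grep "ata-4"    # map the channel to a /dev/sdX node
smartctl -a /dev/sdX                       # then inspect that disk's SMART attributes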
Thanks for the suggestions of next steps:
- Power down, remove disk 4, boot
- Perhaps move on to repairing the btrfs file system using "tech support mode"
- Perhaps extract the physical disks, install them in USB/SATA adapters, and see what ReclaiMe RAID recovery software can do.
I'll let folks know how it turns out, but it feels like crossing my fingers and booting without the problematic drive is the way to go. I can then drop in a new drive and see if the array survives a resync.
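If I do go the resync route, this is roughly how I'd keep an eye on it from ssh (md127 is the data array per my logs; adjust as needed):

cat /proc/mdstat               # arrays, member disks, and resync progress
mdadm --detail /dev/md127      # per-array state (clean/degraded) and device roles
watch -n 30 cat /proc/mdstat   # poll resync progress every 30 seconds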
What does repairing the file system involve? Can anyone point me towards 'tech support mode'? I'm no expert, but I can slowly find my way around an ssh session. 🙂
Thanks for the guidance so far!
brad
- bspachman · Dec 11, 2024 · Aspirant
...and the very quick followup:
Removing the failed (failing) disk 4 has revived the entire volume. Of course, I'm getting the angry "data DEGRADED" message on the front panel, but that's to be expected since I've removed a disk from the chassis.
Besides my BTRFS question above, I noticed the following entries in the new kernel.log:
Dec 10 18:54:58 kernel: BTRFS critical (device md126): corrupt leaf: root=1 block=8855732846592 slot=0 ino=282248, name hash mismatch with key, have 0x00000000a0891b17 expect 0x0000000045922146
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732846592 (dev /dev/md127 sector 3537932224)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732850688 (dev /dev/md127 sector 3537932232)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732854784 (dev /dev/md127 sector 3537932240)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732858880 (dev /dev/md127 sector 3537932248)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732862976 (dev /dev/md127 sector 3537932256)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732867072 (dev /dev/md127 sector 3537932264)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732871168 (dev /dev/md127 sector 3537932272)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732875264 (dev /dev/md127 sector 3537932280)
Dec 10 18:54:59 kernel: BTRFS error (device md126): parent transid verify failed on 8855732322304 wanted 228773 found 222978
Dec 10 18:54:59 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732322304 (dev /dev/md127 sector 3537931200)
Dec 10 18:54:59 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732326400 (dev /dev/md127 sector 3537931208)

I'm happy to see corrected read errors :-), but am still a little worried about the out of sync transactions ("parent transid verify failed").
Any thoughts about that?
I'm holding off on dropping my new drive into the chassis for a little while. My instinct is that it would be better to do any filesystem fixes before doing any kind of RAID re-sync by adding a new drive.
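In the meantime, here's roughly how I plan to check whether the transid errors come back after a reboot. I'm assuming the data volume is mounted at /data, which may not match every setup:

btrfs device stats /data       # cumulative per-device error counters
dmesg | grep -i "transid"      # any fresh parent-transid failures since boot
dmesg | grep -i "btrfs"        # the wider BTRFS picture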
Thanks all!
brad
- StephenB · Dec 11, 2024 · Guru - Experienced User
bspachman wrote:
My instinct is that it would be better to do any filesystem fixes before doing any kind of RAID re-sync by adding a new drive.
I don't think so, and personally would replace the drive now. At the moment you have no RAID redundancy, so any errors on another disk would result in data loss (likely of the entire volume). Good choices for a replacement would be a WD Red Plus or a Seagate Ironwolf. DON'T get a WD Red, as that uses SMR (shingled magnetic recording), which doesn't work well with ReadyNAS. (Your existing WD40EFRX disks are not SMR, so no worries there).
Just to give some context: the NAS uses mdadm Linux RAID to create virtual disks - md126 and md127 in your case. The BTRFS file system is installed on those virtual disks, so BTRFS and RAID are independent. Getting RAID redundancy back would not impede any repair.
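If you want to see that layering for yourself over ssh, it looks roughly like this (the md device names are the ones from your logs):

cat /proc/mdstat            # the mdadm layer: md126/md127 and their member partitions
mdadm --detail /dev/md126   # RAID level, array state, and which sdX partitions it spans
btrfs filesystem show       # the BTRFS layer: which md devices the filesystem sits on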
bspachman wrote:
I'm happy to see corrected read errors :-), but am still a little worried about the out of sync transactions ("parent transid verify failed").
If you reboot the NAS, are you seeing any more of these errors?
How full is your data volume? BTRFS will misbehave if it gets too full. I generally recommend keeping at least 15% free space on the volume.
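To check, roughly (again assuming the data volume is mounted at /data):

btrfs filesystem usage /data   # allocated vs. free space at the BTRFS level
df -h /data                    # the conventional view, which can be misleading on BTRFS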
These errors were in your original logs - but were not corrected. It's hard to say now what caused them, but it almost certainly was a consequence of the disk i/o errors. Another possibility is lost cached disk writes due to a power cut or an unclean NAS shutdown.
I also recommend running the maintenance tasks on the volume settings wheel on a regular basis. I schedule one of those tasks each month (scrub, balance, disk test, defrag), so I run each of them 3x a year. Scrub also is a reasonable disk diagnostic (because it reads every sector on every disk), so I place scrub and disk test two months apart. However, I wouldn't run these tasks until the RAID redundancy is restored and volume free space is greater than 15%.
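For what it's worth, I believe those GUI tasks correspond to standard btrfs operations under the hood; from ssh a manual scrub would look roughly like this (assuming a /data mount point, and I'd still treat the GUI as the supported path):

btrfs scrub start /data    # kicks off a background scrub of the mounted volume
btrfs scrub status /data   # progress plus corrected/uncorrectable error counts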