Removing inactive volumes to use disk (with no warnings)
Whew--it's been a minute since I've been on these forums. Things have moved around a bit!
Still running my battle-tested ReadyNAS Pro. Worked fine on Saturday. Seemed to accomplish its weekly backup tasks overnight Saturday->Sunday. When I went to access the NAS Sunday night from various clients, it was not accessible.
I was able to log into the admin console, and System->Volumes showed the dreaded "Remove inactive volumes to use the disk. Disk #1,2,3,4,5,6,1,2,3,4" message. Before researching, I rebooted. Same symptoms. I downloaded the config ZIP and the log ZIP and shut down.
After reading several threads, it feels like the volume might be lost--though I would be happy to be wrong. Primary questions and guidance needed on:
1) Can the data volume be restored to full functionality?
2) Does it appear there is a hardware/disk problem in the machine? I'm not super-adept at deciphering the logs, but perhaps disk 4?
3) Any guesses as to why the machine worked 'fine' on Saturday, 7DEC and has gone belly-up on its next boot (8DEC)? I never received any messages about volume degradation or moving into read-only status. The data volume simply disappeared.
4) Best method to move forward? Replace hardware? If there's a single disk that's bad, will replacing it trigger a rebuild? Am I in for restoring from backup?
FWIW, the NAS has been running X-RAID with four 4TB disks in bays 1-4 and two 2TB disks in bays 5-6, on the current 6.10.10 OS.
Any guidance appreciated. I looked for the 'pay by incident' support option, but was not offered it due to the age of the unit (and even if I could call, I've got an "unsupported" OS on the machine).
All guidance gratefully accepted!
Thanks,
brad
Re: Removing inactive volumes to use disk (with no warnings)
@bspachman wrote:
Still running my battle-tested ReadyNAS Pro. Worked fine on Saturday. Seemed to accomplish its weekly backup tasks overnight Saturday->Sunday. When I went to access the NAS Sunday night from various clients, it was not accessible.
There are multiple causes of this problem. Some can be repaired using tech support mode, some require connecting the disks to a Windows PC and purchasing RAID recovery software.
If you'd like, I can take a look at the log zip. You'd need to upload it to cloud storage, and PM me a link (using the envelope icon in the upper right of the forum page). Make sure the link permissions allow anyone with the link to download.
Re: Removing inactive volumes to use disk (with no warnings)
Thanks for the offer and the subsequent pointers. I'll summarize here for future searching:
- boot_info.log, os_version.log, and disk_info.log show details about the hardware and OS
- dmesg.log shows many things, including the arrays (md[number here]) and which disks and partitions make up each array.
In my case, dmesg.log shows evidence of filesystem corruption:
"BTRFS error"
Looking at kernel.log, there are read and write errors on one of the ATA channels. To quote my log:
Dec 07 23:17:44 kernel: ata4.00: irq_stat 0x40000008
Dec 07 23:17:44 kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Dec 07 23:17:44 kernel: ata4.00: cmd 61/08:80:48:a1:32/00:00:25:00:00/40 tag 16 ncq 4096 out
res 41/10:00:48:a1:32/00:00:25:00:00/40 Emask 0x481 (invalid argument) <F>
Dec 07 23:17:44 kernel: ata4.00: status: { DRDY ERR }
Dec 07 23:17:44 kernel: ata4.00: error: { IDNF }
Dec 07 23:17:44 kernel: ata4.00: configured for UDMA/133
Dec 07 23:17:44 kernel: ata4: EH complete
...
Dec 07 23:20:33 kernel: ata4: failed to read log page 10h (errno=-5)
Dec 07 23:20:33 kernel: ata4.00: exception Emask 0x1 SAct 0x2bfa14a0 SErr 0x0 action 0x6 frozen
Dec 07 23:20:33 kernel: ata4.00: irq_stat 0x40000008
Dec 07 23:20:33 kernel: ata4.00: failed command: READ FPDMA QUEUED
Dec 07 23:20:33 kernel: ata4.00: cmd 60/80:28:30:9b:34/00:00:2f:01:00/40 tag 5 ncq 65536 in
res 40/00:24:40:20:02/00:00:00:00:00/40 Emask 0x1 (device error)
In my case, it's disk 4 (ata4).
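To double-check which physical drive "ata4" maps to, one rough approach over SSH (sdX here is a placeholder for whatever device name your system assigns, and this assumes smartctl/smartmontools is present on the NAS):
ls -l /sys/block/ | grep ata4     # the sdX whose device path contains "ata4" is the suspect drive
smartctl -a /dev/sdX              # then look at its SMART data (reallocated/pending sectors, error log)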
Thanks for the suggestions of next steps:
- Power down, remove disk 4, boot
- Perhaps move on to repairing the btrfs file system using "tech support mode" (a rough read-only check is sketched below)
- Perhaps extract the physical disks, install into USB/SATA adapters and see what ReclaiMe RAID recovery software can do.
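For the record, a read-only filesystem check from tech support mode might look roughly like this. It's only a sketch: it assumes the data volume lives on /dev/md127 (as in my logs) and is not mounted, and I wouldn't run any repair options without expert guidance:
btrfs check --readonly /dev/md127   # inspect the filesystem without writing any changes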
I'll let folks know how it turns out, but it feels like I'll cross my fingers and boot without the problematic drive. I can then drop in a new drive and see if the array survives a resync.
What does repairing the file system involve? Can anyone point me towards 'tech support mode'? I'm no expert, but I can slowly find my way around an ssh session. 🙂
Thanks for the guidance so far!
brad
Re: Removing inactive volumes to use disk (with no warnings)
...and the very quick followup:
Removing the failed (failing) disk 4 has revived the entire volume. Of course, I'm getting the angry "data DEGRADED" message on the front panel, but that's to be expected since I've removed a disk from the chassis.
Besides my BTRFS question above, I noticed the following entries in the new kernel.log:
Dec 10 18:54:58 kernel: BTRFS critical (device md126): corrupt leaf: root=1 block=8855732846592 slot=0 ino=282248, name hash mismatch with key, have 0x00000000a0891b17 expect 0x0000000045922146
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732846592 (dev /dev/md127 sector 3537932224)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732850688 (dev /dev/md127 sector 3537932232)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732854784 (dev /dev/md127 sector 3537932240)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732858880 (dev /dev/md127 sector 3537932248)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732862976 (dev /dev/md127 sector 3537932256)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732867072 (dev /dev/md127 sector 3537932264)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732871168 (dev /dev/md127 sector 3537932272)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732875264 (dev /dev/md127 sector 3537932280)
Dec 10 18:54:59 kernel: BTRFS error (device md126): parent transid verify failed on 8855732322304 wanted 228773 found 222978
Dec 10 18:54:59 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732322304 (dev /dev/md127 sector 3537931200)
Dec 10 18:54:59 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732326400 (dev /dev/md127 sector 3537931208)
I'm happy to see corrected read errors :-), but am still a little worried about the out of sync transactions ("parent transid verify failed").
Any thoughts about that?
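In the meantime, one way to watch whether errors are still accumulating (a sketch; it assumes the data volume is mounted at /data, which may differ on your unit):
btrfs device stats /data             # per-device counters for read/write/flush/corruption/generation errors
dmesg | grep -i "parent transid"     # any new transid complaints since the last boot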
I'm holding off on dropping my new drive into the chassis for a little while. My instinct is that it would be better to do any filesystem fixes before doing any kind of RAID re-sync by adding a new drive.
Thanks all!
brad
Re: Removing inactive volumes to use disk (with no warnings)
@bspachman wrote:
My instinct is that it would be better to do any filesystem fixes before doing any kind of RAID re-sync by adding a new drive.
I don't think so, and personally would replace the drive now. At the moment you have no RAID redundancy, so any errors on another disk would result in data loss (likely of the entire volume). Good choices for a replacement would be a WD Red Plus or a Seagate Ironwolf. DON'T get a WD Red, as that uses SMR (shingled magnetic recording), which doesn't work well with ReadyNAS. (Your existing WD40EFRX disks are not SMR, so no worries there).
Just to create some context: the NAS uses mdadm Linux RAID to create virtual disks (md126 and md127 in your case), and the BTRFS file system is installed on top of those virtual disks. So BTRFS and the RAID layer are independent, and restoring RAID redundancy would not impede any filesystem repair.
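If you want to see that layering for yourself over SSH, something along these lines works (a sketch; /data is the usual ReadyNAS data volume mount point, but check yours):
cat /proc/mdstat             # the mdadm layer: md126/md127 and the sdX partitions behind them
btrfs filesystem show        # the BTRFS layer: which md device(s) each filesystem sits on
btrfs filesystem df /data    # how BTRFS has allocated data and metadata on the volume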
@bspachman wrote:
I'm happy to see corrected read errors :-), but am still a little worried about the out of sync transactions ("parent transid verify failed").
If you reboot the NAS, are you seeing any more of these errors?
How full is your data volume? BTRFS will misbehave if it gets too full. I generally recommend keeping at least 15% free space on the volume.
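To check, something like this over SSH works (again a sketch, assuming the volume is mounted at /data and a reasonably recent btrfs-progs):
df -h /data                      # quick view of used vs. available space
btrfs filesystem usage /data     # BTRFS's own view of allocated vs. unallocated space (run as root)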
These errors were in your original logs - but were not corrected. It's hard to say now what caused them, but it almost certainly was a consequence of the disk i/o errors. Another possibility is lost cached disk writes due to a power cut or an unclean NAS shutdown.
I also recommend running the maintenance tasks on the volume settings wheel on a regular basis. I schedule one of those tasks each month (scrub, balance, disk test, defrag), so I run each of them 3x a year. Scrub also is a reasonable disk diag (because it reads every sector on every disk), so I place scrub and disk test two months apart. However, I wouldn't run these tasks until the RAID redundancy is restored and volume free space is greater than 15%.
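The scheduled tasks in the admin UI are the supported way to run these, but for reference a scrub can also be started and monitored from an SSH session (a sketch, assuming the volume is mounted at /data):
btrfs scrub start /data      # reads and checksums everything; reports corruption, repairing it where a good copy exists
btrfs scrub status /data     # progress so far and a count of any errors found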