

bspachman
Aspirant

Removing inactive volumes to use disk (with no warnings)

Whew--it's been a minute since I've been on these forums. Things have moved around a bit!

 

Still running my battle-tested ReadyNAS Pro. Worked fine on Saturday. Seemed to accomplish its weekly backup tasks overnight Saturday->Sunday. When I went to access the NAS Sunday night from various clients, it was not accessible.

 

I was able to log into the admin console, and System->Volumes showed the dreaded "Remove inactive volumes to use the disk. Disk #1,2,3,4,5,6,1,2,3,4". Before doing any research, I rebooted. Same symptoms. I then downloaded the config ZIP and the log ZIP and shut down.

 

 

After reading several threads, it feels like the volume might be lost--though I would be happy to be wrong. My primary questions, where I need guidance:

 

1) Can the data volume be restored to full functionality?

2) Does it appear there is a hardware/disk problem in the machine? I'm not super-adept at deciphering the logs, but perhaps disk 4?

3) Any guesses as to why the machine worked 'fine' on Saturday, 7DEC and has gone belly-up on its next boot (8DEC)? I never received any messages about volume degradation or moving into read-only status. The data volume simply disappeared.

4) Best method to move forward? Replace hardware? If there's a single disk that's bad, will replacing it trigger a rebuild? Am I in for restoring from backup?

 

FWIW, the NAS has been running X-RAID with four 4TB disks in bays 1-4 and two 2TB disks in bays 5-6, on the current 6.10.10 OS.

 

Any guidance appreciated. I looked for the 'pay by incident' support option, but was not offered it due to the age of the unit (and even if I could call, I've got an "unsupported" OS on the machine).

 

All guidance gratefully accepted!

 

Thanks,

brad

 

 

Message 1 of 15
StephenB
Guru

Re: Removing inactive volumes to use disk (with no warnings)


@bspachman wrote:

 

Still running my battle-tested ReadyNAS Pro. Worked fine on Saturday. Seemed to accomplish its weekly backup tasks overnight Saturday->Sunday. When I went to access the NAS Sunday night from various clients, it was not accessible.

 


There are multiple causes of this problem.  Some can be repaired using tech support mode, some require connecting the disks to a Windows PC and purchasing RAID recovery software.

 

If you'd like, I can take a look at the log zip.  You'd need to upload it to cloud storage, and PM me a link (using the envelope icon in the upper right of the forum page).  Make sure the link permissions allow anyone with the link to download.

 

 

Message 2 of 15
bspachman
Aspirant

Re: Removing inactive volumes to use disk (with no warnings)

@StephenB 

 

Thanks for the offer and the subsequent pointers. I'll summarize here for future searching:

- boot_info.log, os_version.log, and disk_info.log show details about the hardware

- dmesg.log shows many things including the arrays (md[number here]) and which disks and partitions make up the arrays.

 

In my case, dmesg.log shows evidence of filesystem corruption:

"BTRFS error"

 

Looking at kernel.log, there are read and write errors on one of the ATA channels. To quote my log:

Dec 07 23:17:44 kernel: ata4.00: irq_stat 0x40000008
Dec 07 23:17:44 kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Dec 07 23:17:44 kernel: ata4.00: cmd 61/08:80:48:a1:32/00:00:25:00:00/40 tag 16 ncq 4096 out
                                          res 41/10:00:48:a1:32/00:00:25:00:00/40 Emask 0x481 (invalid argument) <F>
Dec 07 23:17:44 kernel: ata4.00: status: { DRDY ERR }
Dec 07 23:17:44 kernel: ata4.00: error: { IDNF }
Dec 07 23:17:44 kernel: ata4.00: configured for UDMA/133
Dec 07 23:17:44 kernel: ata4: EH complete


...

Dec 07 23:20:33 kernel: ata4: failed to read log page 10h (errno=-5)
Dec 07 23:20:33 kernel: ata4.00: exception Emask 0x1 SAct 0x2bfa14a0 SErr 0x0 action 0x6 frozen
Dec 07 23:20:33 kernel: ata4.00: irq_stat 0x40000008
Dec 07 23:20:33 kernel: ata4.00: failed command: READ FPDMA QUEUED
Dec 07 23:20:33 kernel: ata4.00: cmd 60/80:28:30:9b:34/00:00:2f:01:00/40 tag 5 ncq 65536 in
                                          res 40/00:24:40:20:02/00:00:00:00:00/40 Emask 0x1 (device error)

In my case, it's disk 4 (ata4).
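
To double-check which physical drive actually sits behind ata4 before pulling anything, something like this from an ssh session should work (sdd is only my guess for the ata4 device; confirm against the serial numbers in disk_info.log first):

ls -l /sys/block/sd?        # each symlink target includes the ataN port the disk hangs off
smartctl -i /dev/sdd        # assumed device for ata4 -- verify model and serial before removal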

 

Thanks for the suggestions of next steps:

- Power down, remove disk 4, boot

- Perhaps move on to repairing the btrfs file system using "tech support mode" (see the read-only sketch after this list)

- Perhaps extract the physical disks, install into USB/SATA adapters and see what ReclaiMe RAID recovery software can do.
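
For the middle option, I gather the first pass should be strictly read-only -- presumably something like this once the NAS is reachable in tech support mode or over ssh, with the volume unmounted (md device names taken from my logs):

cat /proc/mdstat            # confirm the md arrays assembled without disk 4
btrfs check /dev/md126      # read-only by default; do NOT add --repair without expert guidance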

 

I'll let folks know how it turns out, but it feels like I'll cross my fingers and try booting without the problematic drive. I can then drop in a new drive and see if the array survives a resync.

 

What does repairing the file system involve? Can anyone point me towards 'tech support mode'? I'm no expert, but I can slowly find my way around an ssh session. 🙂

 

Thanks for the guidance so far!

brad

Message 3 of 15
bspachman
Aspirant

Re: Removing inactive volumes to use disk (with no warnings)

...and the very quick followup:

Removing the failed (failing) disk 4 has revived the entire volume. Of course, I'm getting the angry "data DEGRADED" message on the front panel, but that's to be expected since I've removed a disk from the chassis.

 

Besides my BTRFS question above, I noticed the following entries in the new kernel.log:

Dec 10 18:54:58 kernel: BTRFS critical (device md126): corrupt leaf: root=1 block=8855732846592 slot=0 ino=282248, name hash mismatch with key, have 0x00000000a0891b17 expect 0x0000000045922146
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732846592 (dev /dev/md127 sector 3537932224)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732850688 (dev /dev/md127 sector 3537932232)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732854784 (dev /dev/md127 sector 3537932240)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732858880 (dev /dev/md127 sector 3537932248)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732862976 (dev /dev/md127 sector 3537932256)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732867072 (dev /dev/md127 sector 3537932264)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732871168 (dev /dev/md127 sector 3537932272)
Dec 10 18:54:58 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732875264 (dev /dev/md127 sector 3537932280)
Dec 10 18:54:59 kernel: BTRFS error (device md126): parent transid verify failed on 8855732322304 wanted 228773 found 222978
Dec 10 18:54:59 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732322304 (dev /dev/md127 sector 3537931200)
Dec 10 18:54:59 kernel: BTRFS info (device md126): read error corrected: ino 1 off 8855732326400 (dev /dev/md127 sector 3537931208)

I'm happy to see corrected read errors :-), but am still a little worried about the out of sync transactions ("parent transid verify failed").

 

Any thoughts about that?

 

I'm holding off dropping my new drive into the chassis for a little while. My instinct is that it would be better to do any filesystem fixes before doing any kind of RAID re-sync by adding a new drive.

 

Thanks all!

brad

Message 4 of 15
StephenB
Guru

Re: Removing inactive volumes to use disk (with no warnings)


@bspachman wrote:

My instinct is that it would be better to do any filesystem fixes before doing any kind of RAID re-sync by adding a new drive.

 


I don't think so, and personally would replace the drive now.  At the moment you have no RAID redundancy, so any errors on another disk would result in data loss (likely of the entire volume).  Good choices for a replacement would be a WD Red Plus or a Seagate Ironwolf.  DON'T get a WD Red, as that uses SMR (shingled magnetic recording), which doesn't work well with ReadyNAS.  (Your existing WD40EFRX disks are not SMR, so no worries there).

 

Just to give some context: The NAS uses mdadm Linux RAID to create virtual disks - md126 and md127 in your case. The BTRFS file system is installed on top of those virtual disks. So BTRFS and RAID are independent layers, and getting RAID redundancy back would not impede any later repair.
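
If you ever want to see that layering for yourself from an ssh session, these read-only commands show it (the md names are the ones from your logs):

cat /proc/mdstat               # the md arrays and which sdX partitions belong to each
mdadm --detail /dev/md127      # RAID level, member disks, and clean/degraded state
btrfs filesystem show          # the BTRFS volume sitting on top of the md devices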

 


@bspachman wrote:

I'm happy to see corrected read errors :-), but am still a little worried about the out of sync transactions ("parent transid verify failed").

 


If you reboot the NAS, are you seeing any more of these errors?

 

How full is your data volume?  BTRFS will misbehave if it gets too full.  I generally recommend keeping at least 15% free space on the volume. 
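
Checking that from ssh would look something like this, assuming the volume is named "data" and mounted at /data (the OS 6 default):

df -h /data                    # quick percent-used figure
btrfs filesystem df /data      # BTRFS's own allocation breakdown (data vs. metadata)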

 

These errors were in your original logs - but were not corrected.  It's hard to say now what caused them, but it almost certainly was a consequence of the disk i/o errors.  Another possibility is lost cached disk writes due to a power cut or an unclean NAS shutdown.

 

 

I also recommend running the maintenance tasks on the volume settings wheel on a regular basis.  I schedule one of those tasks each month (scrub, balance, disk test, defrag), so I run each of them 3x a year.  Scrub is also a reasonable disk diagnostic (because it reads every sector on every disk), so I place scrub and disk test two months apart.  However, I wouldn't run these tasks until the RAID redundancy is restored and volume free space is greater than 15%.
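
The GUI is the supported way to run those, but for reference a scrub can also be started by hand over ssh once redundancy and free space are back where they should be (same /data assumption as above):

btrfs scrub start /data        # reads everything on the volume and verifies checksums
btrfs scrub status /data       # progress and error counts while it runs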

 

 

Message 5 of 15
bspachman
Aspirant

Re: Removing inactive volumes to use disk (with no warnings)

@StephenB ,

Thanks for the heads-up on the WD Red drives. Looks like I missed that whole issue when it developed, and it turns out that the emergency drive I had ready and waiting is an SMR drive, whereas all my remaining drives are so old that even the small disks use CMR.

 

So...the RAID stays degraded until a new CMR drive arrives this week. We'll see what happens with the rebuild after that, and then whether I need any filesystem doctoring. I'm fine with regard to free space, but I'll double-check before scheduling any of the disk maintenance tasks (which I have never run).

 

Ultimately, the backup I have of the RAID is up-to-date, so I'm still in a safe space with regards to my data.

 

Looks like 2025 is going to be the year of gradually cycling the old drives out of the RAID and replacing them with new ones. All the drives in my ReadyNAS are at least 8 years old (judging by their manufacturing dates)!

 

My biggest remaining question: does anyone have an idea why my array 'evaporated' instead of falling into a 'degraded' state when one of the disks started throwing errors?

 

Thanks again, and I'll update this thread when I have something to report.

Message 6 of 15
bspachman
Aspirant

Re: Removing inactive volumes to use disk (with no warnings)

Coming back with an update (and some more questions):

 

- New CMR drive (WD Red Plus) arrived Friday, 20DEC.

- Booted up the ReadyNAS and after the boot process was complete, dropped the new drive into slot 4.

- Resync began as expected around 1905 (my time).

- Got an email from the ReadyNAS after about 12 hours (0625 on 21DEC) that "Volume data is resynced"; then "Volume data health changed from Degraded to Redundant."; finally "Disk in channel 4 (Internal) changed state from RESYNC to ONLINE."

- Yay!

 

I pulled a set of logs around 0923 and left the ReadyNAS running. However, later that night (suspiciously, about 12 hours after the success messages), I received more emails:

- 1836: "Volume data health changed from Redundant to Degraded."

- 1836: "Disk in channel 1 (Internal) changed state from ONLINE to FAILED."

 

I didn't see those notifications for a while, so I didn't shut down the ReadyNAS until about 2222. I powered it up once today (22DEC) to pull logs after the latest failure.

 

I know that resync operations can be dangerous when in a degraded state--particularly with older (possibly suspect) hardware, but does it seem likely that the resync completed fine, and then 12 hours later another disk died?

 

Any advice as to next steps will be welcome....

 

Thanks!

Message 7 of 15
StephenB
Guru

Re: Removing inactive volumes to use disk (with no warnings)


@bspachman wrote:

 

I didn't see those notifications for a while, so I didn't shut down the ReadyNAS until about 2222. I powered it up once today (22DEC) to pull logs after the latest failure.

 

I know that resync operations can be dangerous when in a degraded state--particularly with older (possibly suspect) hardware, but does it seem likely that the resync completed fine, and then 12 hours later another disk died?

 

Any advice as to next steps will be welcome....

 


If you send me a PM with the fresh logs, I can take a look.

Message 8 of 15
StephenB
Guru

Re: Removing inactive volumes to use disk (with no warnings)


@StephenB wrote:

@bspachman wrote:

 

I didn't see those notifications for a while, so I didn't shut down the ReadyNAS until about 2222. I powered it up once today (22DEC) to pull logs after the latest failure.

 

I know that resync operations can be dangerous when in a degraded state--particularly with older (possibly suspect) hardware, but does it seem likely that the resync completed fine, and then 12 hours later another disk died?

 

Any advice as to next steps will be welcome....

 


If you send me a PM with the fresh logs, I can take a look.


sda (disk 1) started generating errors towards the end of the resync (which finished at 24/12/21 06:25:36), so it is a bit surprising that the resync actually completed.

Dec 21 06:04:38 HTPC-NAS kernel: do_marvell_9170_recover: ignoring PCI device (8086:2821) at PCI#0
Dec 21 06:04:38 HTPC-NAS kernel: ata1.00: exception Emask 0x0 SAct 0x18 SErr 0x0 action 0x0
Dec 21 06:04:38 HTPC-NAS kernel: ata1.00: irq_stat 0x40000008
Dec 21 06:04:38 HTPC-NAS kernel: ata1.00: failed command: READ FPDMA QUEUED
Dec 21 06:04:38 HTPC-NAS kernel: ata1.00: cmd 60/08:20:30:23:8f/00:00:f3:00:00/40 tag 4 ncq 4096 in
                                          res 41/40:00:30:23:8f/00:00:f3:00:00/40 Emask 0x409 (media error) <F>
Dec 21 06:04:38 HTPC-NAS kernel: ata1.00: status: { DRDY ERR }
Dec 21 06:04:38 HTPC-NAS kernel: ata1.00: error: { UNC }
Dec 21 06:04:38 HTPC-NAS kernel: ata1.00: configured for UDMA/133
Dec 21 06:04:38 HTPC-NAS kernel: sd 0:0:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 21 06:04:38 HTPC-NAS kernel: sd 0:0:0:0: [sda] tag#4 Sense Key : Medium Error [current] [descriptor] 
Dec 21 06:04:38 HTPC-NAS kernel: sd 0:0:0:0: [sda] tag#4 Add. Sense: Unrecovered read error - auto reallocate failed
Dec 21 06:04:38 HTPC-NAS kernel: sd 0:0:0:0: [sda] tag#4 CDB: Read(16) 88 00 00 00 00 00 f3 8f 23 30 00 00 00 08 00 00
Dec 21 06:04:38 HTPC-NAS kernel: blk_update_request: I/O error, dev sda, sector 4086244144
Dec 21 06:04:38 HTPC-NAS kernel: md/raid:md127: read error not correctable (sector 179219072 on sda4).

It is possible that there is some file system corruption, though that depends on whether anything was actually stored on the sectors that couldn't be read.

 

In any event, sda does need to be replaced.
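
If you want a rough picture of how bad sda is before the replacement arrives, something like this over ssh should do it (attribute names vary a bit between drive vendors, so treat it as a sketch):

smartctl -H /dev/sda                                              # overall SMART health verdict
smartctl -A /dev/sda | grep -Ei 'Reallocated|Pending|Uncorrect'   # sector-level error counters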

Message 9 of 15
bspachman
Aspirant

Re: Removing inactive volumes to use disk (with no warnings)

After shipping delays, Disk 1 is now replaced. Another WD Red Plus. Resync triggered fine and completed overnight.

 

However, the saga continues, because now the volume is "Read Only". That error email came through about 6.5 hours into the resync. Sigh.

 

I pulled another set of logs this morning (no reboots yet). There definitely seem to be some BTRFS errors listed, but I'm wondering whether there are any other indications of hardware trouble.

 

If you don't mind taking (yet) another look, I would appreciate it.

 

It certainly feels like I'm heading towards a "Destroy Volume" moment, or even a full factory reset, but I don't want to take either of those steps until we think the underlying hardware issues are sorted.

 

Thank you for any additional insight!

Message 10 of 15
StephenB
Guru

Re: Removing inactive volumes to use disk (with no warnings)


@bspachman wrote:

 

It certainly feels like I'm heading towards a "Destroy Volume" moment, or even a full factory reset, but I don't want to take either of those steps until we think the underlying hardware issues are sorted.

 


I can look at the logs.  I suspect that won't uncover any hardware issues.

Message 11 of 15
StephenB
Guru

Re: Removing inactive volumes to use disk (with no warnings)


@StephenB wrote:

@bspachman wrote:

 

It certainly feels like I'm heading towards a "Destroy Volume" moment, or even a full factory reset, but I don't want to take either of those steps until we think the underlying hardware issues are sorted.

 


I can look at the logs.  I suspect that won't uncover any hardware issues.


Actually, I am seeing two disks generating some UNCs (uncorrectable read errors).

  • sda (serial WD-WCC4E6CTKYVJ)
  • sdd (serial WD-WMC300968738)

You might want to test them.  I'm not sure if the system will let you run the disk test in the volume settings wheel when the volume is read-only.  If it does, then I suggest doing that.  The test uses the smartctl long test on all disks in the array in parallel - it takes a while (hours) to finish.
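
If the GUI won't run it on a read-only volume, the same test can be started by hand over ssh, one disk at a time (device names here are assumptions -- check them against the serials above):

smartctl -t long /dev/sda        # start the drive's built-in extended self-test
smartctl -l selftest /dev/sda    # check back a few hours later; results show up here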

Message 12 of 15
bspachman
Aspirant

Re: Removing inactive volumes to use disk (with no warnings)


@StephenB wrote:

You might want to test them.  I'm not sure if the system will let you run the disk test in the volume settings wheel when the volume is read-only.  If it does, then I suggest doing that.  The test uses the smartctl long test on all disks in the array in parallel - it takes a while (hours) to finish.

Due to my schedule, I'm only able to work on this problem intermittently (hence the delay in replies/confirmations). 🙂

 

In this specific case, I was able to run the disk test while the volume was mounted read-only. The disk test confirmed what is visible in the logs--namely, that those two disks are failing (or have already failed).

 

Since that makes 4 of 6 disks showing hardware problems, I'm simply going to replace all of the disks. All the original disks are from around the same time period, and even though the ReadyNAS only runs a few hours a day, having the original disks last 7-9 years is a darned good run.

 

The original disks were a mix of capacities (4 @ 4TB; 2 @ 2TB), so I'm going to match the whole set this time (6 @ 4TB).

 

What's my most efficient way forward?

- Since I've already replaced and rebuilt 2 disks, can I simply power down, replace the remaining 4 disks all at once, create a new volume and restore from backup?

- Or do I need to perform a factory reset procedure before (or after) getting all the new disks in place?

 

Guidance on the best use of my time will be gratefully received (as will links to the best order of operations).

 

Thanks!

Message 13 of 15
StephenB
Guru

Re: Removing inactive volumes to use disk (with no warnings)


@bspachman wrote:

What's my most efficient way forward?

- Since I've already replaced and rebuilt 2 disks, can I simply power down, replace the remaining 4 disks all at once, create a new volume and restore from backup?

- Or do I need to perform a factory reset procedure before (or after) getting all the new disks in place?

 


I suggest installing all the new drives at once, and then doing the factory reset.   You will be able to do that from the web UI System->Settings page.  You don't need to wait for the initial sync to complete before you do the reset.

 

Do you need to reinstall any apps on the NAS?  If so, which ones?

Message 14 of 15
bspachman
Aspirant

Re: Removing inactive volumes to use disk (with no warnings)


@StephenB wrote:

Do you need to reinstall any apps on the NAS?  If so, which ones?


Nope--no apps, just the 'regular' configuration of shares, users, backup tasks, etc. I do have ssh access enabled, and I do use one of the cloud backup services, but I think all of that setup is also included in the archived CONFIG zip that I downloaded. So, to double-check:

 

- Install new drives

- Factory reset (from web admin UI)

- Restore config from previously downloaded zip file

- Make new volume

- Restore from backup drives

 

Looks like I know what I'm doing next weekend <sigh> 🙂

 

Thank you for the help and guidance!

Message 15 of 15