Procedure for responding to a disk failure

WhySoManyUserna · ‎2019-08-27

Hi, tonight I got an automated email titled "Disk removal detected.", went to RAIDar, and the widget displayed this error message: "disk5: device has failed. Data most certainly lost." I've had this device for years and have been very lucky to avoid any issues until now. So before I start yanking/replacing disks and maybe making things worse, I wanted to check if there are any how-tos / procedure guides for responding to a probable disk failure.

If my (possibly) outdated build documentation is correct, the device is configured as follows:

Configuration: RAID Level 5, 4 disks
Status: Redundant
Ch 1: Disk in, allocated to volume
Ch 2: Disk in, allocated to volume
Ch 3: Disk in, allocated to volume
Ch 4: Disk in, allocated to volume
Ch 5: Disk in, unallocated
Ch 6: Disk in, unallocated

Sandshark · ‎2019-08-27

If drive 5 was not part of the array, there would be no chance of data loss. Assuming you had drives 5 and 6 as spares, it sounds as if at least one, perhaps both, have been changed to active over the years you've had it. You are going to need more information than RAIDar can provide to determine what is going on: at a minimum, the data from the GUI, including the log, and possibly other logs from the log download .zip.

StephenB · ‎2019-08-28

Please do download the log zip file, and copy/paste the contents of mdstat.log in a reply here. The "insert code" tool - </> in the posting toolbar - is a good way to do that.

WhySoManyUserna · ‎2019-08-28

Fourth attempt to reply to responses. For some reason my posts look like they are saved to this thread, but when I reload the page, *poof* my reply is missing. Here is the mdstat.log file output. I am unable to interpret exactly what this says about my device, so interpretation help is appreciated~ I've also included a screenshot of the Volume configuration screen via the GUI.

I rebooted the device today. It booted up, although it seemed to take slightly longer than usual. I saw a progress indicator on the front display before the numbered disk display came up. Disk5 on the front display panel still shows empty.

Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md2 : active raid5 sda3[0] sdf3[5](S) sdd3[3] sdc3[2] sdb3[1]
      2916120576 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      
md1 : active raid5 sda2[0] sdf2[6] sdd2[4] sdc2[2] sdb2[1]
      2621120 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
      
md0 : active raid1 sda1[0] sdf1[6] sdd1[4] sdc1[2] sdb1[1]
      4193268 blocks super 1.2 [5/5] [UUUUU]
      
unused devices: <none>

StephenB · ‎2019-08-29

@WhySoManyUserna wrote:

Fourth attempt to reply to responses. For some reason my posts look like they are saved to this thread, but when I reload the page, *poof* my reply is missing.

The spam filter triggered. I released this reply (not the others to avoid clutter).

@WhySoManyUserna wrote:

Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md2 : active raid5 sda3[0] sdf3[5](S) sdd3[3] sdc3[2] sdb3[1]
      2916120576 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      
md1 : active raid5 sda2[0] sdf2[6] sdd2[4] sdc2[2] sdb2[1]
      2621120 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
      
md0 : active raid1 sda1[0] sdf1[6] sdd1[4] sdc1[2] sdb1[1]
      4193268 blocks super 1.2 [5/5] [UUUUU]
      
unused devices: <none>

After the reboot, the disks should be sda, sdb, sdc, sdd, sde, sdf. sde isn't shown in any of your volumes.

There is no data volume shown (which would be md127). The other three are used by the OS - md0 being the OS partition itself. There's a tmp volume (not sure exactly what it's used for), and a swap volume.

But there are some other oddities - md1 shows one of six disks missing from that volume (the [UUUU_U] bit). Presumably that is sde. md2 has only 4 disks - showing sdf as a spare. md0 looks ok (with no disk 5).

I don't know what it takes to remount the data volume - some of the underlying issues with md1 and md2 might also have affected it. But it might have had both sde and sdf in it (since md2 apparently did).
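
For anyone who wants to repeat this reading of mdstat.log without doing it by eye, here is a minimal Python sketch. It only handles the line shapes that appear in the output posted above (an "mdN : state level members" header followed by a "[n/m] [UU_]" status line); the names parse_mdstat and MDSTAT are made up for illustration, and other mdstat layouts (inactive arrays, resync progress lines) are not covered.

```python
import re

# Sample text copied from the mdstat.log posted earlier in this thread.
MDSTAT = """\
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sda3[0] sdf3[5](S) sdd3[3] sdc3[2] sdb3[1]
      2916120576 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid5 sda2[0] sdf2[6] sdd2[4] sdc2[2] sdb2[1]
      2621120 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]

md0 : active raid1 sda1[0] sdf1[6] sdd1[4] sdc1[2] sdb1[1]
      4193268 blocks super 1.2 [5/5] [UUUUU]

unused devices: <none>
"""

# Assumed line shapes, based only on the sample above.
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.+)$")
MEMBER_RE = re.compile(r"(\w+)\[(\d+)\](\((\w)\))?")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return one dict per md array found in the mdstat text."""
    arrays = []
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, state, level, members = header.groups()
            current = {"name": name, "state": state, "level": level,
                       "members": [], "spares": []}
            for dev, _slot, _paren, flag in MEMBER_RE.findall(members):
                # "(S)" after a member marks it as a spare.
                key = "spares" if flag == "S" else "members"
                current[key].append(dev)
            arrays.append(current)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
            current["map"] = status.group(3)
    return arrays


for array in parse_mdstat(MDSTAT):
    degraded = array["active"] < array["expected"]
    print(array["name"], array["level"], array["map"],
          "DEGRADED" if degraded else "ok",
          "spares:", array["spares"])
```

Run against the posted log, this flags md1 as degraded and lists sdf3 as a spare in md2, which matches the reading above. It says nothing about why the arrays ended up that way.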

WhySoManyUserna · ‎2019-08-29

Hmmmm thank you for the explanation, although I'm not sure what to actually do with your assessment~

As far as I can recollect, there is no tool or section in the GUI administration front-end to modify or attempt to fix the oddities you spotted?

I'm thinking what I will do is let my offsite backup sync complete (regrettably I had just moved a large chunk of data to the NAS right before disk5 became problematic), remove disk5 and put a new disk in. If it all goes **bleep** up, I'll reformat disks 1-6, rebuild the box, and restore data from the offsite cloud. Hmmmmm might also take a snapshot of key data components to a 1TB USB drive I have lying around, just in case.

StephenB · ‎2019-08-29

@WhySoManyUserna wrote:

Hmmmm thank you for the explanation, although I'm not sure what to actually do with your assessment~

As far as I can recollect, there is no tool or section in the GUI administration front-end to modify or attempt to fix the oddities you spotted?

There is no tool in the GUI, and I'm not certain how to deal with the main issue (which is of course the data volume). @Sandshark's original concern (that some drives were marked as spares in at least some RAID groups) appears to be confirmed. And we don't have any way to sort out exactly how the data volume was constructed.

@WhySoManyUserna wrote:

I'm thinking what I will do is let my offsite backup sync complete (regrettably I had just moved a large chunk of data to the NAS right before disk5 became problematic), remove disk5 and put a new disk in. If it all goes **bleep** up, I'll reformat disks 1-6, rebuild the box, and restore data from the offsite cloud. Hmmmmm might also take a snapshot of key data components to a 1TB USB drive I have lying around, just in case.

My guess is you'll end up doing the factory reset and rebuild everything. I'd also suggest testing the disks in a Windows PC with vendor tools first (Lifeguard for Western Digital, Seatools for Seagate). It's not clear to me what happened, but disk errors on some of the disks might be part of the puzzle.

WhySoManyUserna · ‎2019-10-16

Follow-up information: using the WD Lifeguard disk diagnostic tool, I have the following results. Does this change your recovery procedure suggestions?

Disk 1 - Quick Test PASS, Extended test PASS

Disk 2 - Quick Test PASS, Extended test FAILED with error code "08 too many bad sectors detected". Strange the Quick scan succeeded.

Disk 3 - Quick Test PASS, Extended test PASS

Disk 4 - Quick Test PASS, Extended test PASS

Disk 5 - FAILED, expected

Disk 6 - Quick Test PASS, Extended test PASS

StephenB · ‎2019-10-16

@WhySoManyUserna wrote:

Disk 2 - Quick Test PASS, Extended test FAILED with error code "08 too many bad sectors detected". Strange the Quick scan succeeded.

Not really strange. The bad sectors weren't detected by the drive until they were read.

@WhySoManyUserna wrote:

Does this change your recovery procedure suggestions?

My suggestion is to do a factory reset with all the good drives in place, and restore data from backup.

If you have the time, perhaps run the full "Erase" test in Lifeguard on the disks that passed. That can turn up issues that the non-destructive extended pass won't catch.

Sandshark · ‎2019-10-16

I assume this is a RAIDiator 4.2.x system. It's been a long time since I worked with a 4.2.x system, but I think md2 is the main (C) volume. It would be md127 under OS6.

That being the case, it looks like your data is currently intact, but may become unprotected if drive 2 fails, which is likely. Drive 5 was never a part of that array. Why md1 is a RAID5 in the first place, I'm not sure, but drive 5 was part of it and is now missing. md0 probably already did a re-sync without drive 5, since it's a multi-redundant RAID1.

Backing up and doing a factory default with current drives 1, 3, 4 & 6 plus two new ones is your best solution. You can't replace drive 2 at this point because that will break volume md1, since it's a RAID5 and already missing one drive partition. Replacing drive 5 first will put everything at risk because the resync will stress drive 2, possibly to the breaking point. But it does only need to re-sync the small md1, so it could survive. As long as you have a back-up in case of disaster, you could try that, followed by drive 2.

Maybe somebody still working with a 4.2.x system can explain why md1 is a RAID5 at all and how to safely get it down to just the 5 good drives with redundancy so you can start by replacing drive 2. I know how to get the RAID down to just the 5 drives, but don't know what will happen to the ext partition it contains when you do that.
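
To make the "can't pull drive 2 yet" reasoning concrete, here is a rough Python sketch of the arithmetic. It assumes the nominal tolerances of each md RAID level (RAID5 survives one missing member, RAID6 two, RAID1 survives as long as one mirror remains) and takes the expected/active counts from the [n/m] fields in the posted mdstat.log. The function name and table are illustrative only, and it ignores the real-world risk that a marginal disk such as drive 2 fails during a resync.

```python
# Nominal number of member losses each parity/striped md level tolerates.
# RAID1 is handled separately: it survives while at least one mirror remains.
PARITY_TOLERANCE = {"raid0": 0, "raid4": 1, "raid5": 1, "raid6": 2}


def remaining_tolerance(level, expected, active):
    """Return how many more members can be lost before the array fails."""
    if level == "raid1":
        return max(active - 1, 0)
    missing = expected - active
    return max(PARITY_TOLERANCE[level] - missing, 0)


# (level, expected, active) taken from the [n/m] fields in the posted mdstat.log
ARRAYS = {
    "md0": ("raid1", 5, 5),
    "md1": ("raid5", 6, 5),
    "md2": ("raid5", 4, 4),
}

for name, (level, expected, active) in ARRAYS.items():
    left = remaining_tolerance(level, expected, active)
    print(f"{name}: {level} [{expected}/{active}] can lose {left} more member(s)")
```

With those numbers md1 has no margin left, which is why removing drive 2 now would break it, while md2 can still lose one member but would then be running unprotected.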

StephenB · ‎2019-10-17

@Sandshark wrote:

I assume this is a RAIDiator 4.2.x system. It's been a long time since I worked with a 4.2.x system, but I think md2 is the main (C) volume.

Maybe somebody still working with a 4.2.x system can explain why md1 is a RAID5 at all and how to safely get it down to just the 5 good drives with redundancy so you can start by replacing drive 2.

md2 is the data volume (C).

md1 is the OS swap partition (which has no file system on it). It is RAID-6 on my Pro-6 system (which has 6 disks). Not sure why it's RAID5 in @WhySoManyUserna's system, or what would happen if md1 fails.

I guess one option would be to try:

1. replacing disk 5, and hope the resync works
2. replacing disk 2 if it does, and wait for resync
3. then hot-remove, hot-insert disk 3 to attempt to get it added properly to md1.

But I'd personally do the destructive write test on the drives that passed the read test, and rebuild the array from scratch with working disks (w/o spares). If you do that, then consider converting the NAS to run OS 6 software as part of the process.
