Volume scan failed to run properly #26188803

JimTho · ‎2015-12-17

I have a ReadyNAS Ultra 4 with four disks in X-RAID 2. One disk had several ATA errors, so I replaced the old 1.5 TB disk with a new 4 TB disk. After a reboot the NAS started to sync, and after about 1 hour I noticed the message "Update finished. Please reboot the device" in the web GUI (RAIDiator). So I rebooted the device. After the reboot I received the error "Volume scan failed to run properly".

I contacted support and got instructions on how to update the OS via this link: http://kb.netgear.com/app/answers/detail/a_id/21104/~/how-do-i-access-the-boot-menu-on-my-readynas-u... This did not solve the problem, and I have replied to my support ticket.

I have also searched this site for solutions, and in all the posts that I have come across the response is "Submit a ticket" or "Contact support", after which the threads go dead. So I promise this forum that I will post the solution to this problem if I am given one that works, as I am clearly not the only one to have come across this problem in the last few years. With so many posts addressing it, I am stunned that there is no good "guide" or approach available.

I am willing to pay to get this fixed, as I have 6-7 TB of data with little backup (due to two failures happening at the same time: I have another Ultra 4 as a backup unit, and I removed files and folders on it at the same time as I replaced the disk on the main NAS, another story...).

I would appreciate any help or input on this issue.

Cheers, Jimmy

StephenB · ‎2015-12-17

Did you try the boot menu option to "skip volume check"?

JimTho · ‎2015-12-17

Thank you, StephenB, for your suggestion. Unfortunately, I still have no volume on the NAS. I have tried the following, with no solution to the problem:

1. Replaced the old (dead) drive and restarted the NAS.
2. Removed the faulty drive and ran with only 3 drives (drive 4 removed).
3. Replaced the newly updated firmware with the previous one (which was working before the "crash").
4. Rebooted with the "reinstall OS" option from the boot menu, as suggested by Netgear Tech Support.
5. Rebooted with the "skip volume check" option from the boot menu.

I assume I will have to pay Netgear for their "Data recovery contract", a 155 euro, no-guarantee attempt to fix the problem using Telnet. I will contact them tomorrow. If this solves the problem, I am not sure I will have access to what the Netgear tech did on the unit, so I may not be able to provide any help for other users. Anyway, I will post the outcome here.

StephenB · ‎2015-12-17

You can potentially start with per-incident support (the data might be intact if they can mount the volume). That is cheaper than data recovery.

JimTho · ‎2015-12-23

Ok, time for an update.

Tech support has come back to me and asked for the log files, which I have attached to the case.

After looking at them, support asked me to remove the drive in slot 4, after which the volume should be available. Unfortunately, I have tried that already, and it did not solve the problem. I hope that my 1 hour of support is not spent without someone logging into the NAS.

I have also removed the three functional drives from the NAS, attached them to my computer, and run the "ReclaiMe" software. All files/folders seem to have been "identified" by the software, but I can only recover the files if I purchase it ($200).

Hypothesis:

During the sync and resizing of the volume, the NAS was rebooted and the firmware was automatically updated. This may have altered the start and end blocks of each volume on the drives (LVM header?) because of the volume expansion, without these changes being picked up by mdadm. (This is speculation, as I am not fully aware of the details here.)

Would it be possible to use software to scan the drives, identify the start and end blocks, and compare them with the LVM header/mdadm data to see if they agree? If they do not, could changes be made in mdadm, or somewhere else, so that the volume is identified again by my NAS?

Alternative solutions:

Is there any Linux software that can be used to scan three out of four disks in an X-RAID 2 to recover the files? If not, am I stuck with Level 3 Netgear Tech Support (~€150) or commercial Windows software, e.g. ReclaiMe at $200?

I will keep updating this thread until I have resolved the case. Again, thanks for any suggestions you might have.

JimTho · ‎2015-12-29

Update: 29.12.2015

Tech support has asked me to boot the NAS into "tech support mode" this afternoon. I assume this opens up the NAS to the standard "tech support mode" root login? I feel a little bit worried to leave the unit with this open access for too long. I hope support will be using the "tech support mode" during the day tomorrow. Does anyone know how long I will have to leave the NAS in this mode? I have been waiting since the 18th of December after I paid for Tech Level 2 support, and we are only a few days away from New Year. I understand that Tech Support also has Christmas Holidays so I am not complaining here - just have had no access to my files for more than 12 days now.

I will ask tech support to provide me with their log file after they have accessed the NAS. I assume I will be given all the logs for the session they have while it is in "tech support mode"? Anyone gotten log files after a remote session?

Or, will the NAS store all communications during the remote session for me to use for later troubleshooting?

I think the log is useful for me because:

1. I would have documentation of what has been done to the unit.
2. I would know what worked and what did not.
3. I would know what does not need to be tried again if Level 2 tech support does not succeed.

Well, I will keep the forum updated with the development.

StephenB · ‎2015-12-31

tech support mode would allow telnet access on your local LAN, but that is normally blocked by your router over the internet.

It also allows remote access by Netgear (though they need an access code that you provide them in order to do that)

JimTho · ‎2015-12-31

@StephenB wrote:
tech support mode would allow telnet access on your local LAN, but that is normally blocked by your router over the internet.

It also allows remote access by Netgear (though they need an access code that you provide them in order to do that)

Ok, but I am sure someone can exploit this, and that is why I am reluctant to keep it in this mode for too long.

I have not yet been told to give them a code, so I guess they will not access it this year ...

JimTho · ‎2016-01-14

I promised to keep this post updated, but I am sorry to inform that I do not have much more information to update with yet.

Tech Support has not asked for an "access code" but has accessed the NAS without one.

The NAS seems to still be syncing after 9 days; at least that is the last response I got from Tech Support.

The sync seems to be a very slow process, but I will update this post when I get more information. Tech Support is not yet sure whether the data is OK; I guess they will have to wait for the sync to finish. I am puzzled that the sync can take this long, but I cannot monitor the status myself, as the NAS is still in "debug mode" and has been for the last 15-16 days.

JimTho · ‎2016-01-17

UPDATE!

I got feedback from Tech Support on 15 January:

Tech Support wrote on 15 January 2016:

Hello Jim,

I spoke to my colleague who was working on the NAS.
He explains the issue seems to either be caused at the time the one disk failed, or during the sync when the larger disk was used as a replacement.
It may have tried to reshape the RAID and this has caused the issue. At present it is not something we can fix at level 2.

To escalate this issue to level 3 you would need to purchase a data recovery contract.
The contract is around 155euro.

But it is not clear if the data can be recovered. It is hard to estimate but my colleague who looked at the RAID said he thinks there is about a 50/50% chance.
Please understand its very hard to estimate, even with the data recovery and level 3 involvement there is a chance no data can be recovered.
Due to the nature of disks, raid and storage in general it is not something to which a guarantee can be given.

Let me know if you are interested in the data recovery contract or if you have any questions.

The NAS is still missing the volume. How could L2 tech support start a sync process if the volume was not fixed?

I have connected the drives back to my computer and run the recovery software today. ALL FILES are GONE! The drives cannot be assembled in ReclaiMe. I have also tried "NAS data recovery" from Runtime, to no avail. I am so utterly upset now that I can hardly sit still. Almost 7 TB of data seems to be gone after Netgear Support L2 initiated ("forced"?) a sync on the NAS. I should have recovered the data with the recovery software while I had the chance, but I put my trust in tech support. Now tech support suggests paying 155 euro for L3 to have a look at it.

I have requested all log files to be handed over to me as agreed upon with L2 support when I accepted the terms, all SSH commands/script used during the 15+ days it was in "debug mode". Maybe there is something that can be fixed after I get this data.

Anyone with suggestions of what to do next?

JimTho · ‎2016-01-17

Just got hold of some of the log files from the NAS. Interestingly it seems that the raid.conf file content has changed after L2 involvement:

Old file content raid.conf:

/dev/md0,root!!number=0,chan=0,dev=/dev/sda1,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=0!!number=1,chan=1,dev=/dev/sdb1,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=1!!number=2,chan=2,dev=/dev/sdc1,model=WDC WD20EARS-00MVWB0,sectors=3907027054,raid_disk=2

/dev/md1,swap!!number=0,chan=0,dev=/dev/sda2,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=0!!number=1,chan=1,dev=/dev/sdb2,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=1!!number=2,chan=2,dev=/dev/sdc2,model=WDC WD20EARS-00MVWB0,sectors=3907027054,raid_disk=2!!number=4,chan=3,dev=/dev/sdd2,model=Seagate ST4000VN000-1H4168,sectors=7814037168,raid_disk=3

/dev/md2,C!!number=0,chan=0,dev=/dev/sda3,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=0!!number=1,chan=1,dev=/dev/sdb3,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=1!!number=2,chan=2,dev=/dev/sdc3,model=WDC WD20EARS-00MVWB0,sectors=3907027054,raid_disk=2!!number=3,chan=3,dev=/dev/sdd3,model=Seagate ST31500341AS,sectors=2930275054,raid_disk=3

/dev/md4,swap!!number=0,chan=0,dev=/dev/sda5,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=0!!number=1,chan=1,dev=/dev/sdb5,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=1!!number=2,chan=3,dev=/dev/sdd5,model=Seagate ST4000VN000-1H4168,sectors=7814037168,raid_disk=2

New file content raid.conf:

/dev/md0,root!!number=0,chan=0,dev=/dev/sda1,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=0!!number=1,chan=1,dev=/dev/sdb1,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=1!!number=2,chan=2,dev=/dev/sdc1,model=WDC WD20EARS-00MVWB0,sectors=3907027054,raid_disk=2!!number=4,chan=3,dev=/dev/sdd1,model=Seagate ST4000VN000-1H4168,sectors=7814037168,raid_disk=3

/dev/md1,swap!!number=0,chan=0,dev=/dev/sda2,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=0!!number=1,chan=1,dev=/dev/sdb2,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=1!!number=2,chan=2,dev=/dev/sdc2,model=WDC WD20EARS-00MVWB0,sectors=3907027054,raid_disk=2!!number=4,chan=3,dev=/dev/sdd2,model=Seagate ST4000VN000-1H4168,sectors=7814037168,raid_disk=3

/dev/md2,swap!!number=0,chan=0,dev=/dev/sda3,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=0!!number=1,chan=1,dev=/dev/sdb3,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=1!!number=2,chan=2,dev=/dev/sdc3,model=WDC WD20EARS-00MVWB0,sectors=3907027054,raid_disk=2!!number=4,chan=3,dev=/dev/sdd3,model=Seagate ST4000VN000-1H4168,sectors=7814037168,raid_disk=3

/dev/md4,swap!!number=0,chan=0,dev=/dev/sda5,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=0!!number=1,chan=1,dev=/dev/sdb5,model=WDC WD40EFRX-68WT0N0,sectors=7814037168,raid_disk=1!!number=2,chan=3,dev=/dev/sdd5,model=Seagate ST4000VN000-1H4168,sectors=7814037168,raid_disk=2

I noticed in the new raid.conf there is no /dev/md*,C!

Should it be present? If so, why is it no longer on sda3?

I also noticed that the model name at the end of the lines in the new config file is wrong. I am not sure whether this matters, though.
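The two raid.conf listings are hard to compare by eye because each array is one long line. A minimal sketch of a helper that condenses each line to the array, its label and its members is shown below. It assumes the `!!`-separated record layout seen in the listings above; the function name `summarize_raid_conf` is made up for illustration and is not part of any ReadyNAS tool.

```shell
# Condense raid.conf lines (read from stdin) to:
#   /dev/mdN LABEL: dev(model) dev(model) ...
# Assumes records are separated by "!!" and fields by ",", as in the listings above.
summarize_raid_conf() {
    awk -F'!!' '
    NF > 1 {
        split($1, head, ",")              # head[1] = /dev/mdN, head[2] = label
        line = head[1] " " head[2] ":"
        for (i = 2; i <= NF; i++) {
            dev = ""; model = ""
            n = split($i, kv, ",")
            for (j = 1; j <= n; j++) {
                if (kv[j] ~ /^dev=/)   dev   = substr(kv[j], 5)
                if (kv[j] ~ /^model=/) model = substr(kv[j], 7)
            }
            line = line " " dev "(" model ")"
        }
        print line
    }'
}
```

With the old and new contents saved to two files (the file names are up to you), running `summarize_raid_conf < old_file` and `summarize_raid_conf < new_file` and comparing the results with `diff` makes changes such as the label of /dev/md2 going from C to swap stand out.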

StephenB · ‎2016-01-17

I don't see anything in the note that says tech support initiated the resync.

Why are you thinking that they did?

JimTho · ‎2016-01-17

Well, I was told they managed to start a sync:

L2 Tech wrote on 6 January:

Hello Jim,

We have had access and have made some changes.
It looks like the last disk did not sync correctly, this is being done manually.

There is a sync in progress at the moment and this will take a bit of time to complete.
Please leave the NAS in tech support for now and I will update you as we progress.

When I got this information I assumed that they managed to get the Volume up on the NAS and started the sync. So I was thrilled. It seems now that they manually did a sync which deleted all the 7 TB of data. But, I am awaiting a response from the Netgear Tech Support to confirm this. I will go through the SSH logs when they are sent to me. I am not very optimistic about getting my data back, which leaves me rather upset.

StephenB · ‎2016-01-18

@JimTho wrote:

I am not very optimistic about getting my data back, which leaves me rather upset.

Understandably.

Syncing the 4 TB drive to the rest of the array wouldn't change what's on the other disks - all it should do is reconstruct what was on the drive you replaced. So I think there has to be more to this story; hopefully you can sort it out at least.

JimTho · ‎2016-01-18

I agree with you there Stephen. However, if this was done manually and the wrong disk was chosen to be synched with the rest ...

I have still not heard anything from Netgear, which makes me a bit nervous. I hope to get the promised SSH logs to clear things up.

StephenB · ‎2016-01-18

@JimTho wrote:

However, if ... the wrong disk was chosen to be synched with the rest ...

That would be very bad, yes.

JimTho · ‎2016-01-20

Update.

I just got a message from L2 tech today, and they would like to "re-assess" the NAS.

I have again asked for the promised SSH logs from the L2 technician to further understand what has happened.

My understanding is that the underlying problem, the volume failing to be recognised by the OS ("Volume scan failed to run properly"), was not addressed. Instead a manual sync was performed. Given that the OS on the NAS had problems identifying the volume properly, maybe due to a wrong mapping of partitions, how can it be safe to perform a manual sync?

I am waiting for a response, and I must say that I am now disappointed by the time it is taking to get this sorted out.

JimTho · ‎2016-01-27

Netgear Support has not got back to me regarding the SSH log files I requested. Another 7 days with no good news. I am not sure if this is normal, but it is a bad sign for my case.

I have started to search for answers and have found some possible solutions/things to check out.

The intention now is to perform diagnostic "tests" within the next few days.

Too bad I am not able to get information from Netgear about what they did while the unit was in "debug mode", as this would help me with the troubleshooting.

I hope for good news when I start the troubleshooting myself.

JimTho · ‎2016-01-30

Hallelujah!

I have now managed to get my data volume back up! Unfortunately, Netgear L2 tech did not manage to, and I had to do this myself.

As promised, I will give you the results. Note that I am not a trained IT engineer, so I got this from Google searches and dedication. If you decide to do this, it is at your own risk. I take no responsibility for whether it will work on your system.

I had only 4 SATA ports on my PC, so I had to install Linux on a USB stick in order to get all 4 disks connected. I chose Knoppix Linux, but I am sure most Linux distributions would do.

Get Linux installed on the computer:

I installed Knoppix Linux to a USB stick. This was not trivial, as I had a new Z170 motherboard and a regular USB boot would not work with either Universal-USB-Installer-1.9.6.3 or unetbootin-windows-613. I ended up attaching a SATA DVD drive, burning a Knoppix DVD, booting from the DVD and installing Knoppix on the USB stick.

I removed the DVD-RW drive and attached all 4 NAS drives to the PC. I booted into Knoppix (USB) and started to see if I could access the drives. I noticed that Knoppix displayed one mounted volume and a few volumes that were not mounted (these turned out to be LVM physical volumes). I could access the files on the mounted RAID volume, which turned out to be the NAS OS.

I ran several different commands after some Google searching, being careful not to run anything that could make changes to the drives, in case I needed to perform data recovery later.

First I checked the partition tables on all four drives using gdisk:

knoppix@Microknoppix:~$ sudo gdisk /dev/sda

They were identical (output not shown, as I did this in four different windows).

Then I wanted to check the RAID setup and ran:

knoppix@Microknoppix:~$ sudo mdadm --detail --scan

ARRAY /dev/md/4 metadata=1.2 name=A021B7C18D0C:4 UUID=d6301b60:0ce2f767:558c574f:db007ccb

ARRAY /dev/md/1 metadata=1.2 name=A021B7C18D0C:1 UUID=d2791ec8:5adda84e:c7463c2e:c0f2016b

ARRAY /dev/md/0 metadata=1.2 name=A021B7C18D0C:0 UUID=a218f0a3:1b607e2e:953b087b:04ed9c99

INACTIVE-ARRAY /dev/md3 metadata=1.2 name=A021B7C18D0C:3 UUID=5aa62eb3:fa4e39b8:213486da:d587542d

ARRAY /dev/md/2 metadata=1.2 name=A021B7C18D0C:2 UUID=829ccffc:55683ba6:36bb7959:6eed3523

From this I figured out there was an inactive array md3.
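For anyone repeating this check, the inactive array can be picked out of the scan output with a small filter. This is only a sketch: the function name `list_inactive_arrays` is made up for illustration, and it assumes the `INACTIVE-ARRAY` keyword appears at the start of the line exactly as in the output above.

```shell
# Print the device of every array reported as inactive.
# Reads the output of `mdadm --detail --scan` on stdin.
list_inactive_arrays() {
    awk '$1 == "INACTIVE-ARRAY" { print $2 }'
}
```

Typical use would be `sudo mdadm --detail --scan | list_inactive_arrays`, which only reads the array state and changes nothing on the disks.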

Then I used e2fsck to check the partition:

knoppix@Microknoppix:~$ e2fsck /dev/md3

e2fsck 1.42.13 (17-May-2015)

e2fsck: Invalid argument while trying to open /dev/md3

 

The superblock could not be read or does not describe a valid ext2/ext3/ext4

filesystem. If the device is valid and it really contains an ext2/ext3/ext4

filesystem (and not swap or ufs or something else), then the superblock

is corrupt, and you might try running e2fsck with an alternate superblock:

   e2fsck -b 8193 <device>

or

   e2fsck -b 32768 <device>
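A filesystem check only makes sense on a device that holds a filesystem directly. As the lvmdiskscan output later in this post shows, /dev/md3 is an LVM physical volume, so e2fsck was pointed at the wrong layer. A hedged sketch of a check that could be done first: `blkid` reports the signature type of a device, and the hypothetical helper `next_step_for_type` (not part of any tool) maps that type to a next step. The type strings are the ones blkid normally prints; the advice strings are illustrative only.

```shell
# Map a block-device signature type to a suggested next step.
# The argument is the TYPE string printed by:
#   sudo blkid -o value -s TYPE /dev/md3
next_step_for_type() {
    case "$1" in
        ext2|ext3|ext4)    echo "filesystem: a check may apply (read-only first: e2fsck -n)" ;;
        LVM2_member)       echo "LVM physical volume: activate the volume group, then look at the logical volume" ;;
        linux_raid_member) echo "RAID member: assemble the md array first" ;;
        "")                echo "no signature found: the array may be inactive or unreadable" ;;
        *)                 echo "other type ($1): do not run e2fsck here" ;;
    esac
}
```

For example, `next_step_for_type "$(sudo blkid -o value -s TYPE /dev/md3)"` prints the suggestion for /dev/md3 without writing to it.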

This made me think there was a problem with the superblocks on the partitions, which turned out not to be important. After more searching for answers, I decided to stop the arrays and start them again:

knoppix@Microknoppix:~$ sudo mdadm --stop --scan

knoppix@Microknoppix:~$ sudo mdadm --assemble --scan

mdadm: /dev/md/4 has been started with 2 drives (out of 3).

mdadm: restoring critical section

mdadm: /dev/md/3 has been started with 4 drives.

mdadm: /dev/md/2 has been started with 4 drives.

mdadm: /dev/md/1 has been started with 4 drives.

mdadm: /dev/md/0 has been started with 4 drives.

mdadm: Found some drive for an array that is already active: /dev/md/4

mdadm: giving up.
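The assemble messages above mix healthy and degraded arrays. A small sketch for pulling out only the arrays that started with fewer drives than expected is given below. The function name `list_degraded_arrays` is invented for illustration, and the pattern assumes the exact message wording shown above, which may differ between mdadm versions.

```shell
# From saved `mdadm --assemble --scan` messages on stdin, print arrays that
# started with fewer drives than expected, as "array started/expected".
list_degraded_arrays() {
    sed -n 's/^mdadm: \(.*\) has been started with \([0-9]*\) drives (out of \([0-9]*\))\.$/\1 \2\/\3/p'
}
```

Fed the messages above, this would flag /dev/md/4 as 2/3. Since assembling changes array state, it is safer to run the filter on output that has already been saved to a file than to re-run the assemble command just to filter it.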

Then I used lvmdiskscan to see whether the volumes were visible and whether there was a problem with any of them:

knoppix@Microknoppix:~$ sudo lvmdiskscan

/run/lvm/lvmetad.socket: connect failed: No such file or directory

WARNING: Failed to connect to lvmetad. Falling back to internal scanning.

/dev/ram0 [       4.00 MiB]

/dev/md0   [       4.00 GiB]

/dev/ram1 [       4.00 MiB]

/dev/md1   [   1023.88 MiB]

/dev/ram2 [       4.00 MiB]

/dev/md2   [       4.08 TiB] LVM physical volume

/dev/ram3 [       4.00 MiB]

/dev/md3   [     931.50 GiB] LVM physical volume

/dev/ram4 [       4.00 MiB]

/dev/md4   [       3.64 TiB] LVM physical volume

/dev/ram5 [       4.00 MiB]

/dev/ram6 [       4.00 MiB]

/dev/ram7 [       4.00 MiB]

/dev/ram8 [       4.00 MiB]

/dev/ram9 [       4.00 MiB]

/dev/ram10 [       4.00 MiB]

/dev/ram11 [       4.00 MiB]

/dev/ram12 [       4.00 MiB]

/dev/ram13 [       4.00 MiB]

/dev/ram14 [       4.00 MiB]

/dev/ram15 [       4.00 MiB]

/dev/sde1 [       4.46 GiB]

/dev/sde2 [     24.82 GiB]

/dev/sdf1 [       4.46 GiB]

0 disks

21 partitions

0 LVM physical volume whole disks

3 LVM physical volumes
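Most of the listing above is ramdisk noise. A minimal sketch of a filter that keeps only the LVM physical volumes and their sizes follows; the function name `list_physical_volumes` is hypothetical, and the filter assumes the `LVM physical volume` marker at the end of the line as printed above.

```shell
# From `lvmdiskscan` output on stdin, print each LVM physical volume
# with its size, skipping ramdisks, plain partitions and the summary lines.
list_physical_volumes() {
    awk '/LVM physical volume[ \t]*$/ {
        gsub(/\[/, "")              # drop the brackets around the size
        gsub(/\]/, "")
        print $1, $2, $3            # device, size, unit
    }'
}
```

Used as `sudo lvmdiskscan 2>/dev/null | list_physical_volumes`, it reduces the output above to the three md devices that carry the data volume.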

There were 3 physical volumes listed. I followed up with lvdisplay to see the logical volume:

knoppix@Microknoppix:~$ sudo lvdisplay

/run/lvm/lvmetad.socket: connect failed: No such file or directory

WARNING: Failed to connect to lvmetad. Falling back to internal scanning.

--- Logical volume ---

LV Path               /dev/c/c

LV Name               c

VG Name               c

LV UUID               DHaiSO-OE5j-wbTe-rW1L-Zh1L-DNFP-vbPjvA

LV Write Access       read/write

LV Creation host, time ,

LV Status             NOT available

LV Size               6.80 TiB

Current LE             111404

Segments               3

Allocation             inherit

Read ahead sectors     auto

From this I concluded that volume c was not available. I followed up with lvscan:

knoppix@Microknoppix:~$ sudo lvscan

/run/lvm/lvmetad.socket: connect failed: No such file or directory

WARNING: Failed to connect to lvmetad. Falling back to internal scanning.

inactive         '/dev/c/c' [6.80 TiB] inherit

Hmm. The data volume (c) was inactive, even though I had already restarted the arrays using mdadm --assemble --scan. I searched the web further and came across this post, which solved the case: http://pissedoffadmins.com/os/mount-unknown-filesystem-type-lvm2_member.html

knoppix@Microknoppix:~$ modprobe dm-mod

knoppix@Microknoppix:~$ sudo vgchange -ay

/run/lvm/lvmetad.socket: connect failed: No such file or directory

WARNING: Failed to connect to lvmetad. Falling back to internal scanning.

1 logical volume(s) in volume group "c" now active

Voila! The volume came up and I then managed to mount it! I put all the disks back in the Netgear NAS and it booted normally. I am now transferring files to the other backup Netgear NAS as we speak. I guess this will take a bit. Also the 4th disk is now resyncing.
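The post does not show the mount command itself. A cautious sketch of the activate-and-mount step is given below, under several assumptions: the mount point `/mnt/nas` is hypothetical, the helper names `run` and `activate_and_mount_ro` are invented for illustration, and the volume is mounted read-only so that files can be copied off without being modified. By default the script only prints the commands so they can be reviewed first.

```shell
# Sketch: activate volume group "c" and mount its logical volume read-only.
# MNT is a hypothetical mount point. DRY_RUN=1 (the default) only prints
# the commands; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
MNT="${MNT:-/mnt/nas}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

activate_and_mount_ro() {
    run sudo modprobe dm-mod              # device-mapper module, as in the post
    run sudo vgchange -ay c               # activate only volume group "c"
    run sudo mkdir -p "$MNT"
    run sudo mount -o ro /dev/c/c "$MNT"  # read-only mount of the data volume
}
```

Calling `activate_and_mount_ro` with the default settings lists the four commands. Note that on ext3/ext4 a read-only mount may still replay the journal; adding the `noload` mount option avoids that, at the cost of possibly seeing a slightly stale filesystem state.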

Sat Jan 30 17:04:37 CET 2016 System is up.

Sat Jan 30 17:04:37 CET 2016 Volume C is approaching capacity: 88% used 878G available

Sun Jan 17 12:15:59 CET 2016 System is up.

Sun Jan 17 12:15:59 CET 2016 The paths for the shares listed below could not be found. Typically, this occurs when the ReadyNAS is unable to access the data volume. Squeezeboxserver Documents Video media Photos Music

Sun Jan 17 12:15:41 CET 2016 Volume scan failed to run properly.

I hope this can be useful for others, including Netgear L2 support, which in my opinion should have been able to address this issue in the first place rather than leaving me to search the web for possible solutions. If I am able to figure this out (though I do have a PhD in genetics and have been around computers for 25 years), an engineer at Netgear definitely should have been able to fix this easily. In my view this qualifies for a refund! It is also surprising that Netgear does not log its service work to provide proof/documentation of what was done.

I am happy I figured it out, and hope this can be useful for someone else in a similar situation.

mdgm-ntgr · ‎2016-01-31

You purchased a per incident support contract and support advised they would need a data recovery contract in place to escalate this to L3 the day after the case was opened. When you declined to purchase it they continued to provide what support they could.

Data recovery attempts inherently may be unsuccessful. It's important to back up your data if you value it. No important data should be stored on just the one device, no matter which device that is. See Preventing Catastrophic Data Loss

L3 support handles data recovery cases. It is not something which L2 is trained to perform. If you don't know what you're doing when attempting data recovery, you can make things worse. I can see some strange commands in your list. For example, you attempted to run a filesystem check on one of the RAID layers, even though there is a layer between the RAID layers and the filesystem. In this case that command shouldn't have done any damage, as it recognised that there wasn't an EXT filesystem directly on the RAID layer, but with other commands you can do damage.

Every case is different and what "worked" for you may not be appropriate for others. Indeed it's possible from the list of commands that you provided that you could make the problem worse.

With problems like this one needs to carefully identify why one of the RAID layers failed to start and which of the disks to try and bring online. If you bring the wrong disks online then you can cause problems.

One needs to examine why md3 failed to start and whether to leave out the partition on one of the disks when starting the array or not.
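For readers following along, here is a minimal sketch of what such a read-only examination might look like from a Linux rescue system. It is an illustration only, not the procedure Netgear L3 uses. The partition names in MEMBERS are hypothetical placeholders; md3 is simply the array discussed in this thread. By default the script only prints the commands (DRY_RUN=1), and when run for real it only reads metadata.

```shell
#!/bin/sh
# Read-only inspection sketch: nothing here assembles or modifies an array.
# MEMBERS is a hypothetical list of member partitions; substitute your own.
MEMBERS="${MEMBERS:-/dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5}"

# Print commands by default; set DRY_RUN=0 to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

inspect_members() {
    # Which md arrays the kernel currently knows about, and their state.
    run cat /proc/mdstat
    # Per-member RAID superblock: compare "Events" and "Update Time"
    # across members to spot one that is out of date.
    for part in $MEMBERS; do
        run mdadm --examine "$part"
    done
}

inspect_members
```

A member whose event count lags well behind the others is usually the one to be cautious about, which is the kind of decision the post above is warning about.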

You've had some expansion on this system and have a triple layer array which can complicate things considerably when it comes to data recovery.

Now you may get fortunate and find that blindly entering commands you've found on the web works fine, but then again you may not.

Your logs attached to the case show that one of your disks failed very badly.

If not in April, certainly by the alert in September it should have been clear that the disk needed replacing or at the very least that it was advisable to update your backup. A common cause of getting into data recovery situations is leaving a disk in the NAS that needs replacing for a long time after it should be replaced.

JimTho · ‎2016-02-01

Thank you mdgm for making time to comment on this thread. I honestly appreciate what you and others do by participating in the discussion forum. Further, I respect your opinion and can live with the fact that we might not see things the same way. I have shared my experience, the way I experienced it. You might criticize me, which you of course are allowed to do. Given that this is the first comment on this thread, something like “I am happy to see that you managed to get your data back” from you would have warmed my heart a bit before you started a rather long criticism of me and my wrongdoing. No hard feelings ...

@mdgm wrote:
You purchased a per incident support contract and support advised they would need a data recovery contract in place to escalate this to L3 the day after the case was opened. When you declined to purchase it they continued to provide what support they could.

I honestly did not know what types of support existed when I called Netgear. A forum member suggested that I check with Support whether a “pay-per incident” contract would be appropriate, which I did over the phone. Netgear Tech Support confirmed over the phone that this would be a good first go at the problem before they charged my credit card. Again I appreciate that you take time to go through the correspondence, but maybe you do not know the story as well as I do. Remember that I have been part of this experience since mid-December. Besides, even if we were equally involved in the case we might end up with a different experience, which is a normal human response. However, I do not accept that I “declined” L3 support, as Tech Support agreed that a “pay-per incident” would be a good start. Otherwise I would not have paid 60 euro for nothing.

@mdgm wrote:
Data recovery attempts inherently may be unsuccessful. It's important to back up your data if you value it. No important data should be stored on just the one device, no matter which device that is. See Preventing Catastrophic Data Loss

Absolutely, and I cannot agree more. That is why I had, and still have, a dedicated (identical) NAS to perform backups using rsync. If you read carefully, you would see that I indeed had that in place. Also, some of my “very” important data had a backup in the cloud (not mentioned as I did not see it as very relevant at the time).

@mdgm wrote:
L3 support handles data recovery cases. It is not something which L2 is trained to perform. If you don't know what you're doing when attempting data recovery you can make things worse. I can see some strange commands in your list: you attempted to run a filesystem check on one of the RAID layers even though there is a layer between the RAID layers and the filesystem. In this case that command shouldn't have done any damage, as it recognised that there wasn't an EXT filesystem directly on the RAID layer, but with other commands you can do damage.

Again, thank you for bringing this up. I believe you are right. I have had the intention while posting on the board to be transparent – my intention is to get feedback to help me and others. And, you have made me aware that using the e2fsck command is not trivial and should be avoided in this case.

What still puzzles me, and maybe you can clear this up for me: what are L2 engineers trained to do? I was told that they could remotely log in and make changes to see if they could get the volume up. Is this wrong?

As you know, since you have gone through the correspondence, an L2 engineer synced the partitions from one disk in the array to a new disk. The new disk was part of an expanding array. In your opinion that is a job that L2 is qualified to do, right? However, it is not expected that they should be able to run lvdisplay and lvscan to investigate a problem with the volume, followed by modprobe and vgchange? You see, I do not know these things – and I believe my disappointment with Netgear Support, and don’t take this personally, comes down to communication. From my communication with Tech Support over the phone and in writing I was under the impression that the above diagnostics could be expected from L2 – hence I paid, and now I am disappointed. Do you think it is wrong to express this disappointment on the forum?

@mdgm wrote:
Now you may get fortunate and find that blindly entering commands you've found on the web works fine, but then again you may not.

Now, I am sure you do not suggest that I blindly entered commands and by chance managed to get the volume back up? If so what do you think would be the odds for that to have a happy ending?

@mdgm wrote:
Your logs attached to the case show that one of your disks failed very badly.

If not in April, certainly by the alert in September it should have been clear that the disk needed replacing or at the very least that it was advisable to update your backup. A common cause of getting into data recovery situations is leaving a disk in the NAS that needs replacing for a long time after it should be replaced.

It is ok to look back and be wise – sometimes I do that. However, I did not know of the problems with the disks in April or in September. I should have, but I did not. When I did find out, that’s when the real trouble started, right? Again, this is something I learned from. I did not have email alerts set up, which I should have. I learned from this incident, which is a philosophy of mine. I make mistakes, and I learn from them. I gladly share my mistakes for others to learn from them, even though it might make me look silly.

Finishing your comment with what I failed to do to put myself in this situation in the first place is hardly constructive. Remember, I managed to get my volume up and save my 6.8 TB of data, something L2 Tech did not. TATA - a celebration for me! Even if you feel that everything is my fault with the outcome in this case, it still does not change the fact that I did indeed have a bad experience with Tech Support. I believe that this comes down to communication and my expectations as a paying customer. My expectations, which I got from my dialog with Tech Support, were not met. I have learned from this; hopefully so has Tech Support. And maybe also some of the many forum members, if they visit this thread.

StephenB · ‎2016-02-01

@JimTho wrote:

I managed to get my volume up and save my 6.8 TB of data.

And of course that is great news.

@JimTho wrote:

I have learned from this, hopefully so did Tech Support.

My personal experience with tech support is pretty limited. I had a PSU fail on my NV+ last year, and tech support dealt with that swiftly and professionally. Some Netgear folks in the forum have also helped me directly from time to time (particularly with beta hardware and software). That includes both mdgm and skywalker (and probably others over the years).

My overall impression from posts here is that most people who do use tech support are happy with the outcome. But that is not to say that there isn't room for improvement.

I think overall your lessons from this are pretty balanced - some improvements in your own practices, and some areas where you think Netgear should have done better.

mdgm-ntgr · ‎2016-02-01

Some of the advice I have to provide is general in nature and for the benefit of others who may come across the thread. It is certainly commendable that you were able to get back your data, but if others follow the steps you did they may not be so fortunate. Data recovery you attempt yourself is done at your own risk and may reduce the chances of a professional data recovery attempt being successful.

Now, I understand the problem was not resolved by support, but any support we provide is on a best effort basis. There is never any guarantee that the problem will be resolved or on how long it will take. The unit was out of warranty so if you had not purchased support then support would have proceeded to close the case rather than provide the support they did. The cloning of the partition table is an example of work that we did for you. You paid for support and it was provided.

Pay Per Incident is designed for dealing with a one-off problem and does not cover data recovery. When per incident support is purchased it may or may not be immediately obvious whether it will lead to data recovery. If soon after per incident support is purchased it is determined that data recovery is needed and you promptly purchase that, then we may be able to refund per incident support, but in this case there was quite a bit of support done under per incident.

I understand that you see it differently. You could ask support to raise your request for a refund with customer service, who could then review your case and come to their own conclusion.

Once it is determined that a data recovery contract is needed, the case can't be escalated to L3 for troubleshooting until that contract is purchased.

Before you had the call where you purchased the per incident contract I see you were emailed and advised to purchase a data recovery contract.

Well, you tried running e2fsck on the RAID layer, whereas if you run it at all it has to be run on the filesystem. It is also best to run it read-only at first, as a filesystem check may not be advisable in some instances.
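A minimal sketch of that read-only first pass, for illustration only. FS_DEV is a placeholder: /dev/c/c is an assumption about where the logical volume holding the filesystem lives on this kind of unit, so confirm the real path (for example with lvscan) before running anything. The refusal of /dev/md* reflects the layout described in this thread, where LVM sits between the RAID layers and the filesystem; it is not a general rule. By default the script only prints the command (DRY_RUN=1).

```shell
#!/bin/sh
# Sketch: read-only filesystem check on the logical volume, not on an md device.
# FS_DEV is an assumed path; verify it on your own system first.
FS_DEV="${FS_DEV:-/dev/c/c}"

# Print commands by default; set DRY_RUN=0 to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_readonly() {
    dev="$1"
    case "$dev" in
        /dev/md*)
            # On this LVM-over-RAID layout the md devices hold LVM data, not ext.
            echo "refusing: $dev is a RAID layer here, not the filesystem" >&2
            return 1
            ;;
    esac
    # -n opens the filesystem read-only and answers "no" to every prompt.
    run e2fsck -n "$dev"
}

check_readonly "$FS_DEV"
```

Because -n never writes, its report can be reviewed before deciding whether a repairing check is advisable at all.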

If you have an up-to-date backup then in cases like this an alternative would be to wipe the NAS and restore from backup rather than purchase support. Data recovery is there for those who don't value their data enough to store it on multiple devices at all times. At about $200 USD per hour of work performed (there is a Euro price too, not sure what that is) it does cost a bit, but then data recovery requires a lot of highly skilled work.

L2s may be able to do some basic troubleshooting using SSH and fix some problems that L3 has told them how to fix, but more complex problems such as data recovery are for L3s.

Reviewing the case notes the system was determined to be in the middle of expansion and that data recovery would be needed. One of the RAID layers wasn't coming online and it would require some investigation to determine how best to proceed to try and bring it online as safely as possible.

It's my experience that Advanced L2s will try to do what they can to resolve a problem and move the case forward, but they know when they have reached their limits.

Some of the commands you entered showed a lack of knowledge as to how things work (e.g. trying to run a filesystem check on a RAID layer) and also if your list is a complete list then checking to see if it was safe to try and bring up md3 wasn't done either. You may have got out of this fine this time, but others doing the same may not be so fortunate and as this is a public community I have to allow for the fact that others are likely to see this thread.

It's best not to learn how we do RAID, LVM etc. when doing data recovery. It's far better to power down, remove your disks (label order) and put some scratch disks in (must not be from your array) and set up a new volume and then power down, remove the scratch disks (label order) and experiment on the array that uses the scratch disks and doesn't matter, if you must.

I was pointing out that the problem could have been avoided in the first place if the faulty disk had been replaced in a timely manner back in April/September, or at least regular backups done so that could be used instead of data recovery if running into problems with the volume/array.

JimTho · ‎2016-02-11

@mdgm wrote:
It's best not to learn how we do RAID, LVM etc. when doing data recovery. It's far better to power down, remove your disks (label order) and put some scratch disks in (must not be from your array) and set up a new volume and then power down, remove the scratch disks (label order) and experiment on the array that uses the scratch disks and doesn't matter, if you must.

That is ok, but luckily for me I found a solution that worked, in my case. And, I also learned from your feedback as I shared my experience.

@mdgm wrote:
I was pointing out that the problem could have been avoided in the first place if the faulty disk had been replaced in a timely manner back in April/September, or at least regular backups done so that could be used instead of data recovery if running into problems with the volume/array.

In this case the problem, as stated before, was not due to a faulty disk, but arose because I replaced a drive. It could just as well have been a fully functional drive. I rebooted the NAS because the GUI told me to "Please reboot your ReadyNAS device to continue with the update process."

In a mindless moment I thought this had to do with the expansion/new disk (I had not done this for some years) and clicked reboot in the GUI. The update process the GUI reported was in fact an automatic firmware update that took place at the same time. If the OS had been better constructed, or more "idiot-proof", it should not allow a reboot of the device at that point unless you pull the power cord.

I see that you have not addressed all my questions, but you have provided a long reply. I will leave this post as is, and move on to secure my two NAS-units to make the data ready for the next disaster.
