
Firmware update of RN516 with two EDA500 failed

Sandshark
Sensei


I had not updated my RN516 firmware since I added my two EDA500's.  I just updated from 6.7.4 to 6.8.1, and it failed.  The firmware loaded, but on reboot the NAS did not come up properly -- it got part way and then restarted.  After a couple of power cycles to see if that would make a difference, I ultimately powered down and disconnected the EDA500's.  The NAS then booted, stating that it was doing a firmware update (meaning it hadn't before), and then ran normally except that the EDA500 volumes were "phantoms" (unknown RAID type).

 

I powered down and re-connected one EDA500, and it again booted properly with just one phantom volume; I then did the same for the other EDA500 and all appeared normal.  I then found the shares on the EDA volumes were not showing up in Windows.  I went in to check the permissions and found every protocol except SMB disabled.  I reset all share protocols to what I previously had, including toggling SMB off and on, and that fixed it.  I realized that is exactly what one has to do after importing a volume that was exported (see /Experiments-with-exporting-and-importing-a-volume-in-OS6-7.1+).  I was happy it all worked, though a bit surprised it seemed to have re-imported the volume when there was already a phantom of it.  I certainly didn't want to have to start restoring backup data.

 

The better option would probably have been to put it in support mode and get help, but I'm not the original purchaser and have no warranty support.  I don't know if it would even have booted into support mode -- I didn't try.  I also wish I had tried disconnecting only one EDA500 first -- I don't know why I didn't.

 

So once it was all operational, I went in to verify that all the volumes looked normal and to see what might have caused the problem.  An lsblk command over SSH gave me this:

NAME      MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda         8:0    0   5.5T  0 disk
├─sda1      8:1    0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
├─sda2      8:2    0   512M  0 part
│ └─md1     9:1    0   1.5G  0 raid10 [SWAP]
├─sda3      8:3    0   3.6T  0 part
│ └─md127   9:127  0  18.2T  0 raid5  /data
└─sda4      8:4    0   1.8T  0 part
  └─md126   9:126  0   9.1T  0 raid5
sdb         8:16   0   5.5T  0 disk
├─sdb1      8:17   0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
├─sdb2      8:18   0   512M  0 part
│ └─md1     9:1    0   1.5G  0 raid10 [SWAP]
├─sdb3      8:19   0   3.6T  0 part
│ └─md127   9:127  0  18.2T  0 raid5  /data
└─sdb4      8:20   0   1.8T  0 part
  └─md126   9:126  0   9.1T  0 raid5
sdc         8:32   0   5.5T  0 disk
├─sdc1      8:33   0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
├─sdc2      8:34   0   512M  0 part
│ └─md1     9:1    0   1.5G  0 raid10 [SWAP]
├─sdc3      8:35   0   3.6T  0 part
│ └─md127   9:127  0  18.2T  0 raid5  /data
└─sdc4      8:36   0   1.8T  0 part
  └─md126   9:126  0   9.1T  0 raid5
sdd         8:48   0   5.5T  0 disk
├─sdd1      8:49   0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
├─sdd2      8:50   0   512M  0 part
│ └─md1     9:1    0   1.5G  0 raid10 [SWAP]
├─sdd3      8:51   0   3.6T  0 part
│ └─md127   9:127  0  18.2T  0 raid5  /data
└─sdd4      8:52   0   1.8T  0 part
  └─md126   9:126  0   9.1T  0 raid5
sde         8:64   0   5.7T  0 disk
├─sde1      8:65   0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
├─sde2      8:66   0   512M  0 part
│ └─md1     9:1    0   1.5G  0 raid10 [SWAP]
├─sde3      8:67   0   3.6T  0 part
│ └─md127   9:127  0  18.2T  0 raid5  /data
└─sde4      8:68   0   1.8T  0 part
  └─md126   9:126  0   9.1T  0 raid5
sdf         8:80   0   5.5T  0 disk
├─sdf1      8:81   0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
├─sdf2      8:82   0   512M  0 part
│ └─md1     9:1    0   1.5G  0 raid10 [SWAP]
├─sdf3      8:83   0   3.6T  0 part
│ └─md127   9:127  0  18.2T  0 raid5  /data
└─sdf4      8:84   0   1.8T  0 part
  └─md126   9:126  0   9.1T  0 raid5
sdg         8:96   0   3.7T  0 disk
├─sdg1      8:97   0     4G  0 part
├─sdg2      8:98   0   512M  0 part
├─sdg3      8:99   0   2.7T  0 part
│ └─md125   9:125  0  10.9T  0 raid5  /eda1
└─sdg4      8:100  0 931.5G  0 part
  └─md124   9:124  0   3.7T  0 raid5
sdh         8:112  0   3.7T  0 disk
├─sdh1      8:113  0     4G  0 part
├─sdh2      8:114  0   512M  0 part
├─sdh3      8:115  0   2.7T  0 part
│ └─md125   9:125  0  10.9T  0 raid5  /eda1
└─sdh4      8:116  0 931.5G  0 part
  └─md124   9:124  0   3.7T  0 raid5
sdi         8:128  0   3.7T  0 disk
├─sdi1      8:129  0     4G  0 part
├─sdi2      8:130  0   512M  0 part
├─sdi3      8:131  0   2.7T  0 part
│ └─md125   9:125  0  10.9T  0 raid5  /eda1
└─sdi4      8:132  0 931.5G  0 part
  └─md124   9:124  0   3.7T  0 raid5
sdj         8:144  0   3.7T  0 disk
├─sdj1      8:145  0     4G  0 part
├─sdj2      8:146  0   512M  0 part
├─sdj3      8:147  0   2.7T  0 part
│ └─md125   9:125  0  10.9T  0 raid5  /eda1
└─sdj4      8:148  0 931.5G  0 part
  └─md124   9:124  0   3.7T  0 raid5
sdk         8:160  0   3.7T  0 disk
├─sdk1      8:161  0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
├─sdk3      8:163  0   2.7T  0 part
│ └─md125   9:125  0  10.9T  0 raid5  /eda1
└─sdk4      8:164  0 931.5G  0 part
  └─md124   9:124  0   3.7T  0 raid5
sdl         8:176  0   1.8T  0 disk
├─sdl1      8:177  0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
└─sdl3      8:179  0   1.8T  0 part
  └─md123   9:123  0   7.3T  0 raid5  /eda2
sdm         8:192  0   1.8T  0 disk
├─sdm1      8:193  0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
└─sdm3      8:195  0   1.8T  0 part
  └─md123   9:123  0   7.3T  0 raid5  /eda2
sdn         8:208  0   1.8T  0 disk
├─sdn1      8:209  0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
└─sdn3      8:211  0   1.8T  0 part
  └─md123   9:123  0   7.3T  0 raid5  /eda2
sdo         8:224  0   1.8T  0 disk
├─sdo1      8:225  0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
└─sdo3      8:227  0   1.8T  0 part
  └─md123   9:123  0   7.3T  0 raid5  /eda2
sdp         8:240  0   1.8T  0 disk
├─sdp1      8:241  0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
└─sdp3      8:243  0   1.8T  0 part
  └─md123   9:123  0   7.3T  0 raid5  /eda2

What I noted is different between the various volumes, and seems like it might be part of the problem, is that while all the drives in EDA1 have the normal partitioning for root, swap, and user data, most of their first partitions are not in md0, and none of their second partitions are in md1.  The second partition of each EDA2 drive isn't in md1 either, so I assume that's normal.  But the fact that drive 5 in eda1 has its first partition in md0 while the rest don't seems like it could be a problem.

What's odder is that all the drives in eda1 were added at the same time, before eda2 was ever added, and I re-connected them in that same order when I disconnected them to resolve this issue.  But eda1 is on eSATA port 3 and eda2 on eSATA port 1, so maybe that's a factor: on boot, I suspect it scans the eSATA ports in order.  Yet the drive labels are in the order of main chassis, eda1 (port 3), then eda2 (port 1).  I frankly have no idea if all this was always the case -- everything was working fine and I never looked at it.
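Rather than eyeballing the whole tree, the membership can be counted by script.  This is just a throwaway parsing sketch (not any ReadyNAS tooling) run against an excerpt of the lsblk output above -- one chassis disk (sda), one eda1 disk whose first partition is not in md0 (sdg), and the one eda1 disk whose first partition is (sdk).  Run over the full listing, it returns the 12 md0 members.

```python
import re

# Excerpt of the lsblk tree output pasted above.
LSBLK = """\
sda         8:0    0   5.5T  0 disk
├─sda1      8:1    0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
├─sda2      8:2    0   512M  0 part
│ └─md1     9:1    0   1.5G  0 raid10 [SWAP]
sdg         8:96   0   3.7T  0 disk
├─sdg1      8:97   0     4G  0 part
├─sdg2      8:98   0   512M  0 part
sdk         8:160  0   3.7T  0 disk
├─sdk1      8:161  0     4G  0 part
│ └─md0     9:0    0     4G  0 raid1  /
"""

def md_members(text, md="md0"):
    """Return the partitions that sit directly above the given md device
    in lsblk's tree output."""
    members, last_part = [], None
    for line in text.splitlines():
        # Drop the tree-drawing characters lsblk prints before the name.
        fields = re.sub(r"^[\s│├└─]+", "", line).split()
        if not fields:
            continue
        name = fields[0]
        if name == md and last_part:
            members.append(last_part)   # an md child follows its parent partition
        elif not name.startswith("md"):
            last_part = name
    return members

print(md_members(LSBLK))  # ['sda1', 'sdk1']
```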

 

 

If the OS update tried to update the first partition on all drives, it couldn't, because some have no mount point; but I don't know if it needs/wants to.  Then I got to thinking that I have 12 drives in md0 -- the maximum number of drives in any ReadyNAS chassis.  So that seems like it could explain why only that many are in md0.  But is that also the root cause of the problem?  Is there something I should do (even if it means destroying and re-creating the eda1 volume) to ensure this will not happen again?  Or is there something Netgear needs to do to accommodate this configuration?  I may very well have an untested configuration of >12 drives, but I'm not going to test beta software on the NAS I use all the time -- I don't even put the latest release on it until I've seen that not too many problems are reported here.  (Case in point: I just went to 6.8.1.)
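One way to check the 12-member theory would be mdadm --detail /dev/md0 over SSH, which reports the array's configured device count.  Here's a sketch that pulls that number out; the sample text is a hypothetical excerpt in standard mdadm shape, with the 12 assumed from the count above rather than captured from the NAS.

```python
import re

# Hypothetical excerpt of `mdadm --detail /dev/md0` output -- the field
# layout is standard mdadm, but these exact values are assumed.
SAMPLE = """\
/dev/md0:
        Version : 1.2
     Raid Level : raid1
   Raid Devices : 12
  Total Devices : 12
"""

def raid_devices(detail_text):
    """Extract the configured 'Raid Devices' count from mdadm --detail output."""
    m = re.search(r"Raid Devices\s*:\s*(\d+)", detail_text)
    return int(m.group(1)) if m else None

TOTAL_DISKS = 16  # 6-bay RN516 plus two 5-bay EDA500s
count = raid_devices(SAMPLE)
print(count)  # 12
if count is not None and count < TOTAL_DISKS:
    print(f"md0 has room for only {count} of {TOTAL_DISKS} first partitions")
```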

Message 1 of 6


All Replies
Sandshark
Sensei

Re: Firmware update of RN516 with two EDA500 failed

No comment from anybody at Netgear?

Message 2 of 6
Tinyhorns
Apprentice

Re: Firmware update of RN516 with two EDA500 failed

Whatever you do, do not upgrade to 6.9.0; that update is seriously broken.

Hopefully they are working on a fix, and it will be out soon.

 

// T

Message 3 of 6
Sandshark
Sensei

Re: Firmware update of RN516 with two EDA500 failed

Still nothing?  Hey, I need to know what I should do when the time comes for the next update.  Was this a fluke, or is there a fundamental problem?  If it's a problem, are you going to fix it or do I need to do something to avoid it?

Message 4 of 6
Skywalker
NETGEAR Expert

Re: Firmware update of RN516 with two EDA500 failed

From the 6.9.1 Beta 1 release notes:

[Beta 1] Fixed a rare issue where a volume with multiple RAID groups may temporarily fail to mount after reboot.

That may have been the issue you hit.  It was a race condition during boot time that has been possible to encounter since 6.7.0.

 

Since EDA500 volumes are so slow, because they sit behind a single SATA port multiplier, they are intentionally kept out of swap.  They are also kept out of the read list for the root volume.

 

Prior to 6.9.0, no attempt would be made to add failed disks to the root RAID array, as long as the data portion was still intact.  Proactive root repair was added in 6.9.0.

Message 5 of 6
Sandshark
Sensei

Re: Firmware update of RN516 with two EDA500 failed

That does sound like a likely scenario, and I'm doubly at risk with two EDA500's.  So, it may happen again when I finally go to 6.9.x, but hopefully not after that.  It'll take a while before I'm sure, but I'll mark that one as most likely solved.  Thanks for the reply.

Message 6 of 6