
RN314 problems after Readynas 6.10.7

alfb
Aspirant

RN314 problems after Readynas 6.10.7

After installing 6.10.7 (apparently trouble-free) on an RN314 (4*6TB WD drives) that has been in use for years, the GUI degrades to the point where no data appears in the admin page.

In the short time after boot during which the admin page can be read, it shows a resync at about 34.xx% complete.  Using ssh, cat /proc/mdstat shows 5.9% complete, incrementing very slowly over the past few days.  Converting the minutes left to days shows more than a month to complete the resync!
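For reference, the resync line in /proc/mdstat that I am converting looks roughly like this (numbers illustrative, not my exact output):

  [>....................]  resync =  5.9% (343000000/5855000000) finish=52000.0min speed=1500K/sec

so 52000 min divided by 1440 min per day is roughly 36 days at the current rate.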

 

The log, when available, shows increasing disk errors on one of the four drives.

 

Can I replace the faulty drive with a new one, or must I wait for the resync to complete?

If I can replace the drive, is the proper procedure to hot swap the drive, or should I shutdown, replace the drive and power up?

 

Given the abnormal behaviour of the admin GUI, I am cautious about attempting any of the following:

  1. upgrade to 6.10.8
  2. downgrade to 6.10.6
  3. reinstall 6.10.7

top shows no stopped or zombie processes; CPU time is approximately 50% idle and 50% wait, with no swap usage.  The only significant CPU accumulation goes to md126_raid5.

 

I'd appreciate advice from anyone more knowledgeable than me as to what steps to take.

 

Thanks in advance.

Alf

Message 1 of 18

Accepted Solutions
StephenB
Guru

Re: RN314 problems after Readynas 6.10.7


@alfb wrote:

 

So should I hot swap sdb as drive 2 counting 1,2,3,4, or is sdb drive 2 counting 0,1,2,3?  The output from mdstat implies the drive numbers are 0,1,2,3.

And are these numbers from left to right as looking from the front??

 


One caution is that the error in the screen shot is a read error - normally if mdadm is rebuilding a disk, it is writing to it.  So there still is some uncertainty on exactly what is going on. 

 

sdb is normally the second disk from the left (the disk in the first slot normally is sda).

 

If you aren't certain, you can also get the serial number with smartctl -x /dev/sdb (along with a lot of other information).  Then you could power down the NAS, check the disk serial, and boot up the NAS read-only without it.  If the volume is there, but degraded, then you could reboot again normally. Then do a hot-insert of the replacement.

 

Instructions on booting the NAS read-only are on pages 74-75 here:

 

 

 

 


Message 12 of 18

All Replies
StephenB
Guru

Re: RN314 problems after Readynas 6.10.7


@alfb wrote:

 

Can I replace the faulty drive with a new one, or must I wait for the resync to complete?

 


If the drive that is resyncing is the faulty one, then you don't need to wait.

 

But if it is a different drive, then you do.

 


@alfb wrote:

 

If I can replace the drive, is the proper procedure to hot swap the drive, or should I shutdown, replace the drive and power up?

 


I generally recommend a hot swap.  The system then detects the removal and reinsertion, and doesn't need to figure out that a drive was replaced.

 


@alfb wrote:

 

Given the abnormal behaviour of the admin GUI, I am cautious about attempting any of the following:

  1. upgrade to 6.10.8
  2. downgrade to 6.10.6
  3. reinstall 6.10.7

 


The failing disk could be the cause of the poor performance.  I wouldn't do any of these things at present.

Message 2 of 18
alfb
Aspirant

Re: RN314 problems after Readynas 6.10.7

StephenB,

Thanks for your advice.

 

Re

If the drive that is resyncing is the faulty one, then you don't need to wait.

 

But if it is a different drive, then you do.

 

I am not sure which drive is resyncing.  I will try to copy the output of /proc/mdstat to show you what I know.  It shows md126 with the activity, but no indication that I can see of drive 1, 2, 3, or 4.

 

Perhaps there is a different cli command?

Regards, Alf

 

Message 3 of 18
StephenB
Guru

Re: RN314 problems after Readynas 6.10.7


@alfb wrote:

 

Perhaps there is a different cli command?

 


Try mdadm --detail /dev/md126
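If a member is actually being rebuilt, that output normally includes a "Rebuild Status" line and flags the drive in the device table at the bottom.  Purely for illustration (values hypothetical):

  Rebuild Status : 34% complete

      Number   Major   Minor   RaidDevice State
         0       8        3        0      active sync   /dev/sda3
         4       8       19        1      spare rebuilding   /dev/sdb3
         2       8       35        2      active sync   /dev/sdc3
         3       8       51        3      active sync   /dev/sdd3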

Message 4 of 18
alfb
Aspirant

Re: RN314 problems after Readynas 6.10.7

StephenB, thanks again for your interest.

 

Here is a screen shot of the command output.  I'm still no wiser as to which physical drive is being resynced.

Regards, Alf

Message 5 of 18
StephenB
Guru

Re: RN314 problems after Readynas 6.10.7


@alfb wrote:

I'm still no wiser as to which physical drive is being resynced.


Me neither. I was expecting to see one of the drives showing "rebuilding" as its status at the bottom.

 

Maybe try the same command with /dev/md127 ??

Message 6 of 18
alfb
Aspirant

Re: RN314 problems after Readynas 6.10.7

Hello StephenB,

I am including the screen shot of the command for md127.  It shows no resync.

It also shows the bottom of the md126 command output, with progress at 7%.

Am I wrong in thinking the "devices" md126, md127, md0, and md1 refer to RAID groups and not individual physical drives?

 

Is there a difference between resync and rebuild?

 

Just FYI.  Display panel shows all lights "1 2 3 4 Activity Power" on solid.

Message 7 of 18
Sandshark
Sensei

Re: RN314 problems after Readynas 6.10.7

You are correct that those are RAID groups, composed of partitions on the physical drives.  The commands you issued show what partitions are in each RAID group and what "RAID device" is assigned to each. 

 

If you also look at the results of cat /proc/mdstat, you'll see a list of the current members of the RAID and a string that indicates what's missing.  For example, this is mine for md126:

 

md126 : active raid6 sda3[9] sdl3[15] sdi3[12] sdj3[13] sdk3[14] sdg3[10] sdh3[11] sdf3[16] sde3[4] sdc3[6] sdb3[7] sdd3[8]
      58556728320 blocks super 1.2 level 6, 64k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]

Do you see an underscore in place of one of the U's, indicating the "missing" drive, and one of the drives from the mdadm --detail command not showing up in the mdstat list?
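For a four-disk RAID5 group like yours, a dropped member would look something like this (block count illustrative):

  58553442304 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]

with the underscore marking the missing slot.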

 

volume_util -s raid and/or volume_util -s disk will help you match each md device with its physical location.  Note that channel labels start with 0, so the drive in channel 0 is in bay 1.

 

If that's not the case, then it must never have dropped a partition completely from the array, which may be confusing the issue.

Message 8 of 18
alfb
Aspirant

Re: RN314 problems after Readynas 6.10.7

Thanks for the reply.  There are NO "_" marks in the lists for any of the four (md126, md127, md0, md1); all show four "U"s.

 

I have a new 6TB replacement drive ready for when the resync completes.  But the resync for md126 is still just over 7% after 3 days, according to cat /proc/mdstat.  A sample of the output is included in previous attachments.

 

In this attachment I show the only significant differences in the output of the four mdadm --detail commands, one each for md0, md1, md126, and md127, plus a current output of cat /proc/mdstat, plus a df (where is md126 used?).

 

Regards, and thanks, Alf

 

Message 9 of 18
StephenB
Guru

Re: RN314 problems after Readynas 6.10.7


@alfb wrote:

(where is md126 used?)


md127 and md126 are concatenated together and mounted as your data volume.
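You can confirm that from ssh.  Assuming the volume is mounted at /data, this should list both md devices as members of the same BTRFS filesystem:

  btrfs filesystem show /data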

 


@alfb wrote:

 

I have a new 6TB replacement drive ready for when the resync completes.  But the resync for md126 is still just over 7% after 3 days, according to cat /proc/mdstat.  A sample of the output is included in previous attachments.

 


The puzzle is figuring out exactly what is being rebuilt.  There are some operations (for instance a RAID scrub) that look like a resync, but aren't rebuilding any drive.
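One way to tell them apart from ssh, assuming the standard Linux md sysfs interface, is to ask the kernel what kind of sync is running:

  cat /sys/block/md126/md/sync_action

"check" or "repair" means a scrub-style pass over an intact array, while "recover" (or "resync") means member data really is being rebuilt or resynced.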

 

But if you remove the wrong drive during an actual resync, then you will lose all the data.

 

Have you been able to download the log zip file?  There could be errors in there that would be helpful.

 

You can also try looking at the log from ssh using journalctl -r
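If the full log is too noisy, you can filter it down to kernel disk messages, for example (the pattern is just an illustration):

  journalctl -k -r | grep -iE 'ata|sd[a-d]|error'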

Message 10 of 18
alfb
Aspirant

Re: RN314 problems after Readynas 6.10.7

StephenB,

Thanks for this recommendation.  I used journalctl -r and found many references to errors on sdb and sdb4.

The only way I know to show others is with a screen shot so my output is in .png format.  But this limits me to a single page.

 

So should I hot swap sdb as drive 2 counting 1,2,3,4, or is sdb drive 2 counting 0,1,2,3?  The output from mdstat implies the drive numbers are 0,1,2,3.

And are these numbers from left to right as looking from the front??

 

I am still seeing ~7.5% completion on the resync after 4ish days. So waiting for a completion may see me being a very old man.

 

If swapping sdb ends in failure, I guess my next step is to do the factory reset reboot??  Or is there a wiser way?

Regards, and thanks, Alf

 

Message 11 of 18
StephenB
Guru

Re: RN314 problems after Readynas 6.10.7


@alfb wrote:

 

So should I hot swap sdb as drive 2 counting 1,2,3,4, or is sdb drive 2 counting 0,1,2,3?  The output from mdstat implies the drive numbers are 0,1,2,3.

And are these numbers from left to right as looking from the front??

 


One caution is that the error in the screen shot is a read error - normally if mdadm is rebuilding a disk, it is writing to it.  So there still is some uncertainty on exactly what is going on. 

 

sdb is normally the second disk from the left (the disk in the first slot normally is sda).

 

If you aren't certain, you can also get the serial number with smartctl -x /dev/sdb (along with a lot of other information).  Then you could power down the NAS, check the disk serial, and boot up the NAS read-only without it.  If the volume is there, but degraded, then you could reboot again normally. Then do a hot-insert of the replacement.
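If you only want the drive's identity rather than the full -x dump, smartctl can print just that, e.g.:

  smartctl -i /dev/sdb | grep -i 'serial number'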

 

Instructions on booting the NAS read-only are on pages 74-75 here:

 

 

 

 

Message 12 of 18
alfb
Aspirant

Re: RN314 problems after Readynas 6.10.7

Hello StephenB,

Yes, the journalctl output seems to show read errors only.

 

The smartctl output from sda, sdc, and sdd is "snappy", finishing moments after pressing enter.

The smartctl output from sdb takes minutes to finish and does indicate a large number of read errors.  The serial number reported by smartctl matches the serial number in the error log.
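For anyone following along, the relevant lines can be pulled out with something like this (the filter is only illustrative; attribute names vary by drive vendor):

  smartctl -x /dev/sdb | grep -iE 'serial number|reallocated|pending|uncorrect'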

 

Re " If the volume is there, but degraded, then you could reboot again normally. "  Since my access to the gui is unreliable, is

  cat /proc/mdstat

a good place to verify "the volume is there"?

 

Regards and thanks, Alf

Message 13 of 18
Sandshark
Sensei

Re: RN314 problems after Readynas 6.10.7

The NAS does an mdadm resync when it does a BTRFS scrub.  Maybe that's what kicked it off.  What does btrfs scrub status /data show?

 

The mdadm sync and the BTRFS scrub truly run at the same time, which causes it to take a long time.

Message 14 of 18
StephenB
Guru

Re: RN314 problems after Readynas 6.10.7


@alfb wrote:

 

Re " If the volume is there, but degraded, then you could reboot again normally. "  Since my access to the gui is unreliable, is

  cat /proc/mdstat

a good place to verify "the volume is there"?

 


The web ui should respond normally after you boot up read-only, as it is pretty clear that the performance issues are related to the disk errors.

 

So one way is to look at the volume page in the web ui when you boot up read-only. 

  1. If the volume is marked as degraded, you can then make sure the data is accessible from a PC.  If it is, just reboot normally
  2. If the volume is marked as "inactive" or dead, then the system is trying to rebuild another disk.  In that situation, you need some form of data recovery.  I don't think that's likely, as the commands we've already run normally will show you what disk is rebuilding.  While this scenario is possible, I don't think it makes sense to wait for the resync to complete.

You could also check with ssh:

  1. cat /proc/mdstat will tell you the status of the raid groups
  2. Examining the data volume will let you know that the data volume is mounted and that the data is accessible - for instance ls /data should show you the shares (if you are using data as the volume name).  Similarly, ls /data/sharename should show the files and folders in the share.  If ls /data doesn't show the shares, then the volume failed to mount.
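A minimal sequence, assuming the volume name is data and using a placeholder share name:

  cat /proc/mdstat
  ls /data
  ls /data/sharename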
Message 15 of 18
alfb
Aspirant

Re: RN314 problems after Readynas 6.10.7

StephenB:  I will proceed with the steps you outlined.

Sandshark: output of btrfs command

root@CTS-NG314:/proc# btrfs scrub status /data
scrub status for 9f66362a-5347-4ccb-9950-77ab32406926
no stats available
total bytes scrubbed: 0.00B with 0 errors
root@CTS-NG314:/proc#

Message 16 of 18
alfb
Aspirant

Re: RN314 problems after Readynas 6.10.7

StephenB and Sandshark,

The replacement of the drive and the rebuild completed successfully around 03:30 am, with an elapsed time of 15 hours.  Access to the GUI was reliable right after the drive replacement.

 

In summary, perusing the log showed the drive had been reporting errors since July, but email notifications were not working, so I missed them!

My inference in the subject line that the upgrade to 6.10.7 was implicated was totally off base!

 

Thank you for your interest and guidance.

Alf

Message 17 of 18
StephenB
Guru

Re: RN314 problems after Readynas 6.10.7


@alfb wrote:

The replacement of the drive and the rebuild completed successfully around 03:30 am, with an elapsed time of 15 hours.  Access to the GUI was reliable right after the drive replacement.

Thx for following up, and I'm glad the problem is resolved.

Message 18 of 18