
OS 6.10.5 CLI management commands (for managing RAID groups)

anwsmh
Aspirant

Are there any CLI commands for managing RAID groups in OS 6 ?

I have just, apparently serendipitously, recovered from the Authentication Loop bug in the management interface (removing the disks and booting with disk 1 only seemed to make it settle down), but it was such a nightmare that it would be good to have an alternative to the web interface.

I am comfortable with the CLI and don't really care about the web interface.

Thank you.

Model: RN31600|ReadyNAS 300 Series 6-Bay (Diskless)
Message 1 of 19


All Replies
StephenB
Guru

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Managing RAID groups can be done with normal mdadm and btrfs commands.
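For example, a few read-only commands to get oriented (the md device name and the /data mount point are just examples here; check what your own system uses):

# Overall state of every md array, including sync/reshape progress
cat /proc/mdstat

# Detailed state of one array (md127 is a placeholder)
mdadm --detail /dev/md127

# The btrfs side of the volume (OS 6 normally mounts the data volume at /data)
btrfs filesystem show
btrfs filesystem usage /data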

 

But that's just the tip of the iceberg.  Backup jobs, share settings, maintenance tasks, power schedules, email alerts, ...  

Message 2 of 19
Sandshark
Sensei

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

The Netgear volume_util in /usr/bin has a lot of options that let you do most of what you can do with volume creation/expansion from the GUI.  But I've not discovered anything to work with shares.
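I don't have the option list to hand, so treat this as a guess rather than documentation: tools like this often print a usage summary when run with no arguments, which is a harmless way to see what it supports:

/usr/bin/volume_util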

Message 3 of 19
anwsmh
Aspirant

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Thank you StephenB.

I expect I am in trouble with this (ticking all the boxes: no web interface, since it's sadly back in the Authentication Loop, and just about zero conceptual knowledge of the ReadyNAS RAID implementation). But since a data reshape is sitting at 2.75% (from the front panel) and top shows load averages of 6.01, 6.08, 6.07 with the md127_raid5 process sitting at 100%, I was looking for alternative ways of knowing what's happening.

Message 4 of 19
Sandshark
Sensei

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Well, that's what you use the mdadm and btrfs commands for.  To see the status of RAIDs, including sync status, use cat /proc/mdstat.  Or if you want a continuous update, use watch cat /proc/mdstat.  This is not unique to the ReadyNAS: mdadm is a standard part of Linux and btrfs is a common addition, so just Googling for standard Linux commands will get you started.
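For example (the 5-second refresh interval is just a suggestion, and md127 is a placeholder for whichever array you care about):

# Re-run mdstat every 5 seconds until you press Ctrl-C
watch -n 5 cat /proc/mdstat

# More detail on one specific array
mdadm --detail /dev/md127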

Message 5 of 19
anwsmh
Aspirant

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Thank you Sandshark.

The cat /proc/mdstat output shows that the sync is happening, but at a rate of 33 kB/sec, so it looks like trying to do stuff (in this case, adding a disk) without adequate preparation will earn its usual reward.

Message 6 of 19
StephenB
Guru

Re: OS 6.10.5 CLI management commands (for managing RAID groups)


@anwsmh wrote:


The cat /proc/mdstat output shows that the sync is happening, but at a rate of 33 kB/sec,


You could query the SMART stats for each disk using smartctl.  Also, you can use journalctl to see if there are any errors that might be slowing the sync.
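Something along these lines (sda is a placeholder; repeat the smartctl call for each member disk):

# SMART health, attributes, and error/self-test logs for one disk
smartctl -a /dev/sda

# Kernel messages from the current boot, errors and worse only
journalctl -k -p err -b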

 

FWIW, I always test my new disks in a PC using vendor tools before adding them to the NAS.  I run the full non-destructive test, and follow that up with the full erase/write zeros test. 

 

I have sometimes had new drives pass the short SMART tests, but fail one of those more extensive tests.

Message 7 of 19
Sandshark
Sensei

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Yeah, that's incredibly slow. What type of drive did you add?  If it's one that's SMR, such as the WD EFAX series, then that's the likely problem, and you are in for more of the same.  But don't just stop anything at this point, even if that is it. Do use the top command to see if it looks like any processes are being held up or stacked up by the sync.

 

I have also seen instances where the NAS has turned off drive write buffering during re-sync, which results in very slow syncs.  It may be supposed to do that only if there is no UPS, but I've seen it in other cases.  Do you have a UPS attached?

 

What result do you get from cat /proc/sys/dev/raid/speed_limit_max and cat /proc/sys/dev/raid/speed_limit_min?  What about hdparm -W /dev/sdX, repeated with X set to each of the drives in the RAID?  If your RAID is across all 6 drives, X is probably somewhere in the range a-f (or a-g), minus the one you replaced.
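Something like this, assuming the members are sda through sde (adjust the letters to whatever your system actually has):

cat /proc/sys/dev/raid/speed_limit_max
cat /proc/sys/dev/raid/speed_limit_min

# Write-cache setting for each member disk
for d in a b c d e; do hdparm -W /dev/sd$d; done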

Message 8 of 19
anwsmh
Aspirant

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Hi Sandshark,

Firstly, I am very grateful for your comments. Things are getting worse: mdstat shows the speed has dropped to 13 kB/sec.

The new drive (added to an existing heterogeneous set of 4) is a WDC WD10EZEX-21M2NA0.
top shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1456 root 20 0 0 0 0 R 100.0 0.0 1279:33 md127_raid5
3189 root 19 -1 2898228 41608 25448 S 1.3 2.0 20:14.24 readynasd
2043 root 20 0 28764 3040 2484 R 0.7 0.1 0:00.30 top
1 root 20 0 204356 7244 5392 S 0.0 0.4 0:06.64 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:01.09 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root 20 0 0 0 0 S 0.0 0.0 0:39.56 rcu_sched
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root rt 0 0 0 0 S 0.0 0.0 0:01.20 migration/0
10 root rt 0 0 0 0 S 0.0 0.0 0:00.41 watchdog/0
11 root rt 0 0 0 0 S 0.0 0.0 0:00.51 watchdog/1
12 root rt 0 0 0 0 S 0.0 0.0 0:01.05 migration/1
13 root 20 0 0 0 0 S 0.0 0.0 0:01.05 ksoftirqd/1
15 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
16 root rt 0 0 0 0 S 0.0 0.0 0:01.21 watchdog/2
17 root rt 0 0 0 0 S 0.0 0.0 0:00.66 migration/2
18 root 20 0 0 0 0 S 0.0 0.0 0:01.17 ksoftirqd/2
20 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
21 root rt 0 0 0 0 S 0.0 0.0 0:00.64 watchdog/3
22 root rt 0 0 0 0 S 0.0 0.0 0:00.90 migration/3

root@nas-E7-02-2C:~# cat /proc/sys/dev/raid/speed_limit_max
200000
root@nas-E7-02-2C:~# cat /proc/sys/dev/raid/speed_limit_min
30000


There is no UPS.



hdparm -W on the 5 members of the RAID group (a-e) shows write caching on.

Message 9 of 19
anwsmh
Aspirant

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Hi StephenB,

Thank you for your encouragement and assistance (and the sound advice about preparation).

As far as I can tell, smartctl shows a handful of errors (<=9) on one device (not the new addition) and nothing on the new disc ("No errors logged"). The device logging 9 errors does so at power-on, and the errors relate to various disk blocks (I think; only two LBAs are repeated). All devices report healthy.

I am unfamiliar with journalctl (when I run journalctl --system -o short -r, what is mainly obvious is HTTP errors from Frontview).




Message 10 of 19
anwsmh
Aspirant

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Hi Sandshark,

Thanks for your interest and encouragement on what looks like a very bad trip down "stupidity lane".

The transfer rate has dropped to 9 kB/sec, with an estimated duration of a couple of months.

In fact, without the Frontview interface I am hosed, since I don't think I can get the data off it. (It goes without saying that a box this sort of thing happens to, regardless of the stupidity of the owner, is a liability and not an asset. But that's a problem for the far distant future.)

There were 4 devices in the RAID set (sda-d); I have added one (a WDC WD10EZEX) to a heterogeneous set.

hdparm shows all have write cache enabled.

Here's the top output,

top - 18:54:05 up 1 day, 8:41, 1 user, load average: 8.00, 8.00, 7.72
Tasks: 230 total, 2 running, 228 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 25.1 sy, 0.0 ni, 24.9 id, 50.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 2032668 total, 1289520 used, 743148 free, 6460 buffers
KiB Swap: 1305596 total, 0 used, 1305596 free. 405160 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1456 root 20 0 0 0 0 R 99.7 0.0 1869:53 md127_raid5
9404 root 20 0 28764 3040 2492 R 0.3 0.1 0:00.08 top
1 root 20 0 204356 7244 5392 S 0.0 0.4 0:06.90 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.05 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:01.11 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:+
7 root 20 0 0 0 0 S 0.0 0.0 0:49.75 rcu_sched
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root rt 0 0 0 0 S 0.0 0.0 0:01.25 migration/0
10 root rt 0 0 0 0 S 0.0 0.0 0:00.56 watchdog/0
11 root rt 0 0 0 0 S 0.0 0.0 0:00.72 watchdog/1
12 root rt 0 0 0 0 S 0.0 0.0 0:01.09 migration/1
13 root 20 0 0 0 0 S 0.0 0.0 0:01.07 ksoftirqd/1
15 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:+
16 root rt 0 0 0 0 S 0.0 0.0 0:01.68 watchdog/2
17 root rt 0 0 0 0 S 0.0 0.0 0:00.67 migration/2
18 root 20 0 0 0 0 S 0.0 0.0 0:01.18 ksoftirqd/2

Any advice warmly received.

Thank you.

Message 11 of 19
StephenB
Guru

Re: OS 6.10.5 CLI management commands (for managing RAID groups)


@anwsmh wrote:


As far as I can tell, smartctl shows a handful of errors (<=9) on one device (not the new addition) and nothing on the new disc ("No errors logged"). The device logging 9 errors does so at power-on, and the errors relate to various disk blocks (I think; only two LBAs are repeated). All devices report healthy.


Probably worth following up on later (and maybe use smartctl or the web UI to test the disks).  But it doesn't sound like it is related to the sync speed issue.

 

But I think @Sandshark's thought is worth pursuing.  The WD20EFAX, WD30EFAX, WD40EFAX, and WD60EFAX are SMR drives, and if you are using them you will see highly variable write times and very long sync times.  There are some desktop drives (and shucked USB drives) that are also SMR (from both Seagate and Western Digital).

Message 12 of 19
anwsmh
Aspirant

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Thanks for everyone's patience with this thread.

root@nas-E7-02-2C:~# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md123 : active raid1 sde7[1] sdc7[0]
175683264 blocks super 1.2 [2/2] [UU]
resync=DELAYED

md1 : active raid10 sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
1305600 blocks super 1.2 512K chunks 2 near-copies [5/5] [UUUUU]

md124 : active raid5 sde6[3] sdd6[2] sdc6[1]
68307072 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
resync=DELAYED

md125 : active raid5 sde5[4] sdd5[3] sda5[0] sdc5[2]
331833472 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
resync=DELAYED

md126 : active raid5 sde4[5] sdd4[4] sda4[0] sdc4[3]
39041280 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
resync=DELAYED

md127 : active raid5 sde3[6] sdd3[5] sda3[0] sdc3[4] sdb3[1]
161586624 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
[====>................] reshape = 24.7% (13305472/53862208) finish=165179.5min speed=4K/sec

md0 : active raid1 sde1[8] sdd1[7] sda1[0](W) sdc1[6](W) sdb1[5]
4190208 blocks super 1.2 [5/5] [UUUUU]

unused devices: <none>
root@nas-E7-02-2C:~#


I think this means that the reshape will finish (at 4 kB/sec) in about 115 days. Maybe other resyncs will happen after it.
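That figure is just the finish= value from mdstat converted to days:

165179.5 min ÷ 60 min/h ÷ 24 h/day ≈ 114.7 days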

The data on this NAS (on discs sda-d) is only personal, but I don't want to have to wait 4 months to access it (I currently have no access to the shares the NAS exports).

The data on the new drive was junk (the drive was added in "safe mode" after only a format; it was not added to a volume group, since "New volume" in Frontview was greyed out).

Are there any safe options for getting access to the data on the NAS? Can I pull sde?

Message 13 of 19
anwsmh
Aspirant

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

My long posts (with top and mdstat) appear to be rejected or filtered by the posting application.

So, short and sweet: mdstat tells me that the reshape is motoring along at 4 kB/sec and will finish in about 4 months (roughly 120 days).

My data is only personal, but I can't live without it for so long (or at least it will be a hardship; already resetting passwords because I can't access my escrow data, and saying goodbye to iTunes music and personal photos, is unpleasant).

The new disk was added in "safe" mode and then formatted but not added to a volume group (the "New volume" button was grayed out), so I don't know whether it's copying data from the new disk to the former (4) disks or replicating the former disks' data onto the new one.

Can I rip the new disk out, in safe mode, and expect to get back a working NAS?

Are there any other options besides waiting so long?

Thank you for your patience with this idiotic thread (yes, I am the idiot).

Message 14 of 19
StephenB
Guru

Re: OS 6.10.5 CLI management commands (for managing RAID groups)


@anwsmh wrote:

My long posts (with top and mdstat) appear to be rejected or filtered by the posting application.


There is an automatic spam filter that was triggered.  The mods check the quarantine periodically, but it can take a while.  I can also release posts, so send me a PM if it happens again.

 


@anwsmh wrote:


So, short and sweet: mdstat tells me that the reshape is motoring along at 4 kB/sec and will finish in about 4 months (roughly 120 days).


It'd be very helpful if you told us the manufacturer and model of the disk you inserted.

 


@anwsmh wrote:


The new disk was added in "safe" mode 


Why are you using safe mode?  Normally you are only in that mode when something is wrong.

Message 15 of 19
anwsmh
Aspirant

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

The new disc is a Western Digital PC grade WD10EZEX-21M2NA0.

I added it in what I thought was safe mode (but probably isn't; I got it into this mode by booting with only disc 1 inserted) because I have no access to the Web UI, and I was foolish enough to think that adding a disc to a RAID group was unlikely to cause problems, that working around a problem was unlikely to lead to other problems, and so on.

No mitigating circs.

Message 16 of 19
StephenB
Guru

Re: OS 6.10.5 CLI management commands (for managing RAID groups)


@anwsmh wrote:

The new disc is a Western Digital PC grade WD10EZEX-21M2NA0.


OK, that is CMR, not SMR.  FWIW, I recommend sticking with CMR NAS-purposed disks (or enterprise-class ones).  So I would have gone with either a Seagate IronWolf (ST1000VN002) or a WDC Red Plus (WD10EFRX).

 


@anwsmh wrote:


I added it in what I thought was safe mode (but probably isn't; I got it into this mode by booting with only disc 1 inserted) because I have no access to the Web UI


If one disk was inserted, the system would have booted up in normal mode (from disk 1).  You wouldn't have been able to access the volume, since you wouldn't have had enough disks to mount the RAID array.  I'm guessing you then powered down, inserted the other disks, and powered up?

 

The best way to have proceeded with the disk replacement was to hot-swap the disk (with the NAS running), or to hot-insert it (if you are adding a disk to an empty bay).  What you actually did should have worked (assuming you did power down to reinsert the missing disks), but IMO wasn't that safe.

 

I am puzzled as to how you got access to the web UI with only one disk installed, though there isn't a clear understanding of what causes the log-in loop with 6.10.5.  It would have been better to overcome the log-in loop before replacing a disk.  But of course that is in the past.

 

Do you have a backup of the data? 

 

If not, are you able to get access to the files from Windows File Explorer (although it could be slow)?

Message 17 of 19
anwsmh
Aspirant

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Thank you for your helpful remarks, StephenB.

I am giving some attention to replacing the discs with more suitable models.

In the meantime, since it's manifest that I don't know what I am doing, I have bought Netgear support, who will certainly do a better job of getting me out of this mess than I would.

Thank you.

Message 18 of 19
Sandshark
Sensei

Re: OS 6.10.5 CLI management commands (for managing RAID groups)

Wow, calling that set of drives heterogeneous is an understatement.  You have a lot of RAID groups because of the multiple drive sizes.  Yes, the other RAIDs will sync one at a time as each finishes, but I rather think that point may never come.  I don't think the variety of drive sizes is a factor here, but the fact that it will take that long just to re-sync the 1TB group is an indication that something is very wrong.  Normally, the other groups wouldn't need a re-sync, since the new drive isn't a part of them.  I think that means that booting with just one drive had an effect, and that might be at least a part of the problem.

 

top shows the RAID process is taking 100% of the CPU, but the companion kworker processes don't seem to be doing much.  I'm not sure why that is, but it's bad.  Unless one of the original drives is SMR, my suspicion is that one of the drives is failing.  Removing a drive at this point is really a bad idea, though, even if it is the bad one.

 

Since you didn't mention having a similar issue the last time you added or swapped a drive, I doubt one of the others is SMR.  But if you know the brands and models (which are in disk_info.log in the log .zip file, if you have a saved one), it would be good to check.
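If there's no saved log zip, smartctl can report the model directly (sda as an example; repeat per disk):

# The identity section includes the model number and (often) the model family
smartctl -i /dev/sda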

 

Getting Netgear assistance does seem your best bet.  If they can slow down the sync even more, then there will be enough CPU left for SMB or the GUI and backup process to run, so you can get files off the NAS.  You really need to do that, because the only way I think you'll ever fix this is to test all the drives, remove the bad one (I think there has to be one), and do a factory default.
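For what it's worth, the standard md knob for throttling a sync is the speed_limit pair you already looked at; the values are in KB/sec and reset at reboot.  Whether lowering them actually frees up CPU in this broken state is untested, so I'd let Netgear drive:

# Cap resync/reshape throughput (example values only)
echo 1000 > /proc/sys/dev/raid/speed_limit_max
echo 100 > /proc/sys/dev/raid/speed_limit_min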

 

BTW, if you use the CODE INSERTION tool (its icon looks like </>) when you post text grabbed from SSH, it will be formatted in a non-proportional font, which makes it a lot easier to read.

Message 19 of 19