Forum Discussion
dstsui
Apr 28, 2021 Aspirant
RN102 broken after OS 6.10.4 update
My RN102 appears to be broken after updating to 6.10.4. Not sure which came first; I updated the OS to 6.10.4 and noticed very slow transfer speeds when I copied hundreds of JPEG images from the SD car...
StephenB
Apr 29, 2021 Guru - Experienced User
Lowest risk would be to add a step to back up the data on the NAS (for instance, to a USB drive). This could be done before step 1, but it will likely go faster between steps 2 and 3.
dstsui wrote:
Please confirm the following action plan is sound with low risk:
- Remove Disk 1 - should I remove the disk when the NAS is live or is it safer to remove it when it is powered down?
It doesn't really matter - but you probably should confirm that the NAS boots normally with only disk 2 in place. So maybe power down.
dstsui wrote:
3. Power down the NAS and insert Disk 1 with a new HDD into the bay.
No need to power down here. Hot-insertion is fine.
Generally I test new disks before I add them to the NAS. What I do is test them with vendor diags in a Windows PC - running the full non-destructive "long" test, followed by a full erase/write zeros test. I've received some new disks that pass one of those tests but fail the other.
dstsui wrote:
Some HDD docks have two disk slots and can do disk cloning. Is it okay to clone a disk from Disk 2 using the dock, or is it better to leave this to the NAS?
I wouldn't. If you run the NAS with only disk 2 in place, you still need to do the resync. Cloning is sometimes a necessary part of data recovery - but if there are sectors that can't be read, there will be some garbage on the clone. Since there are no read errors on the new good disk, the NAS has no way to tell which sectors are garbage.
dstsui wrote:
How come the attached disk_info log shows no disk errors but the kernel log logged disk errors? Does Channel 0 correspond to Disk 1?
That is an excellent question. You'd think it would show up in the SMART stats. But I've seen this before, so I know errors don't always end up in those stats.
SMART stats are kept by the disk itself - the NAS just asks the disk for them. So this is either a limitation in SMART itself, or something not quite right in the disk's statistics collection.
The disk is over 6 years old, so you have gotten your money's worth out of it.
dstsui
Apr 29, 2021 Aspirant
The Volume log has entries for both RAID 0 and RAID 1. What does that mean? My understanding is that RAID 0 offers no redundancy and both disks are required to operate. If disk 1 is removed, does that mean the system will fail to boot with only disk 2?
- mdgm Apr 29, 2021 Virtuoso
mdstat.log shows info on the RAID levels more concisely.
md0 is the 4GB root volume and uses RAID-1. The system will still boot if a disk is removed; however, any RAID-0 data volume using that disk won't be able to be mounted, as the RAID array won't be able to be started.
- dstsui Apr 29, 2021 Aspirant
Based on the logs, do you think the data in disk 1 is recoverable using WD disk utility? Can you tell which Share will be affected by the loss of disk 1?
- dstsui Apr 29, 2021 Aspirant
The following is extracted from mdstat:
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md127 : active raid1 sda3[0] sdb3[1]
      1948662784 blocks super 1.2 [2/2] [UU]
      bitmap: 0/15 pages [0KB], 65536KB chunk
md1 : active raid1 sda2[0] sdb2[1]
      523712 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
      4190208 blocks super 1.2 [2/2] [UU]
If md127 is the data volume (size appears correct), it is RAID1. RAID0 does not appear in the log.
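The RAID level of each array can be read mechanically from output like this. A minimal sketch in Python, assuming mdstat-style text as input (the parser is my own illustration, not a ReadyNAS tool):

```python
import re

def raid_levels(mdstat_text):
    """Map each md device in /proc/mdstat-style output to its RAID personality."""
    levels = {}
    for line in mdstat_text.splitlines():
        # Array lines look like: "md127 : active raid1 sda3[0] sdb3[1]"
        m = re.match(r"(md\d+) : active (\w+)", line.strip())
        if m:
            levels[m.group(1)] = m.group(2)
    return levels

sample = """\
md127 : active raid1 sda3[0] sdb3[1]
md1 : active raid1 sda2[0] sdb2[1]
md0 : active raid1 sda1[0] sdb1[1]
"""
print(raid_levels(sample))  # {'md127': 'raid1', 'md1': 'raid1', 'md0': 'raid1'}
```

Run against the three array lines above, it reports raid1 for every array, matching the conclusion that no RAID-0 array is present.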
- dstsui Apr 29, 2021 Aspirant
/dev/md/data-0:
        Version : 1.2
  Creation Time : Sun Jan 11 06:16:09 2015
     Raid Level : raid1
     Array Size : 1948662784 (1858.39 GiB 1995.43 GB)
  Used Dev Size : 1948662784 (1858.39 GiB 1995.43 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Thu Apr 29 07:28:54 2021
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
Consistency Policy : unknown
           Name : 0e361c14:data-0 (local to host 0e361c14)
           UUID : 93079119:a9de47f1:0351ef69:148df03c
         Events : 5138

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
If md127 is indeed the data volume, then everything looks good to me.
- mdgm Apr 30, 2021 Virtuoso
Yes, md127 is the data volume and in your case it's using RAID-1.
If there were multiple volumes or multiple RAID layers for a volume then you'd see other RAID arrays like md126 etc.
md0 is the 4GB root volume
md1 is the swap
Note that with RAID-1 you can still have issues like filesystem corruption, accidental file deletion etc.
RAID doesn't replace the need for backups. No important data should be stored on just the one device.
- StephenB Apr 30, 2021 Guru - Experienced User
dstsui wrote:
Based on the logs, do you think the data in disk 1 is recoverable using WD disk utility?
I don't recommend trying that. In my experience such "recovery" is more of a bandaid than a cure. When I've tried it, the disk very quickly starts throwing errors again.
Another aspect is that you are experiencing read errors. The disk utility can't preserve the data if it can't read the disk. So the volume would still need to be resynced, otherwise you won't have a proper mirror.
A replacement drive would cost ~$80 (current USD pricing). IMO attempting a "repair" isn't worth the risk.
But you are jumping ahead. The next step is to test the disk. I suspect it will fail.
dstsui wrote:
Can you tell which Share will be affected by the loss of disk 1?
Your volume is RAID-1 (mirrored disks), so there should be no data loss. All shares will lose redundancy when you remove disk 1 (since there will be no mirror). The volume status will change to "degraded" to reflect that.
But you've actually already lost some redundancy, due to the read errors. The logs show that the system is relying on disk 2 to get your data (the log shows the read for disk 1 failing, and the RAID system failing over to read from disk 2). If disk 2 were to fail, there would be at least some data loss.
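The failover behavior described above can be illustrated with a toy model of a two-way mirror (purely illustrative; real md RAID-1 works at the block layer, not like this):

```python
def mirror_read(sector, disks):
    """Try each mirror in turn; return data from the first disk that can read the sector."""
    for name, disk in disks.items():
        data = disk.get(sector)  # None models an unreadable (bad) sector
        if data is not None:
            return name, data
    raise IOError(f"sector {sector} unreadable on all mirrors")

# disk1 has a bad sector 7; disk2 still holds a good copy, so the read succeeds
disk1 = {5: b"A", 7: None}
disk2 = {5: b"A", 7: b"B"}
print(mirror_read(7, {"disk1": disk1, "disk2": disk2}))  # ('disk2', b'B')
```

This is the situation in the logs: reads from disk 1 fail, the RAID layer falls back to disk 2, and data loss only occurs if disk 2 also fails.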
dstsui wrote:
The Volume log has entries for both RAID 0 and RAID 1.
FYI, you are misreading the log. Volume.log simply numbers the RAID arrays starting from 0. So that entry isn't telling you the RAID mode - it's just a label. A few lines after that label you should see
Level: 1
That tells you the RAID mode is RAID-1.
- dstsui Apr 30, 2021 Aspirant
Removed disk 1 after powering down the NAS, and it boots up fine with just disk 2.
I ran NASTester on all Shares. The Share with BitRot protection enabled measured 25MB/s. The remaining Shares with BitRot protection disabled measured 36MB/s on average. Is this speed normal at all? I think it is under par even with BitRot protection off. If so, what else could be wrong? No problem reading or writing multiple files, though. Is BitRot protection worth the sacrifice in speed? 25MB/s is a problem when I access photos in my photo management software.
I will test disk 1 tomorrow.
- StephenB Apr 30, 2021 Guru - Experienced User
dstsui wrote:
I ran NASTester on all Shares. The Share with BitRot protection enabled measured 25MB/s. The remaining Shares with BitRot protection disabled measured 36MB/s on average. Is this speed normal at all?
Both speeds sound low to me. Are you testing over wifi or gigabit ethernet? On gigabit you can get ~70 MB/s read and ~50 MB/s write.
There were some improvements made to the BTRFS setup some years ago. But to take advantage of them you'd need to do a factory reset, rebuild the NAS, and restore all the data from backup. I tested that some time ago, and it did make a significant difference on my RN102.
dstsui wrote:
Is BitRot protection worth the sacrifice in speed?
Up to you really. I do leave it on my main NAS (an RN526x), but I've never had bitrot protection kick in and recover a file. But it's not an RN100 series, and even with bitrot on, it can saturate a fast gigabit connection.
- dstsui Apr 30, 2021 Aspirant
StephenB wrote:
dstsui wrote:
I ran NASTester on all Shares. The Share with BitRot protection enabled measured 25MB/s. The remaining Shares with BitRot protection disabled measured 36MB/s on average. Is this speed normal at all?
Both speeds sound low to me. Are you testing over wifi or gigabit ethernet? On gigabit you can get ~70 MB/s read and ~50 MB/s write.
Testing was done over GigE. I confirmed the LAN hardware is sound by using a direct Cat 6e cable connection (no router or switch) between the NAS and PC, and verified with another PC over 5GHz WiFi; the speeds in both tests were about the same. It was somewhat lower over WiFi, but this was expected. I concluded the common element is the NAS, which is the likely culprit. What issues could possibly drag the speed down by that much? Are there any clues in the logs?
A backup of the NAS was run overnight and completed without issues.
I ran the WD Dashboard Diagnostic Extended Test on disk 1 and it failed with error code 2. Don't know what that means.
I will install a new disk into the NAS today after checking it with WD Dashboard.
- dstsui Apr 30, 2021 Aspirant
StephenB wrote:
dstsui wrote:
I ran NASTester on all Shares. The Share with BitRot protection enabled measured 25MB/s. The remaining Shares with BitRot protection disabled measured 36MB/s on average. Is this speed normal at all?
There were some improvements made to the BTRFS setup some years ago. But to take advantage of them you'd need to do a factory reset, rebuild the NAS, and restore all the data from backup. I tested that some time ago, and it did make a significant difference on my RN102.
When was BTRFS introduced? I have had the RN102 for 6 years; I should have checked the speed when I first installed it, and I definitely would not have accepted it if the speed was this slow back then, since it is not a lot better than my old ReadyNAS Duo. BTRFS is unlikely to be the real issue. We are talking about a 60% reduction in speed, so something fundamental must be wrong.
- dstsui Apr 30, 2021 Aspirant
I disabled BTRFS on all shares and with no background process running, I got the following test results on one of the Shares:
NAS performance tester 1.7 http://www.808.dk/?nastester
Running warmup...
Running a 400MB file write on R: 5 times...
Iteration 1: 52.86 MB/sec
Iteration 2: 58.10 MB/sec
Iteration 3: 56.61 MB/sec
Iteration 4: 52.46 MB/sec
Iteration 5: 58.37 MB/sec
-----------------------------
Average (W): 55.68 MB/sec
-----------------------------
Running a 400MB file read on R: 5 times...
Iteration 1: 33.95 MB/sec
Iteration 2: 34.29 MB/sec
Iteration 3: 30.59 MB/sec
Iteration 4: 29.39 MB/sec
Iteration 5: 33.50 MB/sec
-----------------------------
Average (R): 32.34 MB/sec
The write speed has improved significantly but the read speed has not improved. Any clues?
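As a sanity check, the averages that NAS performance tester prints are just the arithmetic mean of the five iterations:

```python
# Iteration results from the test run above (MB/sec)
write = [52.86, 58.10, 56.61, 52.46, 58.37]
read = [33.95, 34.29, 30.59, 29.39, 33.50]

def avg(samples):
    """Mean of the samples, rounded to two decimals as the tool reports it."""
    return round(sum(samples) / len(samples), 2)

print(avg(write))  # 55.68
print(avg(read))   # 32.34
```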
- dstsui Apr 30, 2021 Aspirant
dstsui wrote:
I disabled BTRFS on all shares and with no background process running, I got the following test results on one of the Shares:
NAS performance tester 1.7 http://www.808.dk/?nastester
Running warmup...
Running a 400MB file write on R: 5 times...
Iteration 1: 52.86 MB/sec
Iteration 2: 58.10 MB/sec
Iteration 3: 56.61 MB/sec
Iteration 4: 52.46 MB/sec
Iteration 5: 58.37 MB/sec
-----------------------------
Average (W): 55.68 MB/sec
-----------------------------
Running a 400MB file read on R: 5 times...
Iteration 1: 33.95 MB/sec
Iteration 2: 34.29 MB/sec
Iteration 3: 30.59 MB/sec
Iteration 4: 29.39 MB/sec
Iteration 5: 33.50 MB/sec
-----------------------------
Average (R): 32.34 MB/sec
The write speed has improved significantly but the read speed has not improved. Any clues?
These results above were obtained with Flow Control on the PC NIC disabled. With Flow Control (Tx and Rx) enabled, I got the following results:
Running warmup...
Running a 400MB file write on R: 5 times...
Iteration 1: 30.75 MB/sec
Iteration 2: 30.22 MB/sec
Iteration 3: 30.86 MB/sec
Iteration 4: 31.57 MB/sec
Iteration 5: 30.84 MB/sec
-----------------------------
Average (W): 30.85 MB/sec
-----------------------------
Running a 400MB file read on R: 5 times...
Iteration 1: 33.76 MB/sec
Iteration 2: 34.52 MB/sec
Iteration 3: 34.29 MB/sec
Iteration 4: 33.39 MB/sec
Iteration 5: 33.89 MB/sec
-----------------------------
Average (R): 33.97 MB/sec
I also spotted the following entries in the kernel log around the time when I ran the first test with flow control off:
May 01 12:10:43 nas-36-1C-14 kernel: mvneta d0074000.ethernet eth0: bad rx status 0e830000 (overrun error), size=1024
May 01 12:10:43 nas-36-1C-14 kernel: mvneta d0074000.ethernet eth0: bad rx status 0e830000 (overrun error), size=512
May 01 12:10:43 nas-36-1C-14 kernel: mvneta d0074000.ethernet eth0: bad rx status 0e830000 (overrun error), size=128
May 01 12:23:58 nas-36-1C-14 kernel: nr_pdflush_threads exported in /proc is scheduled for removal
- mdgm May 01, 2021 Virtuoso
What does your initrd.log look like in the logs zip? That will show the firmware update history since the last factory reset of your system.
BTRFS has always been used as the filesystem for the data volume for OS6 and the filesystem for the root volume on the x86 NAS units (the 102 uses EXT4 for the root volume).
Bit-rot protection was added back in 6.2.x.
If you ever used bit-rot protection or snapshots, disabling bit-rot protection isn't going to undo all the CoW that has already happened. It will just limit what happens going forward.
There are things you could check, like CPU usage and whether the system is using a lot of swap. Also turn off services and apps that you don't need.
Over time, as updates are released, software may consume more resources, and the RN100 series is the least powerful of the systems that run OS6, so any performance hit will be most noticeable on those units.
If you have a bad disk, that can impact performance. Running with only one disk can also impact performance, and if your NAS is resyncing the RAID after adding a replacement disk, performance will be reduced until the resync completes.
- StephenB May 01, 2021 Guru - Experienced User
IMO you should be careful to tackle one issue at a time. So wait to address the speed issue until you have a full RAID-1 volume back.
I also suggest making a backup of your files before you add the replacement disk back. The rebuild of the RAID will stress disk 2 (every sector is read and copied to the replacement to create the mirror). If disk 2 fails during the process, you will lose your data.
dstsui wrote:
When was BTRFS introduced?
Your NAS has always used BTRFS.
My own RN102 goes back to 2013. By 2016 the read speed had dropped to about 35 MB/sec. This was using jbod volumes (so bitrot protection was off, since it needs RAID redundancy). That speed sounds very similar to yours.
A factory reset restored the read speed to around 70 MB/s. I posted the results here, but it was a long time ago, and search isn't finding them. I did another benchmark in 2019, checking all three versions of SMB. The second test was using different disks, and the NAS was set up for XRAID/RAID-1 - not jbod. SMB 2.1 speeds were still around 70 MB/s, though SMB 3 was somewhat slower (63 MB/sec). https://community.netgear.com/t5/Using-your-ReadyNAS-in-Business/Average-transfer-speed-read-and-write-plummeted-after-replacing/m-p/1718589#M178329
So based on my own experience, a factory reset will almost certainly increase your speeds substantially. Yes, it is painful, but the performance increase is likely worth it.
Other things you can try:
- Disable strict sync in the SMB settings
- Disable IPv6 in your NAS network settings. In some cases the PC will use IPv6 link-local addresses for data transfer, and that can be slower (especially if your router doesn't have IPv6 enabled).
- Disable all services you don't actually need (and uninstall any apps you no longer use). The RN102 doesn't have much memory, and over the years the memory footprint of the firmware has increased. Removing services and apps can help mitigate that.
- Run the maintenance functions on the volume settings wheel. Balance and Scrub can both help improve performance. Personally I schedule one maintenance function per month - completing the cycle every four months (balance, defrag, scrub, disk test)
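The rotation in the last bullet can be sketched as a simple schedule generator (the task names are the volume maintenance functions listed above; the code itself is just an illustration):

```python
from itertools import cycle

# One maintenance function per month, completing the cycle every four months
TASKS = ["balance", "defrag", "scrub", "disk test"]

def schedule(months):
    """Assign one maintenance task per month, cycling through the task list."""
    tasks = cycle(TASKS)
    return [(m, next(tasks)) for m in months]

print(schedule(["May", "Jun", "Jul", "Aug", "Sep"]))
# [('May', 'balance'), ('Jun', 'defrag'), ('Jul', 'scrub'), ('Aug', 'disk test'), ('Sep', 'balance')]
```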
- StephenB May 01, 2021 Guru - Experienced User
While I get your desire to solve your original speed problem, I still suggest taking things one step at a time. Data safety comes first, performance comes second.
If I were in your position now (waiting for the new disk to be inserted and the resync to complete), I'd be making sure I had an up-to-date backup. Once that was taken care of, I'd run the disk test from the volume settings wheel to confirm that disk 2 doesn't have any issues. The disks are about the same age, and it isn't unusual for multiple disks in a RAID array to fail in rapid succession.
dstsui wrote:
I also spotted the following entries in the kernel log around the time when I ran the first test with flow control off:
May 01 12:10:43 nas-36-1C-14 kernel: mvneta d0074000.ethernet eth0: bad rx status 0e830000 (overrun error), size=1024
May 01 12:10:43 nas-36-1C-14 kernel: mvneta d0074000.ethernet eth0: bad rx status 0e830000 (overrun error), size=512
May 01 12:10:43 nas-36-1C-14 kernel: mvneta d0074000.ethernet eth0: bad rx status 0e830000 (overrun error), size=128
May 01 12:23:58 nas-36-1C-14 kernel: nr_pdflush_threads exported in /proc is scheduled for removal
RX overruns happen when the NAS isn't keeping up with the transfer requests. So it's not surprising you'd see them with flow control off, since your NAS isn't capable of transferring at full gigabit speed. The NAS uses flow control requests to pause the traffic when its receive buffers start to fill. The router will then hold the frames (and at some point will pause the traffic from the PC).
Note that no traffic will be lost due to the overruns - the NAS will request retransmission. TCP (network) speeds will drop though, as the network protocols are designed to back off speeds when any loss occurs.
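The back-off StephenB describes can be sketched with a toy additive-increase/multiplicative-decrease model, a drastic simplification of real TCP congestion control (the numbers here are arbitrary, not measurements):

```python
def aimd(events, rate=10.0, increase=1.0, decrease=0.5):
    """Track a sending rate: grow additively on success, halve on a loss
    (e.g. a packet dropped because of an RX overrun)."""
    history = []
    for ev in events:
        rate = rate * decrease if ev == "loss" else rate + increase
        history.append(rate)
    return history

# A single loss event cuts the rate in half; recovery is only gradual
print(aimd(["ok", "ok", "loss", "ok"]))  # [11.0, 12.0, 6.0, 7.0]
```

This is why the overruns themselves lose no data (retransmission recovers it) but still drag the measured throughput down.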
dstsui wrote:
I disabled BTRFS on all shares and with no background process running, I got the following test results on one of the Shares:
You must mean bitrot, since it is impossible to disable btrfs (it is the file system).
- dstsui May 01, 2021 Aspirant
StephenB wrote:
IMO you should be careful to tackle one issue at a time. So wait to address the speed issue until you have a full RAID-1 volume back.
I also suggest making a backup of your files before you add the replacement disk back. The rebuild of the RAID will stress disk 2 (every sector is read and copied to the replacement to create the mirror). If disk 2 fails during the process, you will lose your data.
I already made a backup and it is now going through the array rebuild process. Fingers crossed.
Will certainly try your other suggestions. Not sure I have the appetite to start from scratch and reinstall the OS; too much work. If I can get to, say, 50MB/s write and 40MB/s read, I will be happy.
- StephenB May 01, 2021 Guru - Experienced User
dstsui wrote:
Not sure I have the appetite to start from scratch and reinstall the OS; too much work. If I can get to, say, 50MB/s write and 40MB/s read, I will be happy.
FWIW, I did it fairly recently on my RN202 (not for performance reasons). I had about 12 TB of data to reload.
The factory reset itself doesn't take that much of your time unless you have a lot of apps you need to track down and reinstall. Reloading data takes more time, but you really don't have that much. The main issue for me is that the NAS isn't available until the process completes, though the RN202 is exclusively for backup, so it's not my main NAS.
- dstsui May 01, 2021 Aspirant
I backed up the contents of Disk 2 before inserting a new disk (quick-tested with WD Dashboard) into bay 1 at 1:26pm. The system completed the rebuild at 8:25pm. Should I be concerned about this message in the system log?
snapperd[12524]: loading 2227 failed
This appears regularly in the log, not just after the rebuild. Nothing of significance appears in the kernel log (second window) during or after the rebuild.
I initiated a Disk Test after the rebuild completed and am now awaiting results. Is it worth taking Disk 2 out and running an extended diagnostic test with WD Dashboard just to be safe? Do you think I should replace Disk 2 as well, even if it passes the test?
May 01 20:25:49 nas-36-1C-14 mdadm[1926]: RebuildFinished event detected on md device /dev/md127, component device recovery
May 01 20:25:50 nas-36-1C-14 mdadm[1926]: SpareActive event detected on md device /dev/md127, component device /dev/sdc3
May 01 20:28:59 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): +0.000579 sec
May 01 20:46:04 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): -0.000303 sec
May 01 21:03:09 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): +0.000041 sec
May 01 21:06:26 nas-36-1C-14 dbus[1921]: [system] Activating service name='org.opensuse.Snapper' (using servicehelper)
May 01 21:06:26 nas-36-1C-14 dbus[1921]: [system] Successfully activated service 'org.opensuse.Snapper'
May 01 21:06:26 nas-36-1C-14 snapperd[12524]: loading 2227 failed
May 01 21:06:26 nas-36-1C-14 snapperd[12524]: loading 2227 failed
May 01 21:06:27 nas-36-1C-14 snapperd[12524]: loading 2227 failed
May 01 21:06:27 nas-36-1C-14 snapperd[12524]: loading 2227 failed
May 01 21:06:27 nas-36-1C-14 snapperd[12524]: loading 2227 failed
May 01 21:17:01 nas-36-1C-14 CRON[13015]: pam_unix(cron:session): session opened for user root by (uid=0)
May 01 21:17:01 nas-36-1C-14 CRON[13016]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 01 21:17:01 nas-36-1C-14 CRON[13015]: pam_unix(cron:session): session closed for user root
May 01 21:20:14 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): -0.000513 sec
May 01 21:22:53 nas-36-1C-14 apache2[10157]: [:error] [pid 10157] [client 192.168.1.20:55549] mod_csrf(020): request denied, expired id (12 sec), action=deny, id=-, referer: https://192.168.1.29/admin/index.html
May 01 21:22:53 nas-36-1C-14 apache_access[10155]: 192.168.1.20 "POST /dbbroker HTTP/1.1" 403
May 01 21:22:53 nas-36-1C-14 apache_access[10155]: 192.168.1.20 "GET /admin/index.html HTTP/1.1" 401
May 01 21:22:53 nas-36-1C-14 apache2[10156]: pam_unix(frontview:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=192.168.1.20 user=admin
May 01 21:22:57 nas-36-1C-14 apache2[10380]: [auth_basic:error] [pid 10380] [client 192.168.1.20:55552] AH01617: user admin: authentication failure for "/admin/index.html": Password Mismatch, referer: https://192.168.1.29/admin/index.html
May 01 21:37:19 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): -0.000147 sec
May 01 21:54:24 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): +0.000103 sec
May 01 22:06:33 nas-36-1C-14 dbus[1921]: [system] Activating service name='org.opensuse.Snapper' (using servicehelper)
May 01 22:06:33 nas-36-1C-14 dbus[1921]: [system] Successfully activated service 'org.opensuse.Snapper'
May 01 22:06:34 nas-36-1C-14 snapperd[14319]: loading 2227 failed
May 01 22:06:34 nas-36-1C-14 snapperd[14319]: loading 2227 failed
May 01 22:06:34 nas-36-1C-14 snapperd[14319]: loading 2227 failed
May 01 22:06:34 nas-36-1C-14 snapperd[14319]: loading 2227 failed
May 01 22:06:34 nas-36-1C-14 snapperd[14319]: loading 2227 failed
May 01 22:11:29 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): -0.000351 sec
May 01 22:11:54 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): -0.000969 sec
May 01 22:17:02 nas-36-1C-14 CRON[14952]: pam_unix(cron:session): session opened for user root by (uid=0)
May 01 22:17:02 nas-36-1C-14 CRON[14953]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 01 22:17:02 nas-36-1C-14 CRON[14952]: pam_unix(cron:session): session closed for user root
May 01 22:22:35 nas-36-1C-14 apache_access[10155]: Admin session from 192.168.1.20 ended after 2:09:34
May 01 22:28:58 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): +0.000615 sec
May 01 22:46:03 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): -0.000185 sec
May 01 23:03:08 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): -0.000433 sec
May 01 23:06:40 nas-36-1C-14 dbus[1921]: [system] Activating service name='org.opensuse.Snapper' (using servicehelper)
May 01 23:06:40 nas-36-1C-14 dbus[1921]: [system] Successfully activated service 'org.opensuse.Snapper'
May 01 23:06:41 nas-36-1C-14 snapperd[16257]: loading 2227 failed
May 01 23:06:41 nas-36-1C-14 snapperd[16257]: loading 2227 failed
May 01 23:06:41 nas-36-1C-14 snapperd[16257]: loading 2227 failed
May 01 23:06:41 nas-36-1C-14 snapperd[16257]: loading 2227 failed
May 01 23:06:41 nas-36-1C-14 snapperd[16257]: loading 2227 failed
May 01 23:17:01 nas-36-1C-14 CRON[16785]: pam_unix(cron:session): session opened for user root by (uid=0)
May 01 23:17:01 nas-36-1C-14 CRON[16786]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 01 23:17:01 nas-36-1C-14 CRON[16785]: pam_unix(cron:session): session closed for user root
May 01 23:20:13 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): -0.000199 sec
May 01 23:37:18 nas-36-1C-14 connmand[1922]: ntp: adjust (slew): +0.000127 sec
May 01 23:44:33 nas-36-1C-14 apache_access[10155]: Admin session started from 192.168.1.20
May 01 23:44:37 nas-36-1C-14 apache_access[10155]: Suppressed 2 duplicate messages
May 01 23:44:37 nas-36-1C-14 apache_access[10155]: 192.168.1.20 "POST /dbbroker HTTP/1.1" 401
May 01 23:45:55 nas-36-1C-14 apache_access[10155]: Suppressed 2 duplicate messages
May 01 23:45:55 nas-36-1C-14 apache_access[10155]: 192.168.1.20 "GET /ispdb.xml?_dc=1619878554438 HTTP/1.1" 200
May 01 23:52:27 nas-36-1C-14 dbus[1921]: [system] Activating service name='org.opensuse.Snapper' (using servicehelper)
May 01 23:52:28 nas-36-1C-14 dbus[1921]: [system] Successfully activated service 'org.opensuse.Snapper'
May 01 23:52:28 nas-36-1C-14 snapperd[18410]: loading 2227 failed
May 01 23:52:28 nas-36-1C-14 snapperd[18410]: loading 2227 failed
May 01 23:52:28 nas-36-1C-14 snapperd[18410]: loading 2227 failed
May 01 23:52:28 nas-36-1C-14 snapperd[18410]: loading 2227 failed
May 01 23:52:28 nas-36-1C-14 snapperd[18410]: loading 2227 failed
May 01 23:52:29 nas-36-1C-14 rnutil[18439]: Failed to get tier from /dev/md0 : 25:
May 01 13:25:57 nas-36-1C-14 kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x4250000 action 0xe frozen
May 01 13:25:57 nas-36-1C-14 kernel: ata1: irq_stat 0x80000040, connection status changed
May 01 13:25:57 nas-36-1C-14 kernel: ata1: SError: { PHYRdyChg CommWake BadCRC DevExch }
May 01 13:25:57 nas-36-1C-14 kernel: ata1: hard resetting link
May 01 13:26:05 nas-36-1C-14 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 01 13:26:06 nas-36-1C-14 kernel: ata1.00: ATA-9: WDC WD20EFRX-68EUZN0, 82.00A82, max UDMA/133
May 01 13:26:06 nas-36-1C-14 kernel: ata1.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
May 01 13:26:06 nas-36-1C-14 kernel: ata1.00: configured for UDMA/133
May 01 13:26:06 nas-36-1C-14 kernel: ata1: EH complete
May 01 13:26:06 nas-36-1C-14 kernel: scsi 0:0:0:0: Direct-Access ATA WDC WD20EFRX-68E 0A82 PQ: 0 ANSI: 5
May 01 13:26:06 nas-36-1C-14 kernel: sd 0:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
May 01 13:26:06 nas-36-1C-14 kernel: sd 0:0:0:0: [sdc] 4096-byte physical blocks
May 01 13:26:06 nas-36-1C-14 kernel: sd 0:0:0:0: [sdc] Write Protect is off
May 01 13:26:06 nas-36-1C-14 kernel: sd 0:0:0:0: [sdc] Mode Sense: 00 3a 00 00
May 01 13:26:06 nas-36-1C-14 kernel: sd 0:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
May 01 13:26:06 nas-36-1C-14 kernel: sd 0:0:0:0: Attached scsi generic sg3 type 0
May 01 13:26:06 nas-36-1C-14 kernel: sd 0:0:0:0: [sdc] Attached SCSI disk
May 01 13:26:08 nas-36-1C-14 kernel: sdc: sdc1
May 01 13:26:09 nas-36-1C-14 kernel: md: bind<sdc1>
May 01 13:26:09 nas-36-1C-14 kernel: RAID1 conf printout:
May 01 13:26:09 nas-36-1C-14 kernel: --- wd:1 rd:2
May 01 13:26:09 nas-36-1C-14 kernel: disk 0, wo:1, o:1, dev:sdc1
May 01 13:26:09 nas-36-1C-14 kernel: disk 1, wo:0, o:1, dev:sda1
May 01 13:26:09 nas-36-1C-14 kernel: md: recovery of RAID array md0
May 01 13:26:09 nas-36-1C-14 kernel: md: minimum _guaranteed_ speed: 30000 KB/sec/disk.
May 01 13:26:09 nas-36-1C-14 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May 01 13:26:09 nas-36-1C-14 kernel: md: using 128k window, over a total of 4190208k.
May 01 13:26:14 nas-36-1C-14 kernel: md: bind<sdc3>
May 01 13:26:14 nas-36-1C-14 kernel: RAID1 conf printout:
May 01 13:26:14 nas-36-1C-14 kernel: --- wd:1 rd:2
May 01 13:26:14 nas-36-1C-14 kernel: disk 0, wo:1, o:1, dev:sdc3
May 01 13:26:14 nas-36-1C-14 kernel: disk 1, wo:0, o:1, dev:sda3
May 01 13:26:14 nas-36-1C-14 kernel: md: recovery of RAID array md127
May 01 13:26:14 nas-36-1C-14 kernel: md: minimum _guaranteed_ speed: 30000 KB/sec/disk.
May 01 13:26:14 nas-36-1C-14 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May 01 13:26:14 nas-36-1C-14 kernel: md: using 128k window, over a total of 1948662784k.
May 01 13:26:16 nas-36-1C-14 kernel: md1: detected capacity change from 535822336 to 0
May 01 13:26:16 nas-36-1C-14 kernel: md: md1 stopped.
May 01 13:26:16 nas-36-1C-14 kernel: md: unbind<sda2>
May 01 13:26:16 nas-36-1C-14 kernel: md: export_rdev(sda2)
May 01 13:26:17 nas-36-1C-14 kernel: md: bind<sda2>
May 01 13:26:17 nas-36-1C-14 kernel: md: bind<sdc2>
May 01 13:26:17 nas-36-1C-14 kernel: md/raid1:md1: not clean -- starting background reconstruction
May 01 13:26:17 nas-36-1C-14 kernel: md/raid1:md1: active with 2 out of 2 mirrors
May 01 13:26:17 nas-36-1C-14 kernel: md1: detected capacity change from 0 to 535822336
May 01 13:26:17 nas-36-1C-14 kernel: md: resync of RAID array md1
May 01 13:26:17 nas-36-1C-14 kernel: md: minimum _guaranteed_ speed: 30000 KB/sec/disk.
May 01 13:26:17 nas-36-1C-14 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
May 01 13:26:17 nas-36-1C-14 kernel: md: using 128k window, over a total of 523264k.
May 01 13:26:20 nas-36-1C-14 kernel: Adding 523260k swap on /dev/md1. Priority:-1 extents:1 across:523260k
May 01 13:26:41 nas-36-1C-14 kernel: md: md1: resync done.
May 01 13:26:42 nas-36-1C-14 kernel: RAID1 conf printout:
May 01 13:26:42 nas-36-1C-14 kernel: --- wd:2 rd:2
May 01 13:26:42 nas-36-1C-14 kernel: disk 0, wo:0, o:1, dev:sda2
May 01 13:26:42 nas-36-1C-14 kernel: disk 1, wo:0, o:1, dev:sdc2
May 01 13:28:20 nas-36-1C-14 kernel: md: md0: recovery done.
May 01 13:28:20 nas-36-1C-14 kernel: RAID1 conf printout:
May 01 13:28:20 nas-36-1C-14 kernel: --- wd:2 rd:2
May 01 13:28:20 nas-36-1C-14 kernel: disk 0, wo:0, o:1, dev:sdc1
May 01 13:28:20 nas-36-1C-14 kernel: disk 1, wo:0, o:1, dev:sda1
May 01 20:25:48 nas-36-1C-14 kernel: md: md127: recovery done.
May 01 20:25:48 nas-36-1C-14 kernel: RAID1 conf printout:
May 01 20:25:48 nas-36-1C-14 kernel: --- wd:2 rd:2
May 01 20:25:48 nas-36-1C-14 kernel: disk 0, wo:0, o:1, dev:sdc3
May 01 20:25:48 nas-36-1C-14 kernel: disk 1, wo:0, o:1, dev:sda3
- StephenB May 01, 2021 Guru - Experienced User
dstsui wrote:
Is it worth taking Disk 2 out and run an extended diagnostic test with WD Dashboard just to be safe? Do you think I should replace Disk 2 as well even though it may pass the test?
The disk test will run a full read test of both disks, so there is no need to do anything with the Dashboard program.
Personally I leave my disks in place until I see some evidence that they are misbehaving. So I'd leave disk 2 alone for now.
dstsui wrote:
Should I be concerned about this message in the system log?
snapperd[12524]: loading 2227 failed
That is a bit concerning; it suggests something isn't quite right with the file system (indicating that a snapshot can't be loaded).
Are you using the "smart" snapshots? I don't like them myself, because some of them are retained forever (and the disk space they need gradually goes up). I use custom snapshots and explicitly set retention. Most shares are set to 3 months.
Anyway, I suggest deleting all your current snapshots from the web ui. After that try doing a scrub from the volume settings wheel after the disk test completes. Then see what file system errors show up in the log as a result of the scrub.
Or do the factory default, which will obviously take some time, but will also give you a completely clean file system without needing to do more analysis and repair.
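If you have SSH access enabled, one quick way to look for file system errors after the scrub is to grep the kernel log for BTRFS messages. This is only a sketch: the sample log lines below are illustrative (not verbatim ReadyNAS output), and `kernel.log` is the assumed file name inside the downloaded log zip.

```shell
log=kernel.log   # assumed name of the kernel log in the log bundle

# Illustrative stand-in for a real kernel log (for the sketch only)
cat > "$log" <<'EOF'
May 03 01:12:09 nas kernel: BTRFS info (device md127): scrub started
May 03 09:44:51 nas kernel: BTRFS error (device md127): unable to fixup (regular) error at logical 123456
May 03 09:50:02 nas kernel: BTRFS info (device md127): scrub finished
EOF

# Any "error" or "warning" lines here point at file system damage
grep -E 'BTRFS (error|warning)' "$log"
```

If the grep comes back empty after the scrub, that's a good sign the corruption was limited to the snapshot metadata.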
- dstsui May 01, 2021 Aspirant
The disk test finished without drama. Where can I see the results?
Some of the oldest snapshots date back to Jan 2015. I guess it is safe to delete Shares older than, say, one year. I will run a Scrub after I delete the old shares.
- dstsui May 02, 2021 Aspirant
I noticed something interesting in the snapshot history. Before March, snapshots were taken once a month, but since March they have been taken daily. The oldest snapshots date back to when I set up the system in Jan 2015; shouldn't they have been purged already?
Tonight I copied 489 images from an SD card on my PC to the RN102, and it clocked a whopping 10MB/s. At least it finished without crashing, but something has to be done. I wonder whether all these legacy snapshots are weighing it down.
- mdgm May 02, 2021 Virtuoso
Does your system show how much space is consumed by snapshots? If so, you have quotas enabled on your data volume, and disabling them may help.
By default, monthly snapshots are kept indefinitely unless there is limited free space on the data volume.
A custom snapshot schedule only keeping a more limited number of snapshots would likely lead to better performance too as StephenB mentioned.
Note that if you delete a snapshot all newer snapshots are recursively updated. This can take a while to complete so performance would get worse before it improves.
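When quotas are enabled, `btrfs qgroup show` (a standard btrfs-progs command, available over SSH) reports per-subvolume usage, and the "exclusive" column is the space that would actually be freed by deleting that snapshot. A minimal sketch of totalling that column with awk; the sample output below is simplified and illustrative, not captured from a real NAS:

```shell
# Columns stand in for: qgroupid, referenced, exclusive
cat > qgroup.txt <<'EOF'
0/256   120.00GiB   1.50GiB
0/257   119.80GiB   0.90GiB
0/258   119.75GiB   2.10GiB
EOF

# Sum the exclusive column -- space held only by each snapshot
awk '{gsub(/GiB/,"",$3); sum += $3} END {printf "%.2f GiB exclusive\n", sum}' qgroup.txt
```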
- StephenB May 03, 2021 Guru - Experienced User
mdgm wrote:
By default monthly snapshots are kept indefinitely unless there is limited free space on the data volume
A custom snapshot schedule only keeping a more limited number of snapshots would likely lead to better performance too as StephenB mentioned.
Note that if you delete a snapshot all newer snapshots are recursively updated. This can take a while to complete so performance would get worse before it improves.
Also, your error is saying that one particular snapshot is somehow damaged and can't be read. We don't know how old it is - and you might not be able to delete it. So there are multiple issues here - the bad disk was one, but the snapshot error suggests some file system corruption.
It's possible that the poor performance will continue after you get rid of the ancient snapshots, but even if it does, you need to address the file system corruption.
Overall, the direct way is to do the factory default, set up the NAS again, and restore the files from backup. Though painful, that is guaranteed to clean up everything.
But if you really can't do that, then look for more file system errors after you delete the old snapshots, and look again after the scrub. If they are gone, follow up with a balance (also from the volume settings wheel). Then measure performance again.
- dstsui May 03, 2021 Aspirant
I could delete all Shares without issues. I started a disk scrub last night, and when I checked the status this morning it had only progressed 2.5%. If the rate is constant, the scrub won't finish for another 13 days. Is it meant to be that slow?
- StephenB May 04, 2021 Guru - Experienced User
dstsui wrote:
I started a disk scrub last night, and when I checked the status this morning it had only progressed 2.5%. If the rate is constant, the scrub won't finish for another 13 days. Is it meant to be that slow?
The rate might not be constant, but it is slower than I'd expect.
I did a scrub on my RN102 (which has a 2x1TB volume) about a month ago, and it took about 5 1/2 hours. So I'd expect about 11-12 hours for 2x2TB.
The scrub does two different operations in parallel - one is a RAID scrub (which is the same as a resync), the second is a BTRFS scrub. The BTRFS scrub is what we're aiming for here - it might take longer than usual because you've never run one before.
In any event, I'd wait for it to complete. You could download a fresh set of logs, and see if there are any new disk errors (or btrfs errors).
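For what it's worth, the 13-day figure is just linear extrapolation from the progress so far, and that arithmetic is easy to redo as the rate changes. A back-of-the-envelope sketch (not a ReadyNAS tool; the 2.5% and 8-hour figures are assumptions based on "overnight"):

```shell
percent=2.5     # progress reported so far
elapsed_h=8     # assumed hours since the scrub started

# Extrapolate total time at a constant rate
awk -v p="$percent" -v e="$elapsed_h" 'BEGIN {
  total = e * 100 / p            # hours for the full scrub
  printf "est. total: %.0f h (%.1f days)\n", total, total/24
}'
```

If the BTRFS scrub speeds up once it gets past the heavily-snapshotted regions, the real finish time should come in well under that estimate.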
dstsui wrote:
I could delete all Shares without issues.
Do you mean Snapshots, or did you delete the Shares?