tony359
Mar 30, 2023 Apprentice
ReadyNAS Pro 6 crashed again
Hello all, My ReadyNAS Pro6 periodically stops responding to the network. When that happens I can push the button to shut it down but it will sit on "shutting down" forever and then I'll have to p...
tony359
Jun 09, 2023 Apprentice
Considering it’s happening very rarely it’s not such a bad idea.
I’ve run RAM tests overnight, but maybe they didn’t catch it because it happens so rarely.
I still have the original CPU so I could try that too.
That said, the fact that just the network went down last time is suspicious. A RAM or CPU issue would have a much bigger impact, I reckon. I might want to put a switch in between the NAS and the main switch. It’s always been that switch and maybe it’s faulty. After all, the NAS stopped crashing when I took it off the main network, which takes the main switch out of the equation.
And it worked for a while while connected to my main desktop, again no main switch involved.
Uhm… I like this idea 🙂
StephenB
Jun 09, 2023 Guru
tony359 wrote:
I might want to put a switch in between the nas and the main switch. 🙂
Makes sense. You could also swap the two connections, and see if the problem moves.
- Sandshark Jun 09, 2023 Sensei
Is it a "green" switch? I've had a couple issues with ReadyNAS and green switches, though I've believed the problem units already had partly damaged LAN ports. My main switch has a "green" on/off selection. Try turning off power saving mode if yours does. Otherwise, a non-green switch in between might be the answer.
- tony359 Jun 09, 2023 Apprentice
It's a Netgear! 🙂
GS108Ev2. "partly" managed. V1.00.12 (latest). DHCP disabled. DHCP is handled by the router (Fritzbox) which issues the same IP to the NAS MAC address. All settings are default to be honest.
I have tried a static IP in the past with no change - though, I'm confident those swollen capacitors might have contributed to SOME of the issues I was having.
Today's new issue is... the NAS is online, I can see the files. I can SSH into it. But the web interface shows a "500 - internal server error". This is on both ports. Sigh 🙂
Before I just reboot the box, how would I restart the web interface from SSH?
I'll install a dumb switch between the NAS and the main switch - with new cables.
The 7600 seems to be a good option. It's only 2 cores but it's faster than the cores in the 6600. I wonder how much a NAS used as a "file system" is actually using a multi-core CPU. And the 7600 as you say is cooler.
I think I'll fix this issue first then I might try the 7600 as well, thanks for the hint!
- tony359 Jun 09, 2023 Apprentice
I feel that the below is relevant to my issue. Again, the NAS is accessible, I can write a file to the data folder via nano. I just lost the web interface.
These weird failures are incredibly annoying. I'd like to test what itachi2 recommended, can someone possibly point me in the right direction? See https://community.netgear.com/t5/Using-your-ReadyNAS-in-Business/ReadyNAS-Pro-6-crashed-again/m-p/2316638/highlight/true#M199640
Thanks 🙂
root@Enterprise-NAS:/# systemctl status apache2
Failed to get properties: Activation of org.freedesktop.systemd1 timed out
root@Enterprise-NAS:/#
root@Enterprise-NAS:/#
root@Enterprise-NAS:/#
root@Enterprise-NAS:/#
root@Enterprise-NAS:/# systemctl restart apache2
Failed to restart apache2.service: Activation of org.freedesktop.systemd1 timed out
See system logs and 'systemctl status apache2.service' for details.
root@Enterprise-NAS:/# sudo systemctl status apache2
-bash: sudo: command not found
root@Enterprise-NAS:/# su
root@Enterprise-NAS:/# systemctl status apache2
Failed to get properties: Activation of org.freedesktop.systemd1 timed out
root@Enterprise-NAS:/# systemctl status readynasd
Failed to get properties: Activation of org.freedesktop.systemd1 timed out
root@Enterprise-NAS:/# ps aux | grep readynasd
root 26625 0.0 0.0 17836 1008 pts/2 S+ 19:58 0:00 grep readynasd
root@Enterprise-NAS:/# service ctscand stop
Failed to stop ctscand.service: Connection timed out
See system logs and 'systemctl status ctscand.service' for details.
Failed to get load state of ctscand.service: Connection timed out
root@Enterprise-NAS:/# systemctl restart readynasd
Failed to restart readynasd.service: Activation of org.freedesktop.systemd1 timed out
See system logs and 'systemctl status readynasd.service' for details.
root@Enterprise-NAS:/# systemctl status readynasd.service
Failed to get properties: Activation of org.freedesktop.systemd1 timed out
- KDS Jun 09, 2023 Guide
Just another hardware thing that has probably already happened.
1. After good PSU installed was CMOS cleared?
2. Has CMOS battery been checked?
3. Are you keeping it simple with just 1 HDD, possibly 2 (raid 1), with HDDs especially raid arrays cleaned and cleared on another PC prior to installing. Granted you may have data on your system, though remove those HDDs and start fresh, with known clean and good drives? I tested with some old 320GB junk drives I had kicking about. I also encountered NIC, web access, and HDD problems prior to replacing the PSU. My original 7200 WD HDDs were only seen as 5900, then when I added a newer 7200 WD HDD it was seen as 7200, it did not like the mismatch in HDD speed that it saw.
Though finally did clean HDDs. I think Web interface may be associated with what is already on the HDDs.
My HDD and hardware issues were resolved when I replaced PSU. Both types drives 3 x 7200 seen as 5900, and 3 x 7200 seen as 7200, and running together fine.
4. BTW are you using RAIDar 6.5.0?
- tony359 Jun 09, 2023 Apprentice
Just another hardware thing that has probably already happened.
1. After good PSU installed was CMOS cleared?
--
No, I did not update the BIOS so I didn't think of clearing the CMOS. I can try.
2. Has CMOS battery been checked?
--
No. Good point.
3. Are you keeping it simple with just 1 HDD, possibly 2 (raid 1), with HDDs especially raid arrays cleaned and cleared
--
No. Reason is: last time the system behaved, it lasted for 2 months. I cannot stay without my data for 2 months.
The only two options here are
a. Fix it with the current setup
b. try a factory default and migrate a backup
Testing with 2 random HDDs is unlikely to yield any evidence, I'm afraid.
I also encountered NIC, web access, and HDD problems prior to replacing the PSU. My original 7200 WD HDDs were only seen as 5900, then when I added a newer 7200 WD HDD it was seen as 7200, it did not like the mismatch in HDD speed that it saw.
--
Unfortunately the replacement PSU did not solve all the problems. I'm confident some of the issues I experienced were caused by the bad PSU but the NAS is still misbehaving I'm afraid.
All my HDDs are WD RED, 5400-ish (4TB are a bit slower than the 6TB).
4. BTW are you using RAIDar 6.5.0.
--
No. I am on OS6.
I appreciate a factory reset would be a good idea but I have 13TB on that NAS and I don't know where to store them for a backup. Yes, the NAS is more or less fully backed up (locally and online) but it would take me forever to restore those backups so I'd consider that as an emergency option only.
I could see if I could hire another NAS, transfer the data, reset and restore. But somehow I am not confident my problems would go away 🙂
Thanks for your input!
- tony359 Jun 10, 2023 Apprentice
Little update.
I checked the battery, it's ok, 3.1V. I replaced it some time ago when I serviced the box.
I re-reset the BIOS (only thing I change is the default fan speed!)
I swapped position of HDD0 with HDD4. I sprayed dry contact cleaner on the backplane and on the HDDs, cleaned with a small q-tip.
Once the NAS was powered up again, HDD0 failed to show up on the BIOS splash page straight away. So it's not the HDD and, to be honest, I feel that that might be a red herring. I never had issues with HDD0 so maybe it's a BIOS bug which then does not affect the software. No idea. But I now know it's not the drive.
I've added a TP-Link switch between the main switch and the NAS.
Next: throwing the NAS out of the window.
- tony359 Jun 11, 2023 Apprentice
And no, the NAS disappeared again.
Solution: SSH in via the other (still working) port, then use ifconfig to bring the failed port DOWN and then UP again.
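For anyone following along, the down/up workaround can be scripted roughly as below. This is only a sketch: `bounce_iface` is a made-up helper name, the interface name is an assumption (use whichever port has gone quiet), and it needs to be run as root from a session on the port that still works. Setting `RUN=echo` prints the commands instead of running them.

```shell
# Hypothetical helper: take an interface down and bring it back up.
# Run it from an SSH session on the OTHER port, or you will cut yourself off.
bounce_iface() {
    iface="$1"                 # assumption: e.g. eth0 or eth1, whichever stopped responding
    run="${RUN:-}"             # RUN=echo gives a dry run that only prints the commands
    $run ifconfig "$iface" down || return 1
    sleep "${PAUSE:-2}"        # short pause before bringing the link back
    $run ifconfig "$iface" up
}
```

A dry run would look like `RUN=echo bounce_iface eth0`; dropping `RUN=echo` performs the actual down/up.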
I could try swapping the config but I think I tried that in the past already.
If someone could give me some directions for checking the HDDs offline as mentioned above, that would be great! 🙂
Thanks
- tony359 Jun 11, 2023 Apprentice
That's what I meant with "swapping the config" sorry. As in swap the IP addresses between ports.
I'll try but I think I tried that in the past already. 100% worth a try.
- tony359 Jun 12, 2023 Apprentice
The NICs are on two different IP ranges: one main network, one PC only.
What used to be on main network is now directly connected to the PC and what used to be connected to the PC is now connected to the main network and I've swapped the IP addresses accordingly.
I did that yesterday and I've just checked: NAS has disappeared. Sigh!
I SSH'd through the other NIC, restarted it and it worked as usual.
So
- It's not the specific NIC
- It's not the switch
It's curious that it's always the NIC on the main network failing and not the other.
Help 🙂
- schumaku Jun 12, 2023 Guru
Since you are in the lucky situation of having an alternate LAN interface (and IP subnet) available, what does the kernel output show when the device has "disappeared"?
# dmesg
The risk that a network adapter becomes flaky is very small. More typically, the adapter (or rather its data connectivity) disappears completely, and the UPnP OS no longer detects the adapter.
Most problems on such NASes are caused by the RAID becoming inoperable, due to aged or breaking storage blocks.
Do you have a known working, reliable SATA storage block at hand to set up the NAS with a single-device volume, or two on a RAID 1 volume? Remove the potentially unhealthy storage blocks, and restart a test from scratch.
- tony359 Jun 12, 2023 Apprentice
Thanks, I'll test next time.
Many have (rightly) recommended a test with a couple of random HDDs. I have plenty so that wouldn't be an issue.
My concern is that sometimes the NAS stays online for weeks without issues and I really cannot keep my data offline for so long.
Is there a way to do an offline test of my drives? Someone recommended booting from a Debian Live-USB but I would need some minor guidance on that. I know how to make the USB, I'm just making sure (as much as possible) I don't do anything that can destroy my data.
Thanks! 🙂
- schumaku Jun 12, 2023 Guru
Start by retrieving the SMART data from the storage block (aka disk). Next, trigger a SMART self-test of the storage block (rapid first, then full). Then retrieve the SMART data again.
You can do this on any platform, without erasing, re-partitioning, or re-formatting the storage block, if done carefully of course.
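As a rough illustration of that sequence with smartmontools: the helper name `smart_step` and the disk path are assumptions, not something from this thread, and `SMARTCTL` is only an override hook for previewing the commands. `smartctl -t` returns immediately and the test then runs inside the drive, so wait for each test to finish before starting the next step.

```shell
# Hypothetical helper following the order above: report, short test, long test, report.
# Assumes smartmontools is installed and the shell is running as root.
smart_step() {
    step="$1"                       # before | short | long | after
    disk="$2"                       # assumption: e.g. /dev/sda, adjust to the disk under test
    smartctl_cmd="${SMARTCTL:-smartctl}"
    case "$step" in
        before|after) $smartctl_cmd -x "$disk" ;;            # full SMART report
        short|long)   $smartctl_cmd -t "$step" "$disk" ;;    # start the self-test
        *) echo "usage: smart_step before|short|long|after DISK" >&2; return 2 ;;
    esac
}

# Possible usage (commented out so nothing runs by accident):
#   smart_step before /dev/sda > smart_before.txt
#   smart_step short  /dev/sda      # wait for the time smartctl announces
#   smart_step long   /dev/sda      # this can take many hours on large disks
#   smart_step after  /dev/sda > smart_after.txt
```

Comparing the two saved reports (for example with `diff`) shows whether any counters moved during the tests.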
- tony359 Jun 12, 2023 Apprentice
Thanks.
I’ll Google how to do that. 🙂
Just to double check: do you mean doing those checks on the NAS itself while it’s online?
- StephenB Jun 13, 2023 Guru
tony359 wrote:
Is there a way to do an offline test of my drives? 🙂
There is an on-line test in the maintenance menu you can use. That runs the full built-in smart test on all the drives in the volume.
You can also use smartctl -x /dev/sda from ssh to see more errors (UNCs in particular) on sda (or whatever disk you wish).
As far as off-line goes, the simplest way is to connect the drive to a Windows PC and run the vendor diag - Dashboard for WDC, and Seatools for Seagate. Unfortunately they don't run on MacOS.
But it seems to me that your symptoms are pointing either to the switch or perhaps the cable going from the NAS to the switch. It's always the NIC port connected to that switch that fails, and the other NIC always continues to work fine.
- tony359 Jun 13, 2023 Apprentice
Hi Stephen,
No, the ports were swapped last time - also the switch and the cable. So it's not a NIC or Network issue. Well. It ALWAYS fails on that NETWORK so it could be something on my main network. But on this occasion the NAS was wired to the main switch on another port and through an additional switch. So if it's something with that network, it's not a HW issue.
The online maintenance runs periodically. The logs show an "offline" test though. How should I read that? The drive is now 51888hrs.
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      90%     50227         -
# 2  Extended offline    Completed without error       00%     48081         -
# 3  Extended offline    Completed without error       00%     45875         -
# 4  Extended offline    Completed without error       00%     43691         -
# 5  Extended offline    Completed without error       00%     41536         -
# 6  Extended offline    Completed without error       00%     39834         -
# 7  Extended offline    Completed without error       00%     37636         -
# 8  Extended offline    Completed without error       00%     35455         -
# 9  Extended offline    Completed without error       00%     33273         -
#10  Extended offline    Completed without error       00%     31118         -
#11  Extended offline    Completed without error       00%     28912         -
#12  Extended offline    Completed without error       00%     26707         -
#13  Extended offline    Completed without error       00%     24525         -
#14  Extended offline    Completed without error       00%     22554         -
#15  Extended offline    Completed without error       00%     20712         -
#16  Extended offline    Completed without error       00%     19182         -
#17  Short offline       Completed without error       00%        82         -
#18  Short offline       Completed without error       00%        63         -
I ran smartctl -x in the past and posted the output earlier in this thread. I didn't spot anything but I am not an expert. There are UNC errors on SDA (which I have now moved to SDE) but at 7872 hours, a few years ago! 🙂
Error 159 [14] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 4b 2b cc 40 40 00 Error: WP at LBA = 0x4b2bcc40 = 1261161536
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 04 00 00 08 00 00 4b 2b c8 40 40 08 14:59:14.849 WRITE FPDMA QUEUED
60 04 00 00 00 00 00 4b 2b cc 40 40 08 14:59:14.849 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 14:59:14.849 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08 14:59:14.849 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08 14:59:14.849 IDENTIFY DEVICE

Error 158 [13] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 4b 2b cc 40 40 00 Error: UNC at LBA = 0x4b2bcc40 = 1261161536
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 04 00 00 08 00 00 4b 2b cc 40 40 08 14:59:11.031 READ FPDMA QUEUED
61 04 00 00 00 00 00 4b 2b c8 40 40 08 14:59:11.031 WRITE FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 14:59:11.031 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08 14:59:11.031 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08 14:59:11.030 IDENTIFY DEVICE

Error 157 [12] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 4b 2b cc 40 40 00 Error: WP at LBA = 0x4b2bcc40 = 1261161536
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 04 00 00 08 00 00 4b 2b c8 40 40 08 14:59:07.223 WRITE FPDMA QUEUED
60 04 00 00 00 00 00 4b 2b cc 40 40 08 14:59:07.223 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 14:59:07.223 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08 14:59:07.223 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08 14:59:07.223 IDENTIFY DEVICE

Error 156 [11] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 4b 2b cc 40 40 00 Error: UNC at LBA = 0x4b2bcc40 = 1261161536
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 04 00 00 08 00 00 4b 2b cc 40 40 08 14:59:03.405 READ FPDMA QUEUED
61 04 00 00 00 00 00 4b 2b c8 40 40 08 14:59:03.405 WRITE FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 14:59:03.405 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08 14:59:03.405 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08 14:59:03.405 IDENTIFY DEVICE

Error 155 [10] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 4b 2b cc 40 40 00 Error: WP at LBA = 0x4b2bcc40 = 1261161536
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 04 00 00 08 00 00 4b 2b c8 40 40 08 14:58:59.720 WRITE FPDMA QUEUED
60 04 00 00 00 00 00 4b 2b cc 40 40 08 14:58:59.720 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 14:58:59.720 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08 14:58:59.720 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08 14:58:59.719 IDENTIFY DEVICE

Error 154 [9] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 4b 2b cc 40 40 00 Error: UNC at LBA = 0x4b2bcc40 = 1261161536
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 04 00 00 08 00 00 4b 2b cc 40 40 08 14:58:55.900 READ FPDMA QUEUED
61 04 00 00 00 00 00 4b 2b c8 40 40 08 14:58:55.900 WRITE FPDMA QUEUED
ea 00 00 00 00 00 00 00 00 00 00 e0 08 14:58:55.873 FLUSH CACHE EXT
60 00 08 00 08 00 00 00 7f 22 18 40 08 14:58:55.838 READ FPDMA QUEUED
61 00 02 00 00 00 00 00 00 00 48 40 08 14:58:55.838 WRITE FPDMA QUEUED

Error 153 [8] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 4b 2b c8 40 40 00 Error: UNC at LBA = 0x4b2bc840 = 1261160512
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 04 00 00 00 00 00 4b 2b c8 40 40 08 14:58:52.283 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 14:58:52.283 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08 14:58:52.283 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08 14:58:52.282 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 14:58:52.282 SET FEATURES [Set transfer mode]

Error 152 [7] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 4b 2b c8 40 40 00 Error: UNC at LBA = 0x4b2bc840 = 1261160512
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 04 00 00 00 00 00 4b 2b c8 40 40 08 14:58:48.786 READ FPDMA QUEUED
ef 00 10 00 02 00 00 00 00 00 00 a0 08 14:58:48.786 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 00 00 00 00 00 e0 08 14:58:48.786 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 00 00 00 00 00 a0 08 14:58:48.786 IDENTIFY DEVICE
ef 00 03 00 46 00 00 00 00 00 00 a0 08 14:58:48.786 SET FEATURES [Set transfer mode]
I am on Windows so that's fine, but wouldn't it be better to run the tests on a Linux system so the file system can be checked as well? Also, I think I'd prefer the disks to stay unmounted so I know I have less chance of damaging the RAID.
Can I start the NAS from a Debian live-USB? I could run the checks from there, assuming VGA works there. And what do you think of that suggestion of running btrfs-check on the drives? I don't dislike the idea of checking the file system.
The NAS disappeared again so I've run dmesg and it's attached (this forum lacks the ability to attach text files!).
Do I see lots of network going down messages after what seems to be a gap? And both ETH0 and ETH1.
Disabling and re-enabling ETH0 worked as usual.
And yes, I've now disabled IPv6 (it got re-enabled when I swapped the IPs I think)
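One way to answer the "lots of going down messages" question without reading the log by eye is to count the events. The sketch below assumes the driver prints lines of the form `network connection down` and `speed: N`, as in the dmesg output from this NAS, and that the speed figure is in Mb/s; `link_summary` is a made-up name.

```shell
# Hypothetical helper: summarise link flaps from dmesg-style text on stdin.
# Counts "network connection down" events and how often the link came back
# below gigabit (speed < 1000) versus at gigabit.
link_summary() {
    awk '
        /network connection down/ { down++ }
        /speed:/ { if ($NF + 0 < 1000) slow++; else fast++ }
        END { printf "down=%d slow=%d gigabit=%d\n", down, slow, fast }'
}
```

Typical use would be `dmesg | link_summary`, or `dmesg | grep eth1 | link_summary` if only the `down` count for one port matters (the `speed:` lines carry no interface name, so they are dropped by that grep).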
- Sandshark Jun 13, 2023 Sensei
Yes, you can start a legacy NAS from a Debian Live USB (or even DOS or Windows). Native OS6 models are more picky about what they will start from.
- StephenB Jun 13, 2023 Guru
tony359 wrote:
I am on Windows so that's fine, but wouldn't it be better to run the tests on a Linux system so the file system can be checked as well? Also, I think I'd prefer the disks to stay unmounted so I know I have less chance of damaging the RAID.
I don't think so. If you needed that, I'd do it in the NAS.
I really don't see how this can be the disks or the file system. If it were, the second NIC wouldn't be responsive when the problem occurs. Plus normal operation wouldn't resume when you set the interface down and then up again.
tony359 wrote:
No, the ports were swapped last time - also the switch and the cable. So it's not a NIC or Network issue. Well. It ALWAYS fails on that NETWORK so it could be something on my main network.
I think definitely a network issue, though perhaps not the physical layer. The puzzle is what.
Are you using the NAS differently on the main network than you are on the PC connection?
The history here is of course extensive, and I'm having trouble keeping everything straight. Did the NAS ever lock up when it was only connected to the main network (with the PC NIC disconnected)?
tony359 wrote:
The online maintenance runs periodically. The logs show an "offline" test though. How should I read that? The drive is now 51888hrs.
The "extended offline" record is actually the test you run from the maintenance settings. No idea why it is described as "offline" by smartctl.
You should also see it at the end of volume.log. It looks like the NAS crashed (or was shut down) before the most recent test finished.
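To see how long ago the last extended test actually completed, the self-test log can be filtered as sketched below. `last_completed_extended` is a made-up helper; it assumes the usual `smartctl -l selftest` layout, where the newest entry comes first and the power-on hours follow the "Remaining" percentage.

```shell
# Hypothetical helper: print the power-on hours of the newest completed extended self-test.
# Reads "smartctl -l selftest" output on stdin; interrupted tests are skipped.
last_completed_extended() {
    awk '/Extended offline/ && /Completed without error/ {
             for (i = 1; i < NF; i++)
                 if ($i ~ /%$/) { print $(i + 1); exit }   # hours follow the "Remaining" column
         }'
}
```

For example, `smartctl -l selftest /dev/sda | last_completed_extended`. With the log posted above this would give 48081, so at 51888 power-on hours the last completed extended test was 3807 hours ago.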
- tony359 Jun 13, 2023 Apprentice
>I don't think so. If you needed that, I'd do it in the NAS.
>I really don't see how this can be the disks or the file system. If it were, the second NIC wouldn't be responsive when the problem occurs. Plus normal operation wouldn't resume when you set the interface down and then up again.
I appreciate your view and I don't disagree with it.
But this has been going on for months and I've tried many things short of a new set of HDDs.
Before I start messing up with my data I'd like to exhaust all the options.
One of them is to do an offline check via Live-CD. As I am not super-skilled with Linux and I care about my data, can someone roughly guide me so I don't obliterate my data? 🙂
I guess I'll boot from a Live USB, the 5 RAID HDDs are not going to be mounted by default.
I can then run
btrfs check --readonly /dev/sd(x)
This should check the file system?
Then smartctl -t long /dev/sd(x)
Anything else anybody can think I should do while the HDDs are offline?
>I think definitely a network issue, though perhaps not the physical layer. The puzzle is what.
>Are you using the NAS differently on the main network than you are on the PC connection?
>The history here is of course extensive, and I'm having trouble keeping everything straight. Did the NAS ever lock up when it was only connected to the main network (with the PC NIC disconnected)?
The PC and the NAS are plugged on the same switch. There is nothing running on the NAS. I only use it as File System.
I appreciate the history is long and I thank you for bearing with me for so long and not suggesting I should go buy a Qnap 🙂
The second NIC connected to the PC is a recent addition as I discovered that when the NAS disappears I can still access it via the other NIC. The behaviour hasn't changed since I also plugged the PC directly into the NAS.
Months ago, the NAS stopped misbehaving when I completely disconnected it from ANY networks.
A week later I plugged it into the PC only (no main network, no internet)
Some weeks of good behaviour later, I put the NAS back on the main network, removing some port forwarding I had in the main router.
It worked PERFECTLY for 2 months.
Then it started disappearing twice a day. Out of the blue.
This is why I am pursuing unlikely routes: the above events point to NOTHING! 🙂
- tony359 Jun 13, 2023 Apprentice
quick addendum:
I've made a live-USB of Debian, played with it and a random HDD which I formatted btrfs.
If anybody has any suggestions on what to test while offline, please do let me know!
Also, if someone has any suggestions on what NOT to do while playing with those HDD, please also do let me know!
- StephenB Jun 13, 2023 Guru
You'd need to assemble the RAID array and mount it in order to run btrfs check.
Since your system boots, you can just run ssh (logging in as root), and run btrfs check from there. The device would be /dev/md127 (the raid array virtual disk).
Use --force because the file system is mounted. It won't try to repair anything, so no need to worry about read-only. Don't write anything to the data volume while it is running.
root@RN102:~# btrfs check --force /dev/md127
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/md127
...
You can also run smartctl from ssh (/dev/sda, etc), so no need to use the live CD there either.
root@RN102:~# smartctl --test=long /dev/sda
smartctl 6.6 2017-11-05 r4594 [armv7l-linux-4.4.218.armada.1] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 127 minutes for test to complete.
Test will complete after Tue Jun 13 20:35:03 2023
Use smartctl -X to abort test.
root@RN102:~#
- tony359 Jun 13, 2023 Apprentice
Re btrfs I was thinking that exactly - how can the OS check files if they're part of a raid?
But I was under the impression that the disks should be unmounted in order for those checks to be properly done?
I'm referring to this message: https://community.netgear.com/t5/Using-your-ReadyNAS-in-Business/ReadyNAS-Pro-6-crashed-again/m-p/2316608/highlight/true#M199637
Also, 126 is the main data volume, should I also check the ones where the OS is stored? I have MD0 (4GB), MD1 (1.3GB), MD127 (1.8TB), MD126 (14.5TB).
- tony359 Jun 14, 2023 Apprentice
First disk passed the long SMART test successfully, now onto the second one.
Meanwhile the NAS disappeared. I SSH'd in via the second port and revived it the usual way.
DMESG adds the following from yesterday
[Tue Jun 13 16:14:02 2023] eth1: network connection down
[Tue Jun 13 16:14:05 2023] eth1: network connection up using port A
[Tue Jun 13 16:14:05 2023] interrupt src: MSI
[Tue Jun 13 16:14:05 2023] speed: 10
[Tue Jun 13 16:14:05 2023] autonegotiation: yes
[Tue Jun 13 16:14:05 2023] duplex mode: full
[Tue Jun 13 16:14:05 2023] flowctrl: none
[Tue Jun 13 16:14:05 2023] tcp offload: enabled
[Tue Jun 13 16:14:05 2023] scatter-gather: enabled
[Tue Jun 13 16:14:05 2023] tx-checksum: enabled
[Tue Jun 13 16:14:05 2023] rx-checksum: enabled
[Tue Jun 13 16:14:05 2023] rx-polling: enabled
[Tue Jun 13 18:08:49 2023] eth1: network connection down
[Tue Jun 13 18:08:54 2023] eth1: network connection up using port A
[Tue Jun 13 18:08:54 2023] interrupt src: MSI
[Tue Jun 13 18:08:54 2023] speed: 1000
[Tue Jun 13 18:08:54 2023] autonegotiation: yes
[Tue Jun 13 18:08:54 2023] duplex mode: full
[Tue Jun 13 18:08:54 2023] flowctrl: symmetric
[Tue Jun 13 18:08:54 2023] role: slave
[Tue Jun 13 18:08:54 2023] tcp offload: enabled
[Tue Jun 13 18:08:54 2023] scatter-gather: enabled
[Tue Jun 13 18:08:54 2023] tx-checksum: enabled
[Tue Jun 13 18:08:54 2023] rx-checksum: enabled
[Tue Jun 13 18:08:54 2023] rx-polling: enabled
[Tue Jun 13 18:55:48 2023] eth1: network connection down
[Tue Jun 13 18:55:51 2023] eth1: network connection up using port A
[Tue Jun 13 18:55:51 2023] interrupt src: MSI
[Tue Jun 13 18:55:51 2023] speed: 10
[Tue Jun 13 18:55:51 2023] autonegotiation: yes
[Tue Jun 13 18:55:51 2023] duplex mode: full
[Tue Jun 13 18:55:51 2023] flowctrl: none
[Tue Jun 13 18:55:51 2023] tcp offload: enabled
[Tue Jun 13 18:55:51 2023] scatter-gather: enabled
[Tue Jun 13 18:55:51 2023] tx-checksum: enabled
[Tue Jun 13 18:55:51 2023] rx-checksum: enabled
[Tue Jun 13 18:55:51 2023] rx-polling: enabled
[Tue Jun 13 18:55:53 2023] eth1: network connection down
[Tue Jun 13 18:55:56 2023] eth1: network connection up using port A
[Tue Jun 13 18:55:56 2023] interrupt src: MSI
[Tue Jun 13 18:55:56 2023] speed: 1000
[Tue Jun 13 18:55:56 2023] autonegotiation: yes
[Tue Jun 13 18:55:56 2023] duplex mode: full
[Tue Jun 13 18:55:56 2023] flowctrl: symmetric
[Tue Jun 13 18:55:56 2023] role: master
[Tue Jun 13 18:55:56 2023] tcp offload: enabled
[Tue Jun 13 18:55:56 2023] scatter-gather: enabled
[Tue Jun 13 18:55:56 2023] tx-checksum: enabled
[Tue Jun 13 18:55:56 2023] rx-checksum: enabled
[Tue Jun 13 18:55:56 2023] rx-polling: enabled
[Tue Jun 13 18:58:30 2023] eth1: network connection down
[Tue Jun 13 18:58:39 2023] eth1: network connection up using port A
[Tue Jun 13 18:58:39 2023] interrupt src: MSI
[Tue Jun 13 18:58:39 2023] speed: 1000
[Tue Jun 13 18:58:39 2023] autonegotiation: yes
[Tue Jun 13 18:58:39 2023] duplex mode: full
[Tue Jun 13 18:58:39 2023] flowctrl: symmetric
[Tue Jun 13 18:58:39 2023] role: master
[Tue Jun 13 18:58:39 2023] tcp offload: enabled
[Tue Jun 13 18:58:39 2023] scatter-gather: enabled
[Tue Jun 13 18:58:39 2023] tx-checksum: enabled
[Tue Jun 13 18:58:39 2023] rx-checksum: enabled
[Tue Jun 13 18:58:39 2023] rx-polling: enabled
[Tue Jun 13 19:03:51 2023] eth1: network connection down
[Tue Jun 13 19:04:50 2023] eth1: network connection up using port A
[Tue Jun 13 19:04:50 2023] interrupt src: MSI
[Tue Jun 13 19:04:50 2023] speed: 1000
[Tue Jun 13 19:04:50 2023] autonegotiation: yes
[Tue Jun 13 19:04:50 2023] duplex mode: full
[Tue Jun 13 19:04:50 2023] flowctrl: symmetric
[Tue Jun 13 19:04:50 2023] role: slave
[Tue Jun 13 19:04:50 2023] tcp offload: enabled
[Tue Jun 13 19:04:50 2023] scatter-gather: enabled
[Tue Jun 13 19:04:50 2023] tx-checksum: enabled
[Tue Jun 13 19:04:50 2023] rx-checksum: enabled
[Tue Jun 13 19:04:50 2023] rx-polling: enabled
[Tue Jun 13 19:40:18 2023] eth1: network connection down
[Tue Jun 13 19:40:20 2023] eth1: network connection up using port A
[Tue Jun 13 19:40:20 2023] interrupt src: MSI
[Tue Jun 13 19:40:20 2023] speed: 10
[Tue Jun 13 19:40:20 2023] autonegotiation: yes
[Tue Jun 13 19:40:20 2023] duplex mode: full
[Tue Jun 13 19:40:20 2023] flowctrl: none
[Tue Jun 13 19:40:20 2023] tcp offload: enabled
[Tue Jun 13 19:40:20 2023] scatter-gather: enabled
[Tue Jun 13 19:40:20 2023] tx-checksum: enabled
[Tue Jun 13 19:40:20 2023] rx-checksum: enabled
[Tue Jun 13 19:40:20 2023] rx-polling: enabled
[Tue Jun 13 19:40:23 2023] eth1: network connection down
[Tue Jun 13 19:40:26 2023] eth1: network connection up using port A
[Tue Jun 13 19:40:26 2023] interrupt src: MSI
[Tue Jun 13 19:40:26 2023] speed: 1000
[Tue Jun 13 19:40:26 2023] autonegotiation: yes
[Tue Jun 13 19:40:26 2023] duplex mode: full
[Tue Jun 13 19:40:26 2023] flowctrl: symmetric
[Tue Jun 13 19:40:26 2023] role: master
[Tue Jun 13 19:40:26 2023] tcp offload: enabled
[Tue Jun 13 19:40:26 2023] scatter-gather: enabled
[Tue Jun 13 19:40:26 2023] tx-checksum: enabled
[Tue Jun 13 19:40:26 2023] rx-checksum: enabled
[Tue Jun 13 19:40:26 2023] rx-polling: enabled
[Tue Jun 13 19:41:20 2023] eth1: network connection down
[Tue Jun 13 19:41:34 2023] eth1: network connection up using port A
[Tue Jun 13 19:41:34 2023] interrupt src: MSI
[Tue Jun 13 19:41:34 2023] speed: 1000
[Tue Jun 13 19:41:34 2023] autonegotiation: yes
[Tue Jun 13 19:41:34 2023] duplex mode: full
[Tue Jun 13 19:41:34 2023] flowctrl: symmetric
[Tue Jun 13 19:41:34 2023] role: master
[Tue Jun 13 19:41:34 2023] tcp offload: enabled
[Tue Jun 13 19:41:34 2023] scatter-gather: enabled
[Tue Jun 13 19:41:34 2023] tx-checksum: enabled
[Tue Jun 13 19:41:34 2023] rx-checksum: enabled
[Tue Jun 13 19:41:34 2023] rx-polling: enabled
[Tue Jun 13 20:30:37 2023] eth1: network connection down
[Tue Jun 13 20:31:35 2023] eth1: network connection up using port A
[Tue Jun 13 20:31:35 2023] interrupt src: MSI
[Tue Jun 13 20:31:35 2023] speed: 1000
[Tue Jun 13 20:31:35 2023] autonegotiation: yes
[Tue Jun 13 20:31:35 2023] duplex mode: full
[Tue Jun 13 20:31:35 2023] flowctrl: symmetric
[Tue Jun 13 20:31:35 2023] role: master
[Tue Jun 13 20:31:35 2023] tcp offload: enabled
[Tue Jun 13 20:31:35 2023] scatter-gather: enabled
[Tue Jun 13 20:31:35 2023] tx-checksum: enabled
[Tue Jun 13 20:31:35 2023] rx-checksum: enabled
[Tue Jun 13 20:31:35 2023] rx-polling: enabled
[Tue Jun 13 23:34:51 2023] eth1: network connection down
[Tue Jun 13 23:34:54 2023] eth1: network connection up using port A
[Tue Jun 13 23:34:54 2023] interrupt src: MSI
[Tue Jun 13 23:34:54 2023] speed: 10
[Tue Jun 13 23:34:54 2023] autonegotiation: yes
[Tue Jun 13 23:34:54 2023] duplex mode: full
[Tue Jun 13 23:34:54 2023] flowctrl: none
[Tue Jun 13 23:34:54 2023] tcp offload: enabled
[Tue Jun 13 23:34:54 2023] scatter-gather: enabled
[Tue Jun 13 23:34:54 2023] tx-checksum: enabled
[Tue Jun 13 23:34:54 2023] rx-checksum: enabled
[Tue Jun 13 23:34:54 2023] rx-polling: enabled
[Tue Jun 13 23:36:26 2023] eth1: network connection down
[Wed Jun 14 19:03:49 2023] eth1: network connection up using port A
[Wed Jun 14 19:03:49 2023] interrupt src: MSI
[Wed Jun 14 19:03:49 2023] speed: 100
[Wed Jun 14 19:03:49 2023] autonegotiation: yes
[Wed Jun 14 19:03:49 2023] duplex mode: full
[Wed Jun 14 19:03:49 2023] flowctrl: none
[Wed Jun 14 19:03:49 2023] tcp offload: enabled
[Wed Jun 14 19:03:49 2023] scatter-gather: enabled
[Wed Jun 14 19:03:49 2023] tx-checksum: enabled
[Wed Jun 14 19:03:49 2023] rx-checksum: enabled
[Wed Jun 14 19:03:49 2023] rx-polling: enabled
[Wed Jun 14 19:04:48 2023] eth1: network connection down
[Wed Jun 14 19:04:51 2023] eth1: network connection up using port A
[Wed Jun 14 19:04:51 2023] interrupt src: MSI
[Wed Jun 14 19:04:51 2023] speed: 1000
[Wed Jun 14 19:04:51 2023] autonegotiation: yes
[Wed Jun 14 19:04:51 2023] duplex mode: full
[Wed Jun 14 19:04:51 2023] flowctrl: symmetric
[Wed Jun 14 19:04:51 2023] role: slave
[Wed Jun 14 19:04:51 2023] tcp offload: enabled
[Wed Jun 14 19:04:51 2023] scatter-gather: enabled
[Wed Jun 14 19:04:51 2023] tx-checksum: enabled
[Wed Jun 14 19:04:51 2023] rx-checksum: enabled
[Wed Jun 14 19:04:51 2023] rx-polling: enabled
[Wed Jun 14 
19:53:22 2023] eth0: network connection down [Wed Jun 14 19:53:32 2023] eth0: network connection up using port A [Wed Jun 14 19:53:32 2023] interrupt src: MSI [Wed Jun 14 19:53:32 2023] speed: 1000 [Wed Jun 14 19:53:32 2023] autonegotiation: yes [Wed Jun 14 19:53:32 2023] duplex mode: full [Wed Jun 14 19:53:32 2023] flowctrl: symmetric [Wed Jun 14 19:53:32 2023] role: slave [Wed Jun 14 19:53:32 2023] tcp offload: enabled [Wed Jun 14 19:53:32 2023] scatter-gather: enabled [Wed Jun 14 19:53:32 2023] tx-checksum: enabled [Wed Jun 14 19:53:32 2023] rx-checksum: enabled [Wed Jun 14 19:53:32 2023] rx-polling: enabled
As you can see, eth1 keeps going up and down. However, that's the "good" port, the one connected to my desktop. Fair enough, I switched the computer off at night and maybe in the afternoon when I went out, but that accounts for only two drops, not this many.
Uhm...
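To make the pattern easier to see, here is a rough Python sketch that counts the link drops per interface and lists the speed negotiated on each link-up. It assumes the log text uses the bracketed timestamp format pasted above; the function name `summarize_link_events` and the `ENTRY` pattern are made up for illustration and are not part of any ReadyNAS tool.

```python
import re
from collections import Counter

# One "[timestamp] message" entry. Works whether the entries sit on
# separate lines or run together on one line (assumed format, taken
# from the paste above).
ENTRY = re.compile(
    r"\[([A-Z][a-z]{2} [A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \d{4})\]\s*([^\[]*)"
)

def summarize_link_events(text):
    """Return (down_counts, up_speeds) for the given log text.

    down_counts: Counter mapping interface name -> number of link drops.
    up_speeds:   list of (timestamp, interface, speed) for each link-up
                 that is followed by a "speed:" entry.
    """
    down_counts = Counter()
    up_speeds = []
    pending = None  # (timestamp, iface) of a link-up still waiting for its speed
    for stamp, msg in ENTRY.findall(text):
        msg = msg.strip()
        link = re.match(r"(eth\d+): network connection (down|up)", msg)
        if link:
            iface, state = link.groups()
            if state == "down":
                down_counts[iface] += 1
                pending = None
            else:
                pending = (stamp, iface)
        elif msg.startswith("speed:") and pending:
            up_speeds.append((pending[0], pending[1], int(msg.split(":")[1])))
            pending = None
    return down_counts, up_speeds

if __name__ == "__main__":
    sample = (
        "[Tue Jun 13 16:14:02 2023] eth1: network connection down "
        "[Tue Jun 13 16:14:05 2023] eth1: network connection up using port A "
        "[Tue Jun 13 16:14:05 2023] interrupt src: MSI "
        "[Tue Jun 13 16:14:05 2023] speed: 10 "
        "[Tue Jun 13 18:08:49 2023] eth1: network connection down "
        "[Tue Jun 13 18:08:54 2023] eth1: network connection up using port A "
        "[Tue Jun 13 18:08:54 2023] speed: 1000"
    )
    downs, speeds = summarize_link_events(sample)
    print(dict(downs))
    for stamp, iface, speed in speeds:
        print(stamp, iface, speed)
```

Run against the full paste, this should make it easy to compare the link-ups at speed 10 or 100 with the ones at 1000, and to check them against the times the desktop was actually switched off.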
- StephenB Jun 14, 2023 Guru
Interesting that eth1 is also going down.
I am wondering if there is anything else in the logs around Jun 14 19:53:22. I'm thinking that you should check kernel.log, system.log, and systemd-journal.log.
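One way to narrow that search is to pull out only the entries within a few minutes of the eth0 drop. The sketch below is a minimal Python example that assumes the bracketed timestamp format shown in the paste above and an English locale for the day and month names. kernel.log, system.log, and systemd-journal.log may well use a different timestamp layout, in which case the pattern and the `strptime` format would need adjusting. The helper name `entries_near` is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed entry format: "[Wed Jun 14 19:53:22 2023] message"
ENTRY = re.compile(
    r"\[([A-Z][a-z]{2} [A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \d{4})\]\s*([^\[]*)"
)
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"

def entries_near(text, when, minutes=5):
    """Return (timestamp, message) pairs within +/- `minutes` of `when`."""
    center = datetime.strptime(when, STAMP_FORMAT)
    window = timedelta(minutes=minutes)
    hits = []
    for stamp, msg in ENTRY.findall(text):
        moment = datetime.strptime(stamp, STAMP_FORMAT)
        if abs(moment - center) <= window:
            hits.append((stamp, msg.strip()))
    return hits

if __name__ == "__main__":
    sample = (
        "[Wed Jun 14 19:04:51 2023] rx-polling: enabled "
        "[Wed Jun 14 19:53:22 2023] eth0: network connection down "
        "[Wed Jun 14 19:53:32 2023] eth0: network connection up using port A"
    )
    for stamp, msg in entries_near(sample, "Wed Jun 14 19:53:22 2023", minutes=1):
        print(stamp, msg)
```

The same function can be pointed at the contents of each log file in turn, as long as the timestamps match the assumed format.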