ReadyNAS Pro 6 crashed again

tony359
Apprentice
Jun 09, 2023
Considering it’s happening very rarely it’s not such a bad idea.
I’ve run overnights of ram tests but maybe it didn’t catch it because it happens very rarely.
I still have the original CPU so I could try that too.

That said, the fact that just the network went down last time is suspicious. A ram or cpu issue would have a much bigger impact, I reckon. I might want to put a switch in between the nas and the main switch. It’s always been that switch and maybe it’s faulty. After all, the nas stopped crashing when I took it off the main network - which takes the main switch out of the equation.
And it worked for a while while connected to my main desktop, again no main switch involved.

Uhm… I like this idea 🙂
KDS
Guide
Jun 09, 2023
I have my router dishing out DHCP addresses>>>Unmanaged 2.5G switch>>>both NICs into switch.
Static IPs on both Netgear NIC settings (IPv4) and router address.
Router set to static IP addresses for both NICs.
Since doing that both NICs are very stable.
Ram is 2 x 2GB PC800.
CPU is now E7600, Just upgraded from E5300, find this much faster than the Q6600, though my NAS is mainly used for backup, and file server, not really serving any Apps. E7600 runs faster and much cooler than Q6600.
StephenB
Guru - Experienced User
Jun 09, 2023
tony359 wrote:
I might want to put a switch in between the nas and the main switch. 🙂

Makes sense. You could also swap the two connections, and see if the problem moves.
Sandshark
Sensei
Jun 09, 2023
Is it a "green" switch? I've had a couple issues with ReadyNAS and green switches, though I've believed the problem units already had partly damaged LAN ports. My main switch has a "green" on/off selection. Try turning off power saving mode if yours does. Otherwise, a non-green switch in between might be the answer.
tony359
Apprentice
Jun 09, 2023
It's a Netgear! 🙂

GS108Ev2. "partly" managed. V1.00.12 (latest). DHCP disabled. DHCP is handled by the router (Fritzbox) which issues the same IP to the NAS MAC address. All settings are default to be honest.
I have tried a static IP in the past with no change - though, I'm confident those swollen capacitors might have contributed to SOME of the issues I was having.

Today's new issue is... the NAS is online, I can see the files. I can SSH into it. But the web interface shows a "500 - internal server error". This is on both ports. Sigh 🙂
Before I just reboot the box, how would I restart the web interface from SSH?

I'll install a dumb switch between the NAS and the main switch - with new cables.

The 7600 seems to be a good option. It's only 2 cores but it's faster than the cores in the 6600. I wonder how much a NAS used as a "file system" is actually using a multi-core CPU. And the 7600 as you say is cooler.

I think I'll fix this issue first then I might try the 7600 as well, thanks for the hint!

tony359
Apprentice
Jun 09, 2023
I feel that the below is relevant to my issue. Again, the NAS is accessible, I can write a file on the data folder via nano. I just lost the web interface.

These weird failures are incredibly annoying. I'd like to test what itachi2 recommended, can someone possibly point me in the right direction? See https://community.netgear.com/t5/Using-your-ReadyNAS-in-Business/ReadyNAS-Pro-6-crashed-again/m-p/2316638/highlight/true#M199640

Thanks 🙂

root@Enterprise-NAS:/# systemctl status apache2
Failed to get properties: Activation of org.freedesktop.systemd1 timed out
root@Enterprise-NAS:/#
root@Enterprise-NAS:/#
root@Enterprise-NAS:/#
root@Enterprise-NAS:/#
root@Enterprise-NAS:/# systemctl restart apache2
Failed to restart apache2.service: Activation of org.freedesktop.systemd1 timed out
See system logs and 'systemctl status apache2.service' for details.
root@Enterprise-NAS:/# sudo systemctl status apache2
-bash: sudo: command not found
root@Enterprise-NAS:/# su
root@Enterprise-NAS:/# systemctl status apache2
Failed to get properties: Activation of org.freedesktop.systemd1 timed out
root@Enterprise-NAS:/# systemctl status readynasd
Failed to get properties: Activation of org.freedesktop.systemd1 timed out
root@Enterprise-NAS:/# ps aux | grep readynasd
root     26625  0.0  0.0  17836  1008 pts/2    S+   19:58   0:00 grep readynasd
root@Enterprise-NAS:/# service ctscand stop
Failed to stop ctscand.service: Connection timed out
See system logs and 'systemctl status ctscand.service' for details.
Failed to get load state of ctscand.service: Connection timed out
root@Enterprise-NAS:/# systemctl restart readynasd
Failed to restart readynasd.service: Activation of org.freedesktop.systemd1 timed out
See system logs and 'systemctl status readynasd.service' for details.
root@Enterprise-NAS:/# systemctl status readynasd.service
Failed to get properties: Activation of org.freedesktop.systemd1 timed out

KDS
Guide
Jun 09, 2023
Just another hardware thing that has probably already happened.
1. After good PSU installed was CMOS cleared?
2. Has CMOS battery been checked?
3. Are you keeping it simple with just 1 HDD, possibly 2 (raid 1), with HDDs especially raid arrays cleaned and cleared on another PC prior to installing. Granted you may have data on your system, though remove those HDDs and start fresh, with known clean and good drives? I tested with some old 320GB junk drives I had kicking about. I also encountered NIC, web access, and HDD problems prior to replacing the PSU. My original 7200 WD HDDs were only seen as 5900, then when I added a newer 7200 WD HDD it was seen as 7200, it did not like the mismatch in HDD speed that it saw.
Though finally did clean HDDs. I think Web interface may be associated with what is already on the HDDs.
My HDD and hardware issues were resolved when I replaced the PSU. Both types of drives, 3 x 7200 seen as 5900 and 3 x 7200 seen as 7200, are running together fine.
4. BTW are you using RAIDar 6.5.0.
tony359
Apprentice
Jun 09, 2023
Just another hardware thing that has probably already happened.
1. After good PSU installed was CMOS cleared?
--
No, I did not update the BIOS so I didn't think of clearing the CMOS. I can try.

2. Has CMOS battery been checked?
--
No. Good point.

3. Are you keeping it simple with just 1 HDD, possibly 2 (raid 1), with HDDs especially raid arrays cleaned and cleared
--
No. Reason is: last time the system behaved, it lasted for 2 months. I cannot stay without my data for 2 months.
The only two options here are
a. Fix it with the current setup
b. try a factory default and migrate a backup

Testing with 2 random HDDs is unlikely to yield any evidence, I'm afraid.

I also encountered NIC, web access, and HDD problems prior to replacing the PSU. My original 7200 WD HDDs were only seen as 5900, then when I added a newer 7200 WD HDD it was seen as 7200, it did not like the mismatch in HDD speed that it saw.
--
Unfortunately the replacement PSU did not solve all the problems. I'm confident some of the issues I experienced were caused by the bad PSU but the NAS is still misbehaving I'm afraid.
All my HDDs are WD RED, 5400-ish (4TB are a bit slower than the 6TB).

4. BTW are you using RAIDar 6.5.0.
--
No. I am on OS6.

I appreciate a factory reset would be a good idea but I have 13TB on that NAS and I don't know where to store them for a backup. Yes, the NAS is more or less fully backed up (locally and online) but it would take me forever to restore those backups so I'd consider that as an emergency option only.
I could see if I could hire another NAS, transfer the data, reset and restore. But somehow I am not confident my problems would go away 🙂

Thanks for your input!
tony359
Apprentice
Jun 10, 2023
Little update.

I checked the battery, it's ok, 3.1V. I replaced it some time ago when I serviced the box.
I re-reset the BIOS (only thing I change is the default fan speed!)
I swapped position of HDD0 with HDD4. I sprayed dry contact cleaner on the backplane and on the HDDs, cleaned with a small q-tip.
Once the NAS was powered up again, HDD0 failed to show up on the BIOS splash page straight away. So it's not the HDD and, to be honest, I feel that that might be a red herring. I never had issues with HDD0 so maybe it's a BIOS bug which then does not affect the software. No idea. But I now know it's not the drive.

I've added a TP-Link switch between the main switch and the NAS.

Next: throwing the NAS out of the window.
tony359
Apprentice
Jun 11, 2023
And no, the NAS disappeared again.
Solution: SSH into other port and ifconfig the other port DOWN and then UP again.
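
For anyone wanting to script that workaround, here is a minimal sketch. It assumes the stuck port is eth0 and that you are logged in through the other port (bouncing the interface your SSH session uses will drop the session). The bounce_iface name and the RUN variable are made up for illustration; by default it only prints the commands.

```shell
# Dry-run helper: prints the commands unless RUN is set to an empty string.
bounce_iface() {
    iface="${1:?usage: bounce_iface IFACE}"
    run="${RUN-echo}"              # RUN unset -> echo (dry run); RUN= -> execute
    $run ifconfig "$iface" down    # take the stuck interface down
    $run sleep 2                   # brief pause (2 s is an arbitrary choice)
    $run ifconfig "$iface" up      # bring it back up
}
```

Calling bounce_iface eth0 prints the three commands; RUN= bounce_iface eth0 (as root) actually runs them.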

I could try swapping the config but I think I tried that in the past already.

If someone could give me some directions for checking the HDDs offline as mentioned above, that would be great! 🙂

Thanks
StephenB
Guru - Experienced User
Jun 11, 2023
tony359 wrote:

And no, the NAS disappeared again.

Solution: SSH into other port and ifconfig the other port DOWN and then UP again.

Have you tried swapping the NIC ports?
tony359
Apprentice
Jun 11, 2023
That's what I meant with "swapping the config" sorry. As in swap the IP addresses between ports.

I'll try but I think I tried that in the past already. 100% worth a try.
StephenB
Guru - Experienced User
Jun 12, 2023
tony359 wrote:

That's what I meant with "swapping the config" sorry. As in swap the IP addresses between ports.

I meant connecting the ethernet going to the PC to the switch, and vice versa. Then seeing if the problem was limited to NIC 1.
tony359
Apprentice
Jun 12, 2023
The NICs are on two different IP ranges - one main network, one PC only.

What used to be on main network is now directly connected to the PC and what used to be connected to the PC is now connected to the main network and I've swapped the IP addresses accordingly.

I did that yesterday and I've just checked: NAS has disappeared. Sigh!
I SSH'd through the other NIC, restarted it and it worked as usual.

So
- It's not the specific NIC
- It's not the switch

It's curious that it's always the NIC on the main network failing and not the other.

Help 🙂
schumaku
Guru - Experienced User
Jun 12, 2023
Since you are in the lucky situation of having an alternate LAN interface (and IP subnet) available, what does the kernel output show when the device has "disappeared"?

# dmesg
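
The full dmesg output is long, so it may help to narrow it down to network-related lines first. A minimal sketch, assuming the interfaces are named eth0 and eth1; the nic_events name and the grep pattern are guesses at what is interesting, not anything ReadyNAS-specific.

```shell
# Filter kernel log text (read from stdin) down to lines that mention an
# ethX interface, a link state change, or carrier loss.
nic_events() {
    grep -iE 'eth[0-9]+|link is (up|down)|carrier'
}

# On the NAS (as root):  dmesg | nic_events | tail -n 50
```

Keeping the filter as a function that reads stdin means it also works on a saved copy of the log.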

The risk that a network adapter becomes flaky is very small. More typically, the adapter (or rather the data connectivity) disappears completely, and the UPnP OS no longer detects the adapter.

Most problems on such NASes are caused by RAID becoming inoperable, due to aged or breaking storage blocks.

Do you have a known working, reliable SATA storage block at hand to set up the NAS with a single-device volume, or two on a RAID 1 volume? Remove the potentially unhealthy storage blocks, and restart a test from scratch.
tony359
Apprentice
Jun 12, 2023
schumaku

Thanks, I'll test next time.
Many have (rightly) recommended a test with a couple of random HDDs. I have plenty so that wouldn't be an issue.

My concern is that sometimes the NAS stays online for weeks without issues and I really cannot keep my data offline for so long.

Is there a way to do an offline test of my drives? Someone recommended booting from a Debian Live-USB but I would need some minor guidance on that. I know how to make the USB, I'm just making sure (as much as possible) I don't do anything that can destroy my data.

Thanks! 🙂
schumaku
Guru - Experienced User
Jun 12, 2023
Start with retrieving the SMART data from the storage block (aka. disk). Next trigger a full SMART check (rapid, then full) of the storage block. Then retrieve the SMART data again.

You can do this on any platform, without erasing, repartitioning or reformatting the storage block - if done carefully of course.
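
Spelled out with smartmontools, that sequence might look like the sketch below. The smart_plan name and the output file names are made up for illustration; the function only prints the commands, so nothing runs until you paste them yourself. SMART self-tests are read-only as far as user data goes.

```shell
# Print a SMART check plan for one disk; does not execute anything.
smart_plan() {
    dev="${1:?usage: smart_plan /dev/sdX}"
    cat <<EOF
smartctl -x $dev > smart-before.txt   # 1. save the current SMART data
smartctl -t short $dev                # 2. rapid test; wait until it has finished
smartctl -t long $dev                 # 3. full test; can take many hours
smartctl -l selftest $dev             # 4. check the self-test results
smartctl -x $dev > smart-after.txt    # 5. save SMART data again and compare
EOF
}
```

For example, smart_plan /dev/sda prints the five steps for the first disk.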
tony359
Apprentice
Jun 12, 2023
Thanks.
I’ll Google how to do that. 🙂

Just to double check: do you mean doing those checks on the NAS itself while it’s online?
StephenB
Guru - Experienced User
Jun 13, 2023
tony359 wrote:

Is there a way to do an offline test of my drives? 🙂

There is an on-line test in the maintenance menu you can use. That runs the full built-in smart test on all the drives in the volume.

You can also use smartctl -x /dev/sda from ssh to see more errors (UNCs in particular) on sda (or whatever disk you wish).

As far as off-line goes, the simplest way is to connect the drive to a Windows PC and run the vendor diag - Dashboard for WDC, and Seatools for Seagate. Unfortunately they don't run on macOS.

But it seems to me that your symptoms are pointing either to the switch or perhaps the cable going from the NAS to the switch. It's always the NIC port connected to that switch that fails, and the other NIC always continues to work fine.

tony359
Apprentice
Jun 13, 2023
Hi Stephen,

No, the ports were swapped last time - also the switch and the cable. So it's not a NIC or Network issue. Well. It ALWAYS fails on that NETWORK so it could be something on my main network. But on this occasion the NAS was wired to the main switch on another port and through an additional switch. So if it's something with that network, it's not a HW issue.

The online maintenance runs periodically. The logs show an "offline" test though. How should I read that? The drive is now 51888hrs.

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      90%     50227         -
# 2  Extended offline    Completed without error       00%     48081         -
# 3  Extended offline    Completed without error       00%     45875         -
# 4  Extended offline    Completed without error       00%     43691         -
# 5  Extended offline    Completed without error       00%     41536         -
# 6  Extended offline    Completed without error       00%     39834         -
# 7  Extended offline    Completed without error       00%     37636         -
# 8  Extended offline    Completed without error       00%     35455         -
# 9  Extended offline    Completed without error       00%     33273         -
#10  Extended offline    Completed without error       00%     31118         -
#11  Extended offline    Completed without error       00%     28912         -
#12  Extended offline    Completed without error       00%     26707         -
#13  Extended offline    Completed without error       00%     24525         -
#14  Extended offline    Completed without error       00%     22554         -
#15  Extended offline    Completed without error       00%     20712         -
#16  Extended offline    Completed without error       00%     19182         -
#17  Short offline       Completed without error       00%        82         -
#18  Short offline       Completed without error       00%        63         -
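
One way to read a table like this is to filter out every test that completed cleanly and look at what is left. A minimal sketch, assuming the log is piped in from smartctl -l selftest; the selftest_problems name is made up.

```shell
# Read a SMART self-test log on stdin and print only the entries that did
# NOT finish with "Completed without error".
selftest_problems() {
    awk '/^# *[0-9]+ / && !/Completed without error/'
}

# On the NAS:  smartctl -l selftest /dev/sda | selftest_problems
```

Run against the table above, it leaves only entry # 1, the test interrupted by a host reset at 50227 hours.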

I ran smartctl -x in the past and posted the output here earlier on this thread. I didn't spot anything but I am not an expert. There are UNC errors on SDA (which I now moved to SDE) but at 7872 hours, a few years ago! 🙂

Error 159 [14] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 4b 2b cc 40 40 00  Error: WP at LBA = 0x4b2bcc40 = 1261161536

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 04 00 00 08 00 00 4b 2b c8 40 40 08     14:59:14.849  WRITE FPDMA QUEUED
  60 04 00 00 00 00 00 4b 2b cc 40 40 08     14:59:14.849  READ FPDMA QUEUED
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     14:59:14.849  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     14:59:14.849  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     14:59:14.849  IDENTIFY DEVICE

Error 158 [13] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 4b 2b cc 40 40 00  Error: UNC at LBA = 0x4b2bcc40 = 1261161536

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 08 00 00 4b 2b cc 40 40 08     14:59:11.031  READ FPDMA QUEUED
  61 04 00 00 00 00 00 4b 2b c8 40 40 08     14:59:11.031  WRITE FPDMA QUEUED
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     14:59:11.031  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     14:59:11.031  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     14:59:11.030  IDENTIFY DEVICE

Error 157 [12] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 4b 2b cc 40 40 00  Error: WP at LBA = 0x4b2bcc40 = 1261161536

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 04 00 00 08 00 00 4b 2b c8 40 40 08     14:59:07.223  WRITE FPDMA QUEUED
  60 04 00 00 00 00 00 4b 2b cc 40 40 08     14:59:07.223  READ FPDMA QUEUED
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     14:59:07.223  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     14:59:07.223  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     14:59:07.223  IDENTIFY DEVICE

Error 156 [11] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 4b 2b cc 40 40 00  Error: UNC at LBA = 0x4b2bcc40 = 1261161536

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 08 00 00 4b 2b cc 40 40 08     14:59:03.405  READ FPDMA QUEUED
  61 04 00 00 00 00 00 4b 2b c8 40 40 08     14:59:03.405  WRITE FPDMA QUEUED
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     14:59:03.405  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     14:59:03.405  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     14:59:03.405  IDENTIFY DEVICE

Error 155 [10] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 4b 2b cc 40 40 00  Error: WP at LBA = 0x4b2bcc40 = 1261161536

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 04 00 00 08 00 00 4b 2b c8 40 40 08     14:58:59.720  WRITE FPDMA QUEUED
  60 04 00 00 00 00 00 4b 2b cc 40 40 08     14:58:59.720  READ FPDMA QUEUED
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     14:58:59.720  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     14:58:59.720  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     14:58:59.719  IDENTIFY DEVICE

Error 154 [9] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 4b 2b cc 40 40 00  Error: UNC at LBA = 0x4b2bcc40 = 1261161536

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 08 00 00 4b 2b cc 40 40 08     14:58:55.900  READ FPDMA QUEUED
  61 04 00 00 00 00 00 4b 2b c8 40 40 08     14:58:55.900  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 e0 08     14:58:55.873  FLUSH CACHE EXT
  60 00 08 00 08 00 00 00 7f 22 18 40 08     14:58:55.838  READ FPDMA QUEUED
  61 00 02 00 00 00 00 00 00 00 48 40 08     14:58:55.838  WRITE FPDMA QUEUED

Error 153 [8] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 4b 2b c8 40 40 00  Error: UNC at LBA = 0x4b2bc840 = 1261160512

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 00 00 00 4b 2b c8 40 40 08     14:58:52.283  READ FPDMA QUEUED
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     14:58:52.283  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     14:58:52.283  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     14:58:52.282  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08     14:58:52.282  SET FEATURES [Set transfer mode]

Error 152 [7] occurred at disk power-on lifetime: 7872 hours (328 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 4b 2b c8 40 40 00  Error: UNC at LBA = 0x4b2bc840 = 1261160512

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 04 00 00 00 00 00 4b 2b c8 40 40 08     14:58:48.786  READ FPDMA QUEUED
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     14:58:48.786  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     14:58:48.786  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     14:58:48.786  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08     14:58:48.786  SET FEATURES [Set transfer mode]
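
To get an overview of a log like this, the UNC entries can be tallied per LBA. A minimal sketch, assuming the text comes from smartctl -x on stdin; the unc_summary name is made up.

```shell
# Count "Error: UNC" entries per LBA. The decimal LBA is the last field
# on each matching line. Output: "<LBA> <count>", sorted by LBA.
unc_summary() {
    awk '/Error: UNC at LBA/ { count[$NF]++ }
         END { for (lba in count) print lba, count[lba] }' | sort -n
}

# On the NAS:  smartctl -x /dev/sda | unc_summary
```

For the eight entries shown above, that gives just two LBAs (1261160512 and 1261161536), so the old errors sit close together rather than being scattered across the disk.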

I am on Windows so that's fine, but wouldn't it be better to run the tests on a Linux system so the file system can be checked as well? Also, I think I'd prefer the disks to stay unmounted so I know I have less chance of damaging the RAID.

Can I start the NAS from a Debian live-USB? I could run the checks from there, assuming VGA works there. And what do you think of that suggestion of running btrfs-check on the drives? I don't dislike the idea of checking the file system.

schumaku

The NAS disappeared again so I've run dmesg and it's attached (this forum lacks the ability to attach text files!).

Do I see lots of network going down messages after what seems to be a gap? And both ETH0 and ETH1.

Disabling and re-enabling ETH0 worked as usual.

And yes, I've now disabled IPv6 (it got re-enabled when I swapped the IPs I think)

Sandshark
Sensei
Jun 13, 2023
Yes, you can start a legacy NAS from a Debian Live USB (or even DOS or Windows). Native OS6 models are more picky about what they will start from.
StephenB
Guru - Experienced User
Jun 13, 2023
tony359 wrote:

I am on Windows so that's fine, but wouldn't it be better to run the tests on a Linux system so the file system can be checked as well? Also, I think I'd prefer the disks to stay unmounted so I know I have less chance of damaging the RAID.

I don't think so. If you needed that, I'd do it in the NAS.

I really don't see how this can be the disks or the file system. If it were, the second NIC wouldn't be responsive when the problem occurs. Plus normal operation wouldn't resume when you set the interface down and then up again.

tony359 wrote:

No, the ports were swapped last time - also the switch and the cable. So it's not a NIC or Network issue. Well. It ALWAYS fails on that NETWORK so it could be something on my main network.

I think definitely a network issue, though perhaps not the physical layer. The puzzle is what.

Are you using the NAS differently on the main network than you are on the PC connection?

The history here is of course extensive, and I'm having trouble keeping everything straight. Did the NAS ever lock up when it was only connected to the main network (with the PC NIC disconnected)?

tony359 wrote:

The online maintenance runs periodically. The logs show an "offline" test though. How should I read that? The drive is now 51888hrs.

The "extended offline" record is actually the test you run from the maintenance settings. No idea why it is described as "offline" by smartctl.

You should also see it at the end of volume.log. It looks like the NAS crashed (or was shut down) before the most recent test finished.
tony359
Apprentice
Jun 13, 2023
>I don't think so. If you needed that, I'd do it in the NAS.

>I really don't see how this can be the disks or the file system. If it were, the second NIC wouldn't be responsive when the problem occurs. Plus normal operation wouldn't resume when you set the interface down and then up again.

I appreciate your view and I don't disagree with it.
But this has been going on for months and I've tried many things short of a new set of HDDs.

Before I start messing up with my data I'd like to exhaust all the options.

One of them is to do an offline check via Live-CD. As I am not super-skilled with Linux and I care about my data, can someone roughly guide me so I don't obliterate my data? 🙂

I guess I'll boot from a Live USB, the 5 RAID HDDs are not going to be mounted by default.
I can then run

btrfs check --readonly /dev/sd(x)

This should check the file system?

Then smartctl -t long /dev/sd(x)

Anything else anybody can think I should do while the HDDs are offline?

>I think definitely a network issue, though perhaps not the physical layer. The puzzle is what.
>Are you using the NAS differently on the main network than you are on the PC connection?
>The history here is of course extensive, and I'm having trouble keeping everything straight. Did the NAS ever lock up when it was only connected to the main network (with the PC NIC disconnected)?

The PC and the NAS are plugged on the same switch. There is nothing running on the NAS. I only use it as File System.
I appreciate the history is long and I thank you for bearing with me for so long and not suggesting I should go buy a Qnap 🙂

The second NIC connected to the PC is a recent addition as I discovered that when the NAS disappears I can still access it via the other NIC. The behaviour hasn't changed since I also plugged the PC directly into the NAS.

Months ago, the NAS stopped misbehaving when I completely disconnected it from ANY networks.
A week later I plugged it into the PC only (no main network, no internet)
Some weeks of good behaviour later, I put the NAS back on the main network, removing some port forwarding I had in the main router.

It worked PERFECTLY for 2 months.

Then it started disappearing twice a day. Out of the blue.

This is why I am pursuing unlikely routes: the above events point to NOTHING! 🙂
tony359
Apprentice
Jun 13, 2023
quick addendum:
I've made a live-USB of Debian, played with it and a random HDD which I formatted btrfs.
If anybody has any suggestions on what to test while offline, please do let me know!

Also, if someone has any suggestions on what NOT to do while playing with those HDD, please also do let me know!

StephenB
Guru - Experienced User
Jun 13, 2023
You'd need to assemble the RAID array before you can run btrfs check. The file system lives on the array, not on the individual disks, so pointing it at /dev/sd(x) won't work.

Since your system boots, you can just run ssh (logging in as root), and run btrfs check from there. The device would be /dev/md127 (the raid array virtual disk).

Use --force because the file system is mounted. It won't try to repair anything, so no need to worry about read-only. Don't write anything to the data volume while it is running.

root@RN102:~# btrfs check --force /dev/md127
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/md127
...

You can also run smartctl from ssh (/dev/sda, etc), so no need to use the live CD there either.

root@RN102:~# smartctl --test=long /dev/sda
smartctl 6.6 2017-11-05 r4594 [armv7l-linux-4.4.218.armada.1] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 127 minutes for test to complete.
Test will complete after Tue Jun 13 20:35:03 2023

Use smartctl -X to abort test.
root@RN102:~#
