Forum Discussion

Aspirant

May 24, 2022

RN214 goes Offline and NICs may be dead

Hello RN Forum, As you may remember from other posts (still on hiatus, sic), my current set up in Birmingham UK is made of: RN214a (4x WD40EFAX), FW 6.10.3, RAID 5 RN214b (3x WD80EFBX + 1x WD40EFR...

StephenB

Guru - Experienced User

May 24, 2022

berillio wrote:

Is it advisable to keep the double bonding on the RN424, given that the speed advantage is minimal?

Why (given that the speed advantage is minimal)?

berillio wrote:

I can switch off the RN214a, remove all disks (ordered & labelled), load all the disks from the faulty RN214b and power it up. The RN214a should read that full array. That should allow me to transfer all the data onto a currently empty WD10EFAX.

Correct. You can also migrate the disks to the RN424 (or in the other direction) - though the system will need to switch the OS from arm->x86 (or vice versa) when you do that.

berillio wrote:

The RN214a is currently using eth0. Is it advisable to switch to eth1?

I don't think it matters though it would do no harm.

berillio wrote:
The RN214s were purchased in April / May 2020 and the RN424 in April 2021

The hardware warranty is 3 years, so you could request an RMA for RN214b.

berillio

Aspirant

May 26, 2022

Thank you Stephen B;

“ Correct. You can also migrate the disks to the RN424 (or in the other direction) - though the system will need to switch the OS from arm->x86 (or vice versa) when you do that.”

I went for a simple array migration to the RN214a (which now calls itself RN214b, but using a different IP).

Unfortunately it showed the same problem as the previous “Unit b”. The file system came up, but very slowly. RAIDar showed the unit online, but the Admin & Browse icons did not appear for at least five minutes.

Then I started a full data copy (minus the snapshots) with TeraCopy to the WD10 target drive, but it never started because the target drive was too small by 76 GB. I only realised that 2 h later when I checked, and by then the unit was frozen. I unplugged it and restarted, only to see a CPU temperature of 71° and similarly high temperatures for the drives. OUCH.

I let it cool down for 2 or 3 h, then managed to transfer 78 GB of data before it hung. This morning I tried to copy one folder; data transfer speeds were ~104 MB/s, but then it froze 10 seconds before the end. This evening, the file system was up for a matter of seconds before hourglassing; the unit hung, although the temperatures were below 30° all around. (Incidentally, I moved the unit to a more “exposed” position and removed the side cheeks and top panel to allow more air in; the drives were also removed, left on the desk to cool down, and reinserted just before rebooting.)

Now I no longer know what to think.

Should I return the RN214a array to the “Unit A” and check if that is still functional?

Should I instead test the RN214a array in “Unit B” to see if that hardware is faulty as I assumed it was?

Should I presume that the 6.10.3 FW on the RN214b array has somehow become corrupted, upgrade it to (say) 6.10.4 and see if an uncorrupted firmware can read the existing file system?

Should I try the RN214b array in the RN424? Maybe more robust hardware (also with a much bigger fan) could read the file system. But that would imply a firmware change anyway (arm to x86_64), so it is basically the previous option plus a hardware advantage?

Thanks to everybody in advance

P.S. This is the content of diskinfo.log from the log download taken before moving the array to the RN214a unit:

Device: sda
Controller: 0
Channel: 0
Model: WDC WD80EFBX-68AZZN0
Serial: VRHBHJRK
Firmware: 85.00A85W
Class: SATA
RPM: 7200
Sectors: 15628053168
Pool: data
PoolType: RAID 5
PoolState: 1
PoolHostId: 1132353a
Health data
ATA Error Count: 0
Reallocated Sectors: 0
Reallocation Events: 0
Spin Retry Count: 0
Current Pending Sector Count: 0
Uncorrectable Sector Count: 0
Temperature: 41
Start/Stop Count: 19
Power-On Hours: 4764
Power Cycle Count: 19
Load Cycle Count: 215

Device: sdb
Controller: 0
Channel: 1
Model: WDC WD80EFBX-68AZZN0
Serial: VRHBMEDK
Firmware: 85.00A85W
Class: SATA
RPM: 7200
Sectors: 15628053168
Pool: data
PoolType: RAID 5
PoolState: 1
PoolHostId: 1132353a
Health data
ATA Error Count: 0
Reallocated Sectors: 0
Reallocation Events: 0
Spin Retry Count: 0
Current Pending Sector Count: 0
Uncorrectable Sector Count: 0
Temperature: 44
Start/Stop Count: 18
Power-On Hours: 4744
Power Cycle Count: 18
Load Cycle Count: 213

Device: sdc
Controller: 0
Channel: 2
Model: WDC WD80EFBX-68AZZN0
Serial: VRGR7MNK
Firmware: 85.00A85W
Class: SATA
RPM: 7200
Sectors: 15628053168
Pool: data
PoolType: RAID 5
PoolState: 1
PoolHostId: 1132353a
Health data
ATA Error Count: 0
Reallocated Sectors: 0
Reallocation Events: 0
Spin Retry Count: 0
Current Pending Sector Count: 0
Uncorrectable Sector Count: 0
Temperature: 43
Start/Stop Count: 17
Power-On Hours: 4615
Power Cycle Count: 17
Load Cycle Count: 207

Device: sdd
Controller: 0
Channel: 3
Model: WDC WD40EFRX-68N32N0
Serial: WD-WCC7K6YX6PYY
Firmware: 82.00A82W
Class: SATA
RPM: 5400
Sectors: 7814037168
Pool: data
PoolType: RAID 5
PoolState: 1
PoolHostId: 1132353a
Health data
ATA Error Count: 0
Reallocated Sectors: 0
Reallocation Events: 0
Spin Retry Count: 0
Current Pending Sector Count: 0
Uncorrectable Sector Count: 0
Temperature: 33
Start/Stop Count: 1158
Power-On Hours: 17642
Power Cycle Count: 78
Load Cycle Count: 1277
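
For anyone who wants to scan a diskinfo.log like the one above without reading it by eye, here is a minimal Python sketch. It assumes the log keeps the `Key: value` layout shown, with each drive starting at a `Device:` line. The function names and the 50 degree temperature threshold are illustrative assumptions, not anything NETGEAR defines.

```python
from typing import Dict, List

# Counters expected to be 0 on a healthy drive (names as they appear in the log).
ERROR_FIELDS = [
    "ATA Error Count",
    "Reallocated Sectors",
    "Reallocation Events",
    "Spin Retry Count",
    "Current Pending Sector Count",
    "Uncorrectable Sector Count",
]


def parse_diskinfo(text: str) -> List[Dict[str, str]]:
    """Split diskinfo.log text into one dict per 'Device:' block."""
    disks: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # blank lines and the 'Health data' heading
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Device":
            disks.append({})  # a new drive block starts here
        if disks:
            disks[-1][key] = value
    return disks


def flag_disk(disk: Dict[str, str], max_temp_c: int = 50) -> List[str]:
    """Return warnings for one parsed disk (threshold is an assumed value)."""
    warnings = []
    for field in ERROR_FIELDS:
        if int(disk.get(field, "0")) != 0:
            warnings.append(f"{field} = {disk[field]}")
    if int(disk.get("Temperature", "0")) > max_temp_c:
        warnings.append(f"Temperature {disk['Temperature']} > {max_temp_c}")
    return warnings
```

Calling `flag_disk` on each entry returned by `parse_diskinfo` gives an empty list for a drive whose error counters are all zero and whose temperature is at or below the chosen threshold.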

StephenB
Guru - Experienced User
May 27, 2022
berillio wrote:

I went for a simple array migration to the RN214a (which now calls itself RN214b, but using a different IP).

Unfortunately it showed the same problem as the previous “Unit b”. The file system came up, but very slowly. RAIDar showed the unit online, but the Admin & Browse icons did not appear for at least five minutes.

So the disks in RN214b cause the same problem when migrated to RN214a.

I would next try the RN214a disks in RN214b, and confirm that the problem doesn't occur in RN214b with RN214a's disks.

I'd also take a look at the OS partition fullness (not that likely to be the issue, but easy to check). Look in volume.log, and scroll down to the df -h section. /dev/md0 is the OS partition.

=== df -h ===
Filesystem      Size  Used Avail Use% Mounted on
udev             10M  4.0K   10M   1% /dev
/dev/md0        3.7G  633M  2.9G  18% /
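
The OS partition check described above can also be done with a short script once volume.log has been extracted from the log zip. This is only a sketch: it assumes every section of the log starts with a `=== name ===` header like the one shown, and `root_usage_percent` is a made-up helper name.

```python
from typing import Optional


def root_usage_percent(volume_log: str, device: str = "/dev/md0") -> Optional[int]:
    """Return the Use% of `device` mounted on / from the df -h section, or None."""
    in_df = False
    for line in volume_log.splitlines():
        stripped = line.strip()
        if stripped.startswith("==="):
            # Section headers look like '=== df -h ==='; track whether we are in df.
            in_df = stripped.strip("= ") == "df -h"
            continue
        if not in_df:
            continue
        fields = stripped.split()
        # Expected columns: Filesystem Size Used Avail Use% Mounted-on
        if len(fields) >= 6 and fields[0] == device and fields[5] == "/":
            return int(fields[4].rstrip("%"))
    return None
```

A value close to 100 would point at a full OS partition; `None` means the section or the device line was not found in the text that was passed in.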

Did you have any apps running in RN214b?

It might be worth asking a mod ( Marc_V or JeraldM ) to review the entire log zip of the problem system.
- Sandshark
  Sensei
  May 27, 2022
  While removing the sides may help with the CPU temperature, it may have the opposite effect on the drives because the air entering through the side doesn't go over the drives. Opening the door is typically a better approach.
  
  But those temperatures, assuming a reasonable room temperature, indicate there is a whole lot of activity going on (like a scrub) or the fan is not working properly or is blocked. Whatever that activity is could be bogging down the unit. If you have SSH access, see what top says is running and if anything is using a huge amount of CPU.
- berillio
  Aspirant
  May 27, 2022
  Thank you Stephen B;
  “Did you have any apps running in RN214b?”
  No, no apps. I had Plex on RN214a; I tried it, but it does nothing for me, so I will remove it.
  
  === df -h ===
  Filesystem      Size Used Avail Use% Mounted on
  udev             10M 4.0K   10M   1% /dev
  /dev/md0        3.7G 655M 2.9G 19% /
  tmpfs          1009M     0 1009M   0% /dev/shm
  tmpfs          1009M 1.5M 1008M   1% /run
  tmpfs           505M 3.2M 502M   1% /run/lock
  tmpfs          1009M     0 1009M   0% /sys/fs/cgroup
  /dev/md126       19T   12T 6.7T 64% /data
  /dev/md126       19T   12T 6.7T 64% /home
  /dev/md126       19T   12T 6.7T 64% /apps
  /dev/md126       19T   12T 6.7T 64% /var/ftp/214-B_Private
  tmpfs           4.0K     0 4.0K   0% /data/214-B_Private/snapshot
  /dev/md126       19T   12T 6.7T 64% /data/214-B_Private/snapshot/c_2021_10_31__01_00_25
  /dev/md126       19T   12T 6.7T 64% /data/214-B_Private/snapshot/c_2021_10_31__02_00_28
  Then a million snapshots.
  
  “I would next try the RN214a disks in RN214b, and confirm that the problem doesn't occur in RN214b with RN214a's disks”
  
  Did that. But Unit-b (I guess) failed to read the file system properly. RAIDar reported “Volume inactive”, and it thinks that the disks are 100% full. On the admin page the disks all appeared red instead of blue.
  I cannot remember how full it was, but my guess is ~70%, maybe 75%. You may remember my post “vertical expansion of the wrong Nas”: I expanded RN214b instead of RN214a exactly because RN214a did not need to be expanded.
  I switched off Unit B from the dashboard, and it DID switch off (it took 52 seconds); it might have been the first time I did not need to “pull the plug”.
  
  Many thanks
  - berillio
    Aspirant
    May 27, 2022
    Hello Sandshark, thank you for coming in.
    Thanks for the comment on the chassis side. What about the top cover?
    Re the fan, I mentioned in the first post that I tried the fan from a dead 104, and it did not seem to be the problem. I wasn’t sure whether the CONTROL of the fan was correct: changing the fan mode to Cool did not seem to have much of an effect, while changing to Balanced on the RN424 caused an immediate response (but that might have been because the 8 TB disks in the RN424 were actually warmer at that time).
    On a different occasion, when I saw the high temperatures, the fan was running at ~2700 rpm. I am NOT saying that the fan control is correct; it may kick in too late. I simply think that something else has gone wrong with how “Unit-b” reads the file system, since it also just misread the file system of Unit-A.
    Incidentally, the dashboard on the RN214b did not seem to be 100% either: clicking the refresh button had no effect on the temperatures and fan rpm, and refreshing the page had no effect either; the figures stayed identical.
    Thanks again
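
On Sandshark's suggestion to check over SSH what is using the CPU: besides watching `top` interactively, the output of `ps -eo pcpu,pid,comm` can be captured and sorted. The sketch below parses that text; `busiest_processes` is a hypothetical helper name. Note that `ps` reports CPU use averaged over each process's lifetime, so it is a rougher picture than the live view in `top`.

```python
from typing import List, Tuple


def busiest_processes(ps_output: str, limit: int = 5) -> List[Tuple[float, int, str]]:
    """Parse `ps -eo pcpu,pid,comm` output; return (cpu%, pid, command), busiest first."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        try:
            rows.append((float(parts[0]), int(parts[1]), parts[2].strip()))
        except ValueError:
            continue  # ignore lines that are not process rows
    rows.sort(key=lambda r: r[0], reverse=True)
    return rows[:limit]
```

If the first entries show something like a scrub, balance, or resync task holding a large share of the CPU, that would fit the "whole lot of activity" explanation for the high temperatures.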