M4300-12X12F unstable on 12.0.19.x

Question

We have approximately 15 Netgear M4300 stacks where we hit a problem upgrading firmware. We've subsequently paused updating the others.

The following stacks upgraded without problem and remain stable afterwards:

2 x M4300-24X24F + 2 x M4300-28G
2 x M4300-8X8F
2 x M4300-24X

We did, however, have a problem with the following stack, where the 2nd unit was rebooting regularly (after 1-5 hours of uptime between restarts):

2 x M4300-12X12F

Unit 1 has remained stable but unit 2 reports the following issue:

May 23 02:54:40 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_utils.c(1294) 25321 %% CPLD(0x30): Unable to read @reg(0x50).
May 23 02:54:40 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_utils.c(1294) 25322 %% CPLD(0x30): Unable to read @reg(0x02).
May 23 02:54:40 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_utils.c(1322) 25323 %% CPLD(0x30): Unable to write data(0xa5)@reg(0x02).
May 23 02:54:40 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_utils.c(1294) 25324 %% CPLD(0x30): Unable to read @reg(0x10).
May 23 02:54:41 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_utils.c(1294) 25325 %% CPLD(0x30): Unable to read @reg(0x11).
May 23 02:54:41 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_utils.c(1294) 25326 %% CPLD(0x30): Unable to read @reg(0x19).
May 23 02:54:41 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_utils.c(1294) 25327 %% CPLD(0x30): Unable to read @reg(0x50).
May 23 02:54:41 zanjnb01-swd1l5-02-1 BOXSERV[boxs Req]: boxs.c(1452) 25328 %% Unit 2 power supply 1 FAILURE event (4) occurred.
May 23 02:54:41 zanjnb01-swd1l5-02-1 TRAPMGR[boxs Req]: traputil.c(795) 25329 %% Power supply state change alarm: Power Supply Unit: 2 ID: 1 Event: 4 - Failure
May 23 02:54:41 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_utils.c(1294) 25330 %% CPLD(0x30): Unable to read @reg(0x1a).
May 23 02:54:41 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_boxs.c(1259) 25331 %% FAN Module interrupt received on the unit 2 fan 1 fail

We interpreted the above as a bad PSU in unit 2 and replaced the PSU with a spare, then replaced the power cable and moved it to another socket on a different PDU after it still restarted. The message about the PSU having failed is however most probably a red herring, since the fan is also reported as having failed. My gut tells me that the stack master loses sync with the unit, which sporadically reboots, and then reports the various components as having failed.
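To check the red-herring theory against a longer log, something like the following could count how many PSU/fan failure events land within a couple of seconds of a CPLD access error. This is only a rough sketch: the regular expression is inferred from the excerpt above, the 2-second window and the function names are my own assumptions, and the year is a placeholder because the syslog lines do not carry one.

```python
import re
from datetime import datetime, timedelta

# Pattern inferred from the log excerpt in this post; other firmware
# versions may format their syslog lines differently.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<component>\w+)\[(?P<task>[^\]]+)\]: \S+ \d+ %% (?P<msg>.*)$"
)

# Assumption: the lines carry no year, and only time differences matter here.
PLACEHOLDER_YEAR = 2000


def parse(line):
    """Return (timestamp, host, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(f"{PLACEHOLDER_YEAR} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["host"], m["msg"]


def is_cpld_error(msg):
    return msg.startswith("CPLD(") and "Unable to" in msg


def is_component_failure(msg):
    # Matches PSU or fan messages that also mention a failure.
    return bool(re.search(r"power supply|fan", msg, re.I)
                and re.search(r"fail", msg, re.I))


def summarize(lines, window=timedelta(seconds=2)):
    """Count CPLD errors, component failures, and failures close to a CPLD error."""
    events = [e for e in map(parse, lines) if e is not None]
    cpld_times = [ts for ts, _, msg in events if is_cpld_error(msg)]
    failures = [(ts, msg) for ts, _, msg in events if is_component_failure(msg)]
    near = [msg for ts, msg in failures
            if any(abs(ts - c) <= window for c in cpld_times)]
    return {
        "cpld_errors": len(cpld_times),
        "failures": len(failures),
        "failures_near_cpld": len(near),
    }


# Four lines copied from the excerpt above.
SAMPLE = [
    "May 23 02:54:40 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_utils.c(1294) 25321 %% CPLD(0x30): Unable to read @reg(0x50).",
    "May 23 02:54:41 zanjnb01-swd1l5-02-1 BOXSERV[boxs Req]: boxs.c(1452) 25328 %% Unit 2 power supply 1 FAILURE event (4) occurred.",
    "May 23 02:54:41 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_utils.c(1294) 25330 %% CPLD(0x30): Unable to read @reg(0x1a).",
    "May 23 02:54:41 zanjnb01-swd1l5-02-2 BSP[unitMgrTask]: cpu_boxs.c(1259) 25331 %% FAN Module interrupt received on the unit 2 fan 1 fail",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

If every failure event sits next to a CPLD read error, that would be consistent with the components being marked as failed only because the controller could not be read. It would not prove it.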

The configuration on this problematic switch stack is fairly simple, in that it defines a variety of LAGs balanced between the two stack members, defines a couple of layer 2 VLANs that are distributed over LAGs and has two interfaces configured to do VLAN stacking (double VLAN). The other stacks all have a similar configuration.

We are not doing layer 3 on these switch stacks, except for management of the devices themselves.

Is this a known issue?

schumaku · Answer

With both power supply and fan control related errors, I suspect a hardware issue on a specific internal controller.

Consider challenging Netgear Support via https://my.netgear.com/ or contacting the Netgear AV team directly, led by LaurentMa (no idea why this new community platform won't find his certainly existing account, but that's another issue, reported since the platform was deployed, ChristineT).

Looks like a hardware warranty replacement is required.

Regards,
-Kurt.

bbs2web · Answer

So did we. We however swapped the PSU in unit 2, replaced the power cable and routed it to another power distribution panel in the rack.

Downgrading to 12.0.17.12 has however stabilised things. It was a little better on 12.0.17.16 (lasting almost 12 hours but then restarting again 3 hours later), but stable for 1+ day now on 12.0.17.12.

It was restarting regularly (every 1-5 hours) when running 12.0.19.10.

PS: The M4300-12X12F is unfortunately single corded. I believe the warnings shown in the logs are red herrings: when the 2nd unit restarts, monitoring of it simply reports read errors and then marks certain components as failed (e.g. PSU, fans, ports).

bbs2web · Answer

Stable for 52+ hours after downgrading firmware. In our testing, a stack of 2 x M4300-12X12F switches resulted in unit 2 restarting randomly after 1-5 hours of uptime when running:

12.0.19.10
12.0.19.7
12.0.17.16

Stable with:

12.0.17.12
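
For a support ticket it may help to state the boundary between the last stable and first unstable release tested. A small sketch of that comparison follows; the helper names are my own. The versions are compared as integer tuples because a plain string comparison would rank 12.0.19.7 above 12.0.19.10. It only covers the releases tested in this thread, so any release between the two boundary versions is untested.

```python
def parse_version(text):
    """Turn '12.0.19.10' into (12, 0, 19, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Results from the testing described in this thread.
UNSTABLE = ["12.0.19.10", "12.0.19.7", "12.0.17.16"]
STABLE = ["12.0.17.12"]


def regression_window(stable, unstable):
    """Return (newest stable, oldest unstable) among the tested releases."""
    newest_stable = max(stable, key=parse_version)
    oldest_unstable = min(unstable, key=parse_version)
    if parse_version(newest_stable) >= parse_version(oldest_unstable):
        raise ValueError("stable and unstable results overlap; no clean boundary")
    return newest_stable, oldest_unstable


if __name__ == "__main__":
    last_good, first_bad = regression_window(STABLE, UNSTABLE)
    print(f"Last stable tested: {last_good}, first unstable tested: {first_bad}")
```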
