M4300-24x: spontaneous reboots despite latest firmware


Sep 18, 2018

We've had a number of spontaneous reboots again over the last weeks/months. As we now have a redundant stack, operations aren't affected that much anymore, and we usually only find out by looking at the syslog messages.

Interestingly, it's always the same switch that is rebooting, stack switch #1, no matter if it is the "active unit" (after power on of the complete stack) or the backup unit (after the first spontaneous reboot of this switch #1, which makes stack switch #2 the "active unit").

As I had a configuration issue clobbering the log with its messages, I wanted to clean that up first before reporting back. I now have a clean log; here's what I found for the most recent reboot:

--- cut here ---

Sep 13 21:27:32 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045333 %% Stack-port link down: Index: 217 Unit: 2 Tag: 0/24
Sep 13 21:27:33 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045335 %% Stack-port link down: Index: 216 Unit: 2 Tag: 0/23
Sep 13 21:27:34 s-22455-02-05 CKPT[tCkptSvc]: ckpt_task.c(363) 1045336 %% Checkpoint message transmission to unit 1 failed for LLDP(85).
Sep 13 21:27:34 s-22455-02-05 CKPT[tCkptSvc]: ckpt_task.c(363) 1045337 %% Checkpoint message transmission to unit 1 failed for LLDP(85).
Sep 13 21:27:34 s-22455-02-05 CKPT[tCkptSvc]: ckpt_task.c(363) 1045338 %% Checkpoint message transmission to unit 1 failed for LLDP(85).
Sep 13 21:27:39 s-22455-02-05 TRAPMGR[dot1s_task]: traputil.c(763) 1045339 %% Spanning Tree Topology Change Received: MSTID: 0 lag 21
Sep 13 21:27:40 s-22455-02-05 TRAPMGR[dot1s_task]: traputil.c(763) 1045340 %% Spanning Tree Topology Change Received: MSTID: 0 lag 21
Sep 13 21:27:42 s-22455-02-05 TRAPMGR[dot1s_task]: traputil.c(763) 1045341 %% Spanning Tree Topology Change Received: MSTID: 0 lag 21
Sep 13 21:27:46 s-22455-02-05 CKPT[tCkptSvc]: ckpt_task.c(487) 1045343 %% Backup manager removed.
Sep 13 21:27:46 s-22455-02-05 VOIP[tCkptSvc]: voip_ckpt.c(174) 1045344 %% Backup unit gone
Sep 13 21:27:46 s-22455-02-05 UNITMGR[unitMgrTask]: unitmgr.c(8116) 1045345 %% No Potential unit to configure as Standby when unit 1 left
Sep 13 21:27:52 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045347 %% Entity Database: Configuration Changed
Sep 13 21:28:16 s-22455-02-05 DRIVER[hapiL3AsyncTask]: broad_hpc_rpc.c(1041) 1045348 %% hpcHardwareRpc: RPC Timeout for transaction 6018
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045349 %% Interface 1/0/1 detached from ndesan01.
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045350 %% Interface 1/0/2 detached from ndesan02.
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045351 %% Interface 1/0/3 detached from ndesan03.
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045352 %% Interface 1/0/4 detached from ndesan04.
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045353 %% Interface 1/0/5 detached from ndemds01.
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045354 %% Interface 1/0/9 detached from compute1.
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045355 %% Interface 1/0/10 detached from compute2.
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045356 %% Interface 1/0/11 detached from compute3.
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045357 %% Interface 1/0/12 detached from compute4.
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045358 %% Interface 1/0/17 detached from nde32.
Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045359 %% Interface 1/0/18 detached from control1.
Sep 13 21:28:17 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045360 %% Interface 1/0/21 detached from s-22455-02-01.
Sep 13 21:28:22 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045362 %% Entity Database: Configuration Changed
Sep 13 21:28:39 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045364 %% Stack-port link up: Index: 216 Unit: 2 Tag: 0/23
Sep 13 21:28:39 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045366 %% Stack-port link down: Index: 216 Unit: 2 Tag: 0/23
Sep 13 21:28:39 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045368 %% Stack-port link up: Index: 217 Unit: 2 Tag: 0/24
Sep 13 21:28:39 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045370 %% Stack-port link up: Index: 216 Unit: 2 Tag: 0/23
Sep 13 21:28:40 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045372 %% Stack-port link down: Index: 216 Unit: 2 Tag: 0/23
Sep 13 21:28:40 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045374 %% Stack-port link up: Index: 216 Unit: 2 Tag: 0/23
Sep 13 21:28:56 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045387 %% Stack-port link up: Index: 116 Unit: 1 Tag: 0/23
Sep 13 21:28:56 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045388 %% Stack-port link up: Index: 117 Unit: 1 Tag: 0/24
Sep 13 21:28:59 s-22455-02-05 CKPT[tCkptSvc]: ckpt_task.c(523) 1045418 %% New backup manager selected, unit 1.
Sep 13 21:28:59 s-22455-02-05 CKPT[tCkptSvc]: ckpt_task.c(423) 1045419 %% Checkpoint operation to backup unit 1 complete.
Sep 13 21:29:06 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045421 %% Link Up: 1/0/21
Sep 13 21:29:07 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045427 %% Link Up: 1/0/5
Sep 13 21:29:07 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045428 %% Link Up: 1/0/10
Sep 13 21:29:08 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045429 %% Link Up: 1/0/9
Sep 13 21:29:08 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045430 %% Link Up: 1/0/12
Sep 13 21:29:08 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045431 %% Link Up: 1/0/11
Sep 13 21:29:09 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045433 %% Link Up: 1/0/17
Sep 13 21:29:09 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045434 %% Entity Database: Configuration Changed
Sep 13 21:29:10 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045435 %% Interface 1/0/21 attached to s-22455-02-01.
Sep 13 21:29:10 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045437 %% Link Up: 1/0/2
Sep 13 21:29:10 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045438 %% Interface 1/0/11 attached to compute3.
Sep 13 21:29:10 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045439 %% Interface 1/0/10 attached to compute2.
Sep 13 21:29:10 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045440 %% Interface 1/0/12 attached to compute4.
Sep 13 21:29:10 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045442 %% Link Up: 1/0/1
Sep 13 21:29:10 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045444 %% Link Up: 1/0/3
Sep 13 21:29:10 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045445 %% Interface 1/0/9 attached to compute1.
Sep 13 21:29:10 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045446 %% Interface 1/0/5 attached to ndemds01.
Sep 13 21:29:11 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045448 %% Link Up: 1/0/18
Sep 13 21:29:11 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045449 %% Interface 1/0/17 attached to nde32.
Sep 13 21:29:12 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045450 %% Interface 1/0/2 attached to ndesan02.
Sep 13 21:29:12 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045451 %% Interface 1/0/1 attached to ndesan01.
Sep 13 21:29:12 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045452 %% Interface 1/0/3 attached to ndesan03.
Sep 13 21:29:14 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045453 %% Interface 1/0/18 attached to control1.
Sep 13 21:29:15 s-22455-02-05 TRAPMGR[trapTask]: traputil.c(721) 1045455 %% Link Up: 1/0/4
Sep 13 21:29:18 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045457 %% Interface 1/0/4 attached to ndesan04.
Sep 13 21:29:35 s-22455-02-05 UNITMGR[umWorkerTask]: unitmgr.c(7040) 1045459 %% Copy of running configuration to backup unit complete
Sep 13 21:30:58 s-22455-02-05 STACKING[spmTask]: spm.c(1434) 1045464 %% Errors detected on stack-port 1/0/23 (oldRxErrors = 0 currentRxErrors = 1 oldTxErrors = 0 currentTxErrors = 0). Use the stacking diagnostics command to look at detailed statistics.
Sep 13 21:30:58 s-22455-02-05 STACKING[spmTask]: spm.c(1434) 1045465 %% Errors detected on stack-port 1/0/24 (oldRxErrors = 0 currentRxErrors = 1 oldTxErrors = 0 currentTxErrors = 0). Use the stacking diagnostics command to look at detailed statistics.

--- cut here ---

The next message in the log is from Sep 18.
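To put a number on how long unit 1 was gone, the excerpt can be fed through a small script. This is only a rough sketch, assuming Python 3 and the syslog line layout shown above; the function names, the year default (taken from the post date) and the choice of "Checkpoint operation to backup unit ... complete" as the end marker are my own assumptions, not anything the switch defines.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt:
# <Mon DD HH:MM:SS> <host> <COMPONENT>[<task>]: <file.c(line)> <seq> %% <message>
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"(?P<comp>\w+)\[(?P<task>\w+)\]: (?P<src>\S+) (?P<seq>\d+) %% (?P<msg>.*)$"
)

def parse(line, year=2018):
    """Return (timestamp, component, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Syslog timestamps carry no year, so one has to be supplied (assumption: 2018).
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["comp"], m["msg"]

def outage_window(lines):
    """First stack-port link down and last completed checkpoint to the backup unit."""
    first_down = None
    restored = None
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        ts, _comp, msg = rec
        if first_down is None and msg.startswith("Stack-port link down"):
            first_down = ts
        if msg.startswith("Checkpoint operation to backup unit"):
            restored = ts
    return first_down, restored

# Three lines copied from the excerpt above.
SAMPLE = [
    "Sep 13 21:27:32 s-22455-02-05 TRAPMGR[spmTask]: traputil.c(763) 1045333 %% Stack-port link down: Index: 217 Unit: 2 Tag: 0/24",
    "Sep 13 21:27:46 s-22455-02-05 CKPT[tCkptSvc]: ckpt_task.c(487) 1045343 %% Backup manager removed.",
    "Sep 13 21:28:59 s-22455-02-05 CKPT[tCkptSvc]: ckpt_task.c(423) 1045419 %% Checkpoint operation to backup unit 1 complete.",
]

first_down, restored = outage_window(SAMPLE)
gap_seconds = (restored - first_down).total_seconds()
print(f"unit 1 missing from the stack for about {gap_seconds:.0f} s")
```

For this reboot that works out to roughly a minute and a half between the first stack-port link down and the backup unit being checkpointed again.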

The stack is made up of two M4300-24x, connected via direct stack link cables from 1/0/23 and 1/0/24 to 2/0/23 and 2/0/24. (The reports about errors on these two ports only ever appear right after the module reboot, similar to the last two lines above.)

LAG 21, reporting the STP topo change, is the redundant uplink to a central switch via 1/0/21 and 2/0/21.

The other ports on module 1 that are going down and up are connected redundantly to a server each, with the servers' second link going to the same port on switch module #2, running LACP on the LAG.
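To check that every LAG member on unit 1 came back, and how long each was detached, the DOT3AD detach/attach messages can be paired per interface. Again a minimal sketch under assumptions: Python 3, the message wording exactly as in the excerpt, the year taken from the post date, and a helper name (lag_member_downtime) that I made up for illustration.

```python
import re
from datetime import datetime

# Matches the DOT3AD messages in the excerpt, e.g.
# "... %% Interface 1/0/4 detached from ndesan04."
EVENT_RE = re.compile(r"Interface (\S+) (detached from|attached to) (\S+)\.$")

def lag_member_downtime(lines, year=2018):
    """Pair detach/attach events per interface.

    Returns (downtime, still_detached): seconds detached per interface,
    and a sorted list of interfaces that never re-attached.
    """
    detached = {}
    downtime = {}
    for line in lines:
        line = line.strip()
        m = EVENT_RE.search(line)
        if m is None:
            continue
        # The first 15 characters are the syslog timestamp ("Sep 13 21:28:16").
        ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        port, action, _lag = m.groups()
        if action == "detached from":
            detached[port] = ts
        elif port in detached:
            downtime[port] = (ts - detached.pop(port)).total_seconds()
    return downtime, sorted(detached)

# Four lines copied from the excerpt above.
SAMPLE = [
    "Sep 13 21:28:16 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045352 %% Interface 1/0/4 detached from ndesan04.",
    "Sep 13 21:28:17 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(1014) 1045360 %% Interface 1/0/21 detached from s-22455-02-01.",
    "Sep 13 21:29:10 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045435 %% Interface 1/0/21 attached to s-22455-02-01.",
    "Sep 13 21:29:18 s-22455-02-05 DOT3AD[dot3ad_core_lac]: dot3ad_db.c(951) 1045457 %% Interface 1/0/4 attached to ndesan04.",
]

downtime, still_detached = lag_member_downtime(SAMPLE)
for port, seconds in sorted(downtime.items()):
    print(f"{port}: detached for {seconds:.0f} s")
print("never re-attached:", still_detached or "none")
```

Note that the detach messages are only logged some time after the stack ports went down, so the per-interface figures understate the real loss of the link on unit 1.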

I collected all switch reports I could get hold of on our TFTP server, via the CLI's "copy nvram:..." commands. The crash log is again an empty file, although the CLI command reported a successful transfer. All other files were transferred successfully, resulting in actual content on the TFTP server.

As it is the same physical switch rebooting, I tend to believe it is a hardware error of some sort.

Any ideas on how to proceed? Is unit 1 a candidate for replacement?

With regards,

Jens
