
Re: product suggestion: staggered reboot

jmozdzen
Tutor

product suggestion: staggered reboot

Dear Netgear team,

 

While fighting a different issue, which I tried to solve by updating and rebooting our M4300-24X top-of-rack switches, I came across something I'd like to suggest as an improvement:

 

We're using two M4300-24x in stacked mode as top-of-rack switches, with full redundancy - two intra-stack links, redundant uplinks to the core switches, all servers connected via LACP with at least two links (going to both -24X).

 

When updating the switches, a full reboot is required for both stack units. This causes a service interruption that lasts much longer than seems necessary (tens of seconds), because both units are rebooted simultaneously.

 

I believe it should be possible to implement a different approach for controlled reboots:

  1. fail-over to the second stack module
    This leaves the network operational (in the old configuration) and because of the controlled nature of this operation, should be able to happen without traffic interruption (unlike when just power-failing the first module)
  2. reboot the first module
    This will activate the new software version on the first module. Of course the second module cannot simply be "on-boarded", because of the different software levels.
    The first module should be coming up as much as possible, but without actually forwarding traffic via its ports - because the second module is still active.
  3. "fail-over" from the second module
    I cannot fully judge the internals, but it should be possible to STONITH the second module and have the first module take over operations with much less service interruption than with a traditional parallel reboot of both switches.
  4. "On-board" the second module
    The second module needs no special handling, it can operate as with any power reboot: initialize and join the stack.
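
To make the ordering concrete, here is a minimal sketch of the four steps as a toy model in Python. It does not talk to a real switch: `Unit`, `staggered_reboot` and the `forwarding` flag are hypothetical names made up for illustration, and the hand-over in step 3 is assumed to be instantaneous, which a real stack may not achieve. It only shows that, in this order, at least one unit is forwarding after every step.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical stand-in for one stack member (not a Netgear API)."""
    name: str
    firmware: str
    forwarding: bool = True  # is the unit passing traffic on its ports?


def staggered_reboot(first, second, new_fw):
    """Walk the four proposed steps.

    Returns a list of (step, number of units forwarding) recorded after each step.
    """
    log = []

    def record(step):
        log.append((step, sum(u.forwarding for u in (first, second))))

    # 1. Controlled fail-over: the second unit carries all traffic.
    first.forwarding = False
    record("fail-over to second")

    # 2. The first unit reboots into the new firmware but stays passive.
    first.firmware = new_fw
    record("first rebooted, passive")

    # 3. Hand-over: the second unit is stopped, the first starts forwarding.
    #    (Assumed instantaneous here; this is the critical moment in reality.)
    second.forwarding = False
    first.forwarding = True
    record("first takes over")

    # 4. The second unit reboots as after a power cycle and joins the stack.
    second.firmware = new_fw
    second.forwarding = True
    record("second rejoined")

    return log


if __name__ == "__main__":
    a, b = Unit("unit1", "old"), Unit("unit2", "old")
    for step, active in staggered_reboot(a, b, "new"):
        print(f"{step}: {active} unit(s) forwarding")
```

In a parallel reboot, the same count would drop to zero while both units restart, which is the interruption the proposal tries to avoid.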

Of course, this two-stage reboot should be optional, because in non-redundant situations, I guess a parallel reboot of both/all stack nodes will get you back to fully operational mode faster.

 

Best regards,

Jens

Model: XSM4324CS|M4300-24X - Stackable Managed Switch with 24x10GBASE-T
Message 1 of 4
msi
Luminary

Re: product suggestion: staggered reboot

Hi

 

This is sort of already possible and has been described by @LaurentMa in the following thread: https://community.netgear.com/t5/Managed-Switches/M4300-stack-vs-non-stack-VMware-and-iSCSI-setup/m-... (M4300 for iscsi storage)

 

... though manually and you need to be aware of some restrictions. In another thread (M4300 stack vs non stack, VMware and iSCSI setup) I've shared my experiences with that staged update method, how it worked in my case and where it has its shortcomings: https://community.netgear.com/t5/Managed-Switches/M4300-stack-vs-non-stack-VMware-and-iSCSI-setup/m-...

 

The method has some shortcomings and risks, but that sort of fully non-stop failover is really only possible with datacenter-focused switches, which usually play in much higher price ranges than Netgear. And even there you need to check whether those switches support a staged update across a stack, since definitely not all stackable switches (actually not many, as I've found out) can.

Message 2 of 4
jmozdzen
Tutor

Re: product suggestion: staggered reboot

Hi,

 

thank you for the pointers - if I read the linked messages correctly, it's basically the manual process version of what I described above. I'll give that a try the next time I need to update the switches, in order to reduce the actual "network down-time" of that server rack.

 

Regards,

Jens

Message 3 of 4
msi
Luminary

Re: product suggestion: staggered reboot

I suggest that you "exercise" this procedure on a lab switch or a "not-so-important" stack during a planned maintenance window to get some "feel" for how this behaves, before doing an actual upgrade on a more critical stack. Also, rebooting the complete stack has tended to be the safest method in my case, so that is likely why Netgear suggests it as the recommended method.

 

Though IMHO they could promote that staggered method a bit more publicly.

 

I usually test a new release first on a non-stacked unit, then on a less important stacked setup, and only then move the upgrade to the important stacks. It has also proven helpful to verify that NSF failover works correctly by moving the stack master role, after maintenance, back to the unit that was stack master before maintenance. (In our setup, for convenience, unit 1 is the stack master and unit 2 the backup master, by setting priorities accordingly.)

 

It has also proven helpful in my case to get some experience with the failover behaviour on M4300 stacks. Reading the difference in the CLI manual between 'initiate failover' and 'movemanagement' was eye-opening during the first maintenance. 😉

Message 4 of 4