Forum Discussion

Aspirant

Sep 10, 2017

Solved

DHCP Response not traversing stacked switches

Hi, hopefully only another user error, but at the moment I am out of ideas again. I have 2 S3300's (FW 6.6.17) [A1,A2] connected via stacking in 2 different rooms. On one of the 10G Ports another ...

dhcp

stack

Troubleshooting

Hopchen
Sep 11, 2017
Alright, that is good! Thanks for clarifying.

I think we will focus on one direction for now (not the music band :)). Since your Windows server will be the DHCP server in the end, we can use that one and leave the Linux one turned off/disconnected for now.

I am not sure where the Windows server is connected, but let's just assume that it is connected to A1 in the stack and your client is connected to switch B. Connection flow is like this then:
DHCP server --> A1-stack --> A2-stack --> Switch B --> DHCP client.

You have already done some good troubleshooting with Wireshark. From my understanding, you have port mirrored and Wiresharked the link between A2-stack and switch B?

- You see the DHCP discover sent from the client?
- You also see the DHCP offer from the server?
- Hereafter, you see nothing further?

Just to clarify how DHCP works:
1. Client sends a DHCP discover.
2. Server sends a DHCP offer, offering an IP address to the client.
3. Client accepts the offer by sending a DHCP request to the server. A request to obtain the offered address.
4. The server will acknowledge to the client, with a DHCP ACK packet, and then the server will register the DHCP entry in its DHCP binding table.
5. Client checks that the assigned address is not in use by anyone else (typically an ARP probe).
a. If the address is in use by someone else, the client won't use it and sends a DHCP decline message to the server. Hereafter the client starts the DHCP discover process all over again.

b. If the address is not in use, the client configures it and the exchange is complete.
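If you want to tell these message types apart in raw capture bytes rather than relying on Wireshark's decoding, the type is carried in DHCP option 53. Here is a minimal Python sketch, assuming you already have the UDP payload of a packet as bytes. The names `dhcp_message_type` and `build_minimal` are made up for illustration and are not part of any library.

```python
# DHCP message types carried in option 53.
DHCP_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
              5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM"}

MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the options field
BOOTP_FIXED_LEN = 236               # fixed BOOTP header before the cookie


def dhcp_message_type(payload: bytes):
    """Return the DHCP message type name from a UDP payload, or None."""
    start = BOOTP_FIXED_LEN
    if payload[start:start + 4] != MAGIC_COOKIE:
        return None  # too short, or plain BOOTP without DHCP options
    i = start + 4
    while i < len(payload):
        code = payload[i]
        if code == 255:          # end option
            break
        if code == 0:            # pad option has no length byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break                # truncated option
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return DHCP_TYPES.get(value[0])
        i += 2 + length
    return None


def build_minimal(msg_type: int) -> bytes:
    """Hypothetical test helper: zeroed BOOTP header plus option 53 only."""
    return bytes(BOOTP_FIXED_LEN) + MAGIC_COOKIE + bytes([53, 1, msg_type, 255])
```

Feeding it `build_minimal(2)` gives back the offer type, and anything without the magic cookie gives `None`.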

So, we gotta find out exactly where this goes wrong. That will be important.

Let's try this:

- Port mirror the uplink between A2-stack and switch B, as you have done already. Run Wireshark to capture the traffic on that uplink.

- Packet capture the server as well, at the same time.

- Packet capture the client as well, at the same time.

So, we have 3 packet captures running now.

- Now, connect the DHCP client to switch B and see what happens. Filter the Wireshark captures by: bootp. This is so that you only see DHCP traffic.

What happens exactly?

The client sends the discover. The discover reaches the server?

The server replies with an offer? If so, does the offer reach the client?

etc.

As you have 3 Wiresharks running, you can see exactly where the problem lies. For example:

If the server sees the DHCP discover and sends the offer and the client sees the offer but never sends a request --> then we need to look into why the client didn't send the request.

or

If the server sees the DHCP discover and sends the offer, and you see the offer being forwarded on the link between A2-stack and switch B, but the client never sees the offer --> then we have to investigate why the offer was not forwarded to the client, by switch B.

etc.
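The reasoning above can be written down as a small decision function. This is only a sketch, assuming the path from earlier in this post (server on the A stack, client behind switch B) and that you note by hand which DHCP message types each of the three captures shows. The function name and the dictionary keys are illustrative, not something Wireshark produces.

```python
def locate_fault(seen):
    """Narrow down where the DHCP exchange breaks.

    `seen` maps a capture point to the set of DHCP message types observed:
    "server", "uplink" (A2-stack to switch B) and "client".
    """
    server, uplink, client = seen["server"], seen["uplink"], seen["client"]

    # The discover travels client -> switch B -> uplink -> stack -> server.
    if "DISCOVER" not in client:
        return "client never sent a discover"
    if "DISCOVER" not in uplink:
        return "discover lost between the client and the uplink (switch B)"
    if "DISCOVER" not in server:
        return "discover lost between the uplink and the server (stack)"

    # The offer travels server -> stack -> uplink -> switch B -> client.
    if "OFFER" not in server:
        return "server saw the discover but sent no offer"
    if "OFFER" not in uplink:
        return "offer lost between the server and the uplink (stack)"
    if "OFFER" not in client:
        return "offer lost between the uplink and the client (switch B)"

    if "REQUEST" not in client:
        return "client saw the offer but sent no request"
    return "exchange got past the offer; check request/ACK the same way"
```

For the second example above, where the offer is visible on the uplink but never reaches the client:

```python
captures = {
    "server": {"DISCOVER", "OFFER"},
    "uplink": {"DISCOVER", "OFFER"},
    "client": {"DISCOVER"},
}
print(locate_fault(captures))
```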

I hope that makes sense? It is a bit of work, but this will give you a much better picture, because until we have the full picture we can only make educated guesses :)

Any questions, give me a shout!

Cheers

Hopchen

Prodigy

Sep 13, 2017

Hi,

It is weird indeed! I don't think MTU would play a role here. As you said, you see other traffic from the client VM to DHCP server VM.

I can't tell you why switch B is not passing those DHCP offers back to the client, however, at least you know where the issue lies. The fact that you tested the port mirror and it shows it is working, but you still don't see the offers traverse across that uplink - that tells us switch B is not forwarding them.

ACLs could be the issue, if they are implemented on switch B. DHCP snooping is an obvious suspect, but you have checked that this is turned off on switch B. You have narrowed the issue down, but I am not sure how to solve this. I have seen similar issues in the past. "Strange" behaviour that was hard to make sense of. In most of those cases it was a corrupt config that was the issue. This is why I suggested to factory reset the switch, if possible.

If this was me, I would try and connect a different switch to the Netgear stack and move my DHCP server to that switch. If it works fine with that setup, I would start to move devices to that new switch temporarily and afterwards go "brute force" on switch B - i.e. reset it and reconfigure it from scratch.

But yes, it is a really tough one. I understand why you were perplexed here.

Cheers

rand__

Aspirant

Sep 13, 2017

So I might have a solution (if it is that then stupid user error) but I don't really understand it.

I was working on using an alternate NIC in one of the boxes to circumvent Switch B (instead directly attach to 10G Port).

Then I remembered that at some point in time (weeks ago when I started looking at it) I had DHCP working on a physical box attached to A1 but not a VM hosted on a box attached to A1.

Back then I checked the switch config and found that I was actively tagging VLAN 1 on the path between Switch B and A1, i.e. A2->Uplink B was tagging 1 as well as the stack port.

(Side question - do I need to manage the stacking port at all with vlans etc or is that config agnostic?)

So after reading and searching I found that the native VLAN (VLAN 1 on Netgear) is basically untagged traffic. Untagged traffic gets associated with VLAN 0 on ESX. So naturally I removed the tagging of VLAN 1 since I needed untagged traffic to flow into my ESX dvSwitch to reach the correct portGroup. Left that be ever since until I remembered this today.

So I reconfigured tagging for VLAN 1 on these 2 interfaces only - *not* on the interface going to the ESX box and now suddenly both physical and virtual client receive IPs?

I think I had changed the dvSwitch Config during debugging to a non trunked value - maybe that was the actual fix here.

Further tests:

I have just added VLAN1 Tagging on the A1 side (to ESX box) - now it's not working any more - remove from ESX Port - works.

So it seems that I need Vlan1 Tagging on uplink Port to physical Switch B but must not use it on ESX dvSwitch uplink ports...

Hopchen
Prodigy
Sep 14, 2017
Hi,

I am glad to hear you have made progress.

However, I will agree that the VLAN should not have been the issue - based on your testing. I understand that you are confused with that part. See, if the VLAN was incorrect then the server would never even see the DHCP discover from the client - let alone respond with an offer. So, that is odd. But I am no VMware expert and I am not sure how those virtual NICs handle the traffic.

I am very surprised the VLAN settings on the port to the VMs seem to have been the issue. Surely, if that was the case then you wouldn't be able to ping across or even see each other's broadcasts. But, whatever, it is working and that is good! :)

The tagging vs untagging comes down to this:
1. If the device in the other end is VLAN-aware, then Tag ("T") the traffic.
2. If the device in the other end is NOT VLAN-aware, then UNtag ("U") the traffic and set corresponding PVID.

So, between switches I would always Tag my traffic. Tagging the traffic also allows for multiple VLANs on a port, called a trunk. You can let VLAN 1 run untagged across trunk-links if you prefer though (letting VLAN 1 run as what is called a Native VLAN - as you mention yourself). It is a matter of preference really.

As regards your switch-ports connecting to the VMs: you can Tag ("T") those for VLAN 1 if you want, but you need to make sure that the VMs are also set to Tag themselves, for VLAN 1. It has to match.
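To picture why both ends have to match, here is a small Python model of the "T"/"U"/PVID rule. It is a deliberately strict simplification, not how any particular switch or ESX is implemented: a receiver only accepts a tagged frame if it tags that VLAN itself (a real switch may accept tagged frames for any VLAN the port is a member of). The names `Port`, `egress`, `ingress` and `deliver` are made up for this sketch.

```python
from dataclasses import dataclass, field

UNTAGGED = "untagged"
DROP = "drop"


@dataclass
class Port:
    pvid: int                                   # VLAN given to untagged ingress frames
    tagged: set = field(default_factory=set)    # VLANs marked "T" on this port
    untagged: set = field(default_factory=set)  # VLANs marked "U" on this port


def egress(port, vlan):
    """What goes on the wire: the VLAN ID (tagged), UNTAGGED, or DROP."""
    if vlan in port.tagged:
        return vlan
    if vlan in port.untagged:
        return UNTAGGED
    return DROP  # port is not a member of this VLAN


def ingress(port, wire):
    """Which VLAN the receiving side puts the frame in, or DROP."""
    if wire == UNTAGGED:
        return port.pvid  # untagged frames land in the PVID
    return wire if wire in port.tagged else DROP  # strict model, see above


def deliver(sender, receiver, vlan):
    """Send a frame of `vlan` out of `sender` and into `receiver`."""
    wire = egress(sender, vlan)
    return DROP if wire == DROP else ingress(receiver, wire)
```

A quick run of the cases from this thread, under the assumptions of the model:

```python
switch_tagging = Port(pvid=1, tagged={1})
switch_untagging = Port(pvid=1, untagged={1})
vm_side_untagged_only = Port(pvid=1, untagged={1})

print(deliver(switch_tagging, vm_side_untagged_only, 1))    # mismatch
print(deliver(switch_untagging, vm_side_untagged_only, 1))  # both untagged
print(deliver(switch_tagging, Port(pvid=1, tagged={1}), 1)) # both tagged
```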

Cheers
rand__
Aspirant
Sep 14, 2017
Hi,
first of all thank you very much for your help, it's always good to have someone doublecheck and order the approach (and confirm one is not going crazy if it's not working :p).

Second - it's totally correct that I must not tag the traffic to the ESX switch since only untagged traffic will be handled as such and reach the target portgroup (which accepts only untagged traffic).
The weird thing is that I needed to tag traffic from Switch B to A although both switches treat Vlan 1 as native. But it might relate to the trunk itself, in that either switch is not able to trunk *and* accept untagged traffic at the same time.
If so, then I'd assume that the tagged Vlan1 traffic is untagged since PVID matches at that port. Thus I'd need to have the mixed tagging settings (contrary to seemingly logical approach to have native vlan all-tagged or none-tagged ).

While I have you, would you mind answering the question regarding the need of configuring the stacking port at all?
Thanks a lot,
cheers:)
Hopchen
Prodigy
Sep 15, 2017
Hey,

No problem at all!

Sorry, I missed that question. No, you do not need to manage the stacking port at all.

As for the trunk, the NTGR switches can indeed Trunk with native VLAN, meaning all VLANs are tagged except one (the native VLAN) which is untagged + PVID. Not sure how switch B handles this though, but as you can tag all VLANs across, I would do that. I don't normally see much reason to run with a "native VLAN" on the Trunk between switches. I'd rather tag them all.

Any doubts, reach out! :)

Thanks
