I was asked to take a look at an issue with virtual machines losing network connectivity. The problems were described as follows:
Sometimes some VMs had connectivity, some times they didn’t. It was not tied to specific virtual machines. Sometimes the problem was not there, than it showed up again. It was, not an issue of a wrong subnet mask or gateway.
They suspected firmware or driver issues. Maybe it was a Windows NIC teaming bug or problems with DVMQ or NIC offload settings. There’s a lot of potential reasons, just Google Intermittent VM connectivity Issues Hyper-V and you’ll get a truckload of options.
So a round of wishful firmware, driver upgrading started. Followed by a round of wishful disabling network features. That’s one way to do it. But why not sit back an look at the issue.
Based on what they said I looked at the environment and asked it was tied to specific host as only VMs on one of the hosts had the issue. Could it be be after a live migration or a VM restart. They didn’t really know but it could. So we started looking at the hosts. All teams for the vSwitch were correctly configured on all host. No tagged VLAN on the member NIC. No extra team interfaces that would violate the rule that there can be only one if the team is used by a Hyper-V switch. They used the switch independent teaming mode with the load balancing mode set to Dynamic, all member active. Perfect.
I asked it they used tagged VLAN on the VMs some times. They said yes. Which gave me a clue they had trunking or general mode configured on the ports. So I looked at the switches to see what the port configuration was like? Guess what. All ports on both switches were correctly configured bar the ports of the vSwitch team members on one Hyper-V host. The one with problematic VMs. The two ports were in general mode but the port on the top switch had PVID* 100 and the one on the bottom switch had PVID 200. That was the issue. If the VM “landed” on the team member with PVID 200 it has no network connectivity.
* PVID (switchport general pvid 200) is the default VLAN of the port, in CISCO speak that would translate into “”native VLAN as in switchport trunk native vlan 200
Yes NIC firmware and drivers have issues. There are bugs or problems with advanced features once in a while. But you really do need to check that the configuration is correct and the design or setup makes sense. Do yourself a favor by not assuming anything. Trust but verify.