In-place upgrade of RD Gateway farm nodes to Windows Server 2016 removes the loopback adapter for UDP load balancing

Here’s a quick heads up to anyone who’s involved in upgrading existing Windows Server 2012 (R2) RD Gateway farms to Windows Server 2016.

In my recent experience the in-place upgrade (of VMs) works rather well. Just make sure the Netlogon service is set to Automatic after you upgrade and install all updates (a known issue; a fix is coming). Also make sure that you don’t have this issue:

Windows Time Service settings are not preserved during an in-place upgrade to Windows Server 2016 or Windows 10 Version 1607

There is, however, one network-specific issue you’ll need to deal with when leveraging UDP with a load balancer via Direct Server Return.

When you have an RD Gateway farm you load balance it with a (preferably highly available) load balancer like a Kemp Loadmaster. I have described this in these blog posts/videos: Load balancing Hyper-V Workloads With High To Continuous Availability With a KEMP Loadmaster and Quick Demo Video Of Site Failover With KEMP Loadmaster Global Balancing.

What you also do is load balance both HTTPS (TCP, port 443) and UDP (port 3391). For UDP we use Direct Server Return (DSR), as described in my blog post Load balancing UDP for a RD Gateway farm with a KEMP Loadmaster. This requires a properly configured loopback adapter.

image

During the in-place upgrade to Windows Server 2016 this loopback adapter is removed from the nodes, so you need to add it back just as described in my original blog post. Normally it will find the settings for it in the registry, but it’s best you check it all out, as I’ve found that the loopback adapter did have “Register this connection’s addresses in DNS” enabled, as well as NetBIOS over TCP/IP. So, per my blog post, check it all to make sure. Other than that, after installing all the Windows Server 2016 updates, everything works smoothly after an in-place upgrade.
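For reference, this is roughly how I’d double-check those post-upgrade items with PowerShell. It’s just a quick sketch and it assumes you renamed the loopback adapter to KEMP-DSR-LOOPBACK as in the original post:

# Netlogon should be back to Automatic after the in-place upgrade (the known issue)
Set-Service -Name Netlogon -StartupType Automatic

# The loopback adapter should NOT register its addresses in DNS ...
Get-DnsClient -InterfaceAlias "KEMP-DSR-LOOPBACK" |
    Select-Object InterfaceAlias, RegisterThisConnectionsAddress

# ... and should have NetBIOS over TCP/IP disabled (TcpipNetbiosOptions 2 = disabled)
Get-CimInstance Win32_NetworkAdapterConfiguration -Filter "Description LIKE '%Loopback%'" |
    Select-Object Description, TcpipNetbiosOptions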

Hope this helps someone out there!

High Availability has a price

We’ll go back to basics today. Sometimes the obvious, no matter how evident it is to us technologists, is challenged. Recently we got the remark that we were wasting CPU cycles by assigning too many vCPUs to certain virtual machines on our Hyper-V cluster. So we had to explain that high availability has a price. On top of that we had to explain that things are not as wasteful as they seem in a virtual environment.

The case

Here’s one of the “offending” virtual machines. They assumed that we were wasting at least 50% of 12 CPUs.

image

This is one node in a dual-node, load-balancing (active-active) and highly available solution. This provides for zero downtime during scheduled maintenance and very little downtime during system failures.

And here’s the second node (yes, the 1st node has been down for scheduled maintenance more recently than node 2).

image

In a 2-node HA solution you need to make sure that one node can handle the entire workload. This is the absolute borderline of an N+1 solution, meaning you can lose 1 node. N determines the number of nodes needed to guarantee an agreed-upon service level, and the number after the plus sign defines how many node failures can be tolerated before affecting the service.

In the above example there’s a need to have the CPU resources on each node to run the entire workload on one node without having an effect on the service. Therefore, when both nodes are up this might seem like a waste to the uninitiated. It is, however, required to achieve the high availability goal. A constant CPU usage over 75% will lead to a reduction in service quality in this case and even compromise the usability of that service.

I did not even dive into the dangers of designing purely based on averages during this “explanation”. That was one step too far for the level of the discussion.

It’s also important to note that Hyper-V CPU scheduling is highly intelligent and is far less susceptible to the waste of CPU cycles via over-provisioning of vCPUs than some other solutions are or used to be. Knowing the capabilities and inner workings of the technology used is also important in all this. More nodes generally also make “over-provisioning” less of an issue. When you have 10 nodes and you lose 1, you have only lost 10% of the capabilities, not 33% like in a 3-node cluster.
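To put some numbers on that, here’s a quick PowerShell sketch of the headroom arithmetic (purely illustrative; the function is just something I wrote for this example):

# With a given node count and a tolerated number of failed nodes, the surviving
# nodes must carry the full workload, so each node may only be loaded to
# (Nodes - Failures) / Nodes of its capacity while everything is healthy.
function Get-MaxNodeUtilization {
    param([int]$Nodes, [int]$Failures = 1)
    [math]::Round(100 * ($Nodes - $Failures) / $Nodes, 1)
}

Get-MaxNodeUtilization -Nodes 2    # 50   -> half of each node is "reserve"
Get-MaxNodeUtilization -Nodes 3    # 66.7 -> you reserve 33%
Get-MaxNodeUtilization -Nodes 10   # 90   -> you only reserve 10%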

Ideally you have 3 nodes, so that even during an issue with one node you still maintain redundancy. However, if you want acceptable services during a 2-node failure you’ll need to go to N+2, meaning that you need 2 nodes to provide the services and must be able to handle losing 2 nodes gracefully. In that case you’ll need 4 nodes, and so on. The larger the node count, the wiser it is to go to an N+2 model, and ideally you’ll provide separate failure domains over which the nodes are distributed. An example of this is having a redundant, geo-load-balanced web farm of 32 virtual machine nodes spread over 2 locations and running on separate hardware failover clusters in each location. As you can see, the higher the stakes and demands, the faster the cost and potential complexity rise. You can offload some of the complexity by leveraging a public cloud like Azure, but the costs will still be there. There is no such thing as a free lunch, although some are quite easy and affordable for what you get.

Conclusion

High availability has a price. I did mention that already, right? To be able to keep your services running at a level that is both workable and acceptable to your customers and stakeholders, you will need to over-provision to a degree. There is no magic here. When your solutions are being scrutinized by people with no real background, experience or context in high availability, you might need to explain this.

Load balancing UDP for a RD Gateway farm with a KEMP Loadmaster

When implementing load balancing for RD Gateway we must take care not to forget to load balance the UDP traffic. Your RDP connection will still work over HTTPS alone if you forget this, but you’ll miss out on the benefits:

  • Better experience on bad, unreliable network connections with high packet loss
  • Better experience with high-end graphics and, in general, a better graphical experience over WAN links.

Many people have load balanced their gateways since Windows Server 2008 (R2), when UDP was not in play yet, and as things still work without it, people might forget about it. The most important thing you need to know is that when leveraging UDP for RDP 8/8.1, the UDP session traffic has to leverage Direct Server Return (DSR) in the real servers configuration when we configure load balancing for an RD Gateway farm with a KEMP Loadmaster. I’m focusing on the UDP part here, not the HTTPS part. That’s been done enough and the Kemp info on that is sufficient. The UDP part could do with some extra info.

The reason for this is that when UDP is leveraged for high-end graphics, we want to avoid sending all that graphical network traffic through the load balancer. There is no real added value being performed there in this UDP use case, but the load might get quite high. This is where DSR is leveraged when configuring the Loadmaster. That means we also need to configure our real servers to use Direct Return as the forwarding method. When you forget this you’ll lose UDP with RDP 8.1, but you might not notice immediately if you’re not looking for it, as the HTTPS connection alone will let you connect and work, albeit with a reduced experience.

To read more on why it’s done this way (even if it seems complex and has drawbacks) see http://kemptechnologies.com/ca/white-papers/direct-server-return-it-you/. You’ll notice that for graphics it is a great idea. Selecting Direct Server Return as the forwarding method (see later) changes the destination MAC address of the incoming packet on the fly (very fast) to the MAC address of one of the real servers. When the packet reaches the real server, that server must think it owns the VS IP address, which it doesn’t. So we use the loopback adapter to let the real server reply as if it does, but we don’t respond to ARPs, as that would cause issues with the load balancer, which has the real IP of the virtual service. That’s where the 254 metric we configure in the demo below comes into play. Note that the real server responds over its normal NIC, which is great and helps with firewall rules not ruining the party.

That’s also why DSR, which leverages the loopback adapter on the RD Gateway servers, requires you to configure the weak host / strong host behavior for the network configuration on those servers: the real server is receiving and answering traffic for an IP address that isn’t its own! I’ll not go into details on this here, but basically since Windows Vista and Windows Server 2008 the security model has changed from weak host to strong host. This means that a system (that is not acting as a router) cannot send or receive any packets on a given interface unless the destination/source IP in the packet is assigned to that interface. In the “weak host” model, this restriction does not apply. Read more about this here. Let’s walk through this UDP/DSR/weak host setup & configuration.

On your Loadmaster you’ll create a virtual service for UDP traffic.

  • Select Virtual Services > Add New.
  • Enter the IP address of your RD Gateway Farm
  • Set 3391 as the Port.
  • Select udp for the Protocol.
  • Click Add this Virtual Service.

Open up the Standard Options section to configure the following …

image

  • We don’t need layer 7, as the UDP connections are tied to the HTTPS connection and they will spawn and die with that one.
  • We select Source IP Address as the Persistence Mode, as the RD Gateway needs persistence to guarantee the connections stay together on the same RD Gateway server. Don’t set the timeout value too high, so it isn’t remembered too long.
  • We select least connection as the scheduling method, as that’s the best option in most cases: let the farm node with the least load take on new connections. This is handy after downtime, for example.

Now head over to the Real Servers section

image

  • Make sure the Real Server Check Parameters option is set to ICMP Ping, which is what the LoadMaster uses to check if the RD Gateway servers are alive.
  • Click Add New to add an RD Gateway server; you’ll do this for each farm member.

image

  • Enter the Real Server Address for each RD Gateway.
  • Enter 3391 as the Port.
  • Select Direct return as the Forwarding method.
  • Click Add This Real Server.

When you’re done it looks like this:

image

So now we need to check whether the real servers are seen as online and healthy …

image

If one RD Gateway server is down or has an issue you see this … no worries, the LoadMaster sends all clients to the other farm member server.

image

Configure the RD Gateway farm servers to work with DSR

We’re not done yet, we need to configure our RD Gateway servers in the farm to work with DSR.

Go to Device Manager, right-click on the computer name and select Add legacy hardware

image

Click next on the welcome part of the wizard …

image

Select “Install the hardware that I manually select from a list (Advanced)” and click Next …

image

Scroll down to Network adapters, select it and click Next …

image

Under Manufacturer choose Microsoft and as Network Adapter scroll down to Microsoft KM-TEST Loopback Adapter, select it and click Next.

image

Click Next to install it …

image

image

Click Next to close the wizard.

image
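A side note: if you prefer to script this step rather than click through the wizard, the same Microsoft KM-TEST Loopback Adapter can also be installed with devcon.exe (the Device Console tool, which ships with the Windows Driver Kit and is not on the server by default):

devcon.exe install "$env:windir\inf\netloop.inf" *msloop

The new adapter shows up as just another network connection, so you still need to rename it as shown below.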

 

Now go to your network connections and change the name so you can easily identify the loopback adapter …

image

image

In the properties of the loopback adapter we disable everything we don’t need. In this case, we only need IPv4 and nothing else. We also need to configure the TCP/IP settings for the loopback adapter, so open up the TCP/IP v4 properties of that NIC …

image

Enter the IP address of the virtual service for UDP on the LoadMaster and, very important, enter a subnet mask of 255.255.255.255 for the loopback address. It’s a subnet of 1 host: the VIP address. Do not enter a gateway!

image

Now go to the advanced settings, deselect Automatic metric and fill out 254. This step prevents the server from responding to ARP requests for the MAC of the VIP with the MAC of the loopback adapter.

image

Also uncheck “Register this connection’s addresses in DNS” to avoid any name resolution problems for the real servers.

image

Finally, disable NetBIOS over TCP/IP.

image

What we are doing with all of the above is preventing normal network traffic to these real servers from being affected by the loopback adapter, whose one and only function is to enable DSR and nothing else. It’s a bit “paranoid”, but it pays to be, and it prevents problems.
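If you’d rather script the loopback adapter configuration above than click through the GUI, this is roughly what it looks like in PowerShell. It’s a sketch only: it assumes the adapter is named KEMP-DSR-LOOPBACK and uses 192.0.2.10 as a placeholder for the IP address of your UDP virtual service.

$nic = "KEMP-DSR-LOOPBACK"
$vip = "192.0.2.10"   # placeholder: use the VIP of your UDP virtual service

# The VIP with a 255.255.255.255 (/32) mask and no gateway
New-NetIPAddress -InterfaceAlias $nic -IPAddress $vip -PrefixLength 32

# A fixed interface metric of 254 instead of an automatic metric
Set-NetIPInterface -InterfaceAlias $nic -AddressFamily IPv4 -AutomaticMetric Disabled -InterfaceMetric 254

# Don't register this connection's addresses in DNS
Set-DnsClient -InterfaceAlias $nic -RegisterThisConnectionsAddress $false

# Disable bindings we don't need (IPv6 shown here; repeat for other components as needed)
Disable-NetAdapterBinding -Name $nic -ComponentID ms_tcpip6

# Disable NetBIOS over TCP/IP on the loopback adapter (2 = disabled)
Get-CimInstance Win32_NetworkAdapterConfiguration -Filter "Description LIKE '%Loopback%'" |
    Invoke-CimMethod -MethodName SetTcpipNetbios -Arguments @{ TcpipNetbiosOptions = 2 }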

Dealing with Strong Host / Weak Host setting in W2K8 and higher

We now still need to deal with the strong host security model and allow the LAN interface to receive traffic for the KEMP-DSR-LOOPBACK adapter, and allow that adapter to receive and send traffic from/to the LAN interface. This is done by executing the following commands:

netsh interface ipv4 set interface LAN weakhostreceive=enabled
netsh interface ipv4 set interface KEMP-DSR-LOOPBACK weakhostreceive=enabled
netsh interface ipv4 set interface KEMP-DSR-LOOPBACK weakhostsend=enabled
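If you prefer PowerShell over netsh, the equivalent (assuming the same LAN and KEMP-DSR-LOOPBACK interface names) would be something like:

Set-NetIPInterface -InterfaceAlias "LAN" -WeakHostReceive Enabled
Set-NetIPInterface -InterfaceAlias "KEMP-DSR-LOOPBACK" -WeakHostReceive Enabled -WeakHostSend Enabled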

That’s it. You should now see HTTP/UDP connections in your RD Gateway monitoring when using a load balancer that has been set up correctly. Remember, if this isn’t configured correctly you’ll still connect, but you lose the benefits the UDP connections offer.

Now, another thing you need to be aware of in your RD Gateway configuration is that for UDP to work with DSR, the UDP Transport Settings need to be configured for “all unassigned” IP addresses. Otherwise DSR won’t work and you’ll lose UDP. This makes sense: you’ll receive traffic on the VIP on your real servers. It’s just like DSR with a web server, where in IIS you’ll bind both the LAN and the loopback adapter to port 80 or 443 for the site.

image

We can see that one client is connected via RDSGW01 to two servers (Viking and Spartan) leveraging HTTP and UDP. The load balancing is done via the KEMP LoadMasters in a geo-redundant fashion.

image

Yes, my geo-load-balanced RD Gateway server farms are providing UDP support for the servers and clients we RDP into.

image

Combined with those servers and clients being spread amongst the sites, this provides enough business continuity to keep the shop running when a site fails, so it’s more than just connectivity!

A highly redundant Application Delivery Controller Setup with KempTechnologies

Introduction

The goal was to make sure the KempTechnologies LoadMaster Application Delivery Controller was capable of handling the traffic to all load-balanced virtual machines in a high-volume data and compute environment. Needless to say, the solution had to be highly available.

A highly redundant Application Delivery Controller Setup with KempTechnologies

The environment offers rack and row as failure units in power, networking and compute. Hyper-V cluster nodes are spread across racks in different rows. Networking is highly to continuously available, allowing for planned and unplanned maintenance as well as failure of switches. All racks have redundant PDUs that are remotely managed over Ethernet. There is a separate out-of-band network with remote access.

The 2 Kemp LoadMasters are mounted in a different row and a different rack to spread the risk and maintain high availability. Eth0 & eth2 are in an active-passive bond for a redundant management interface; eth1 is used to provide a secondary backup link for HA. These use the switch-independent redundant switches of the rack, which also uplink (VLT) to the Force10 switches (spread across racks and rows themselves). The two 10Gbps ports are in an active-passive bond to trunked ports of the two redundant, switch-independent 10Gbps switches in the rack. So we also have protection against port or cable failures.

image

A tip: use TRUNK for the port mode, not GENERAL, with DELL switches.

This design gives us a lot of capabilities. We have redundant networking for all networks. We have active-passive LoadMasters, which means:

  • Failover when the active one fails
  • Non-service-interrupting firmware upgrades
  • The rack is the failure domain. As each rack is in a different row, we also mitigate “localized” issues (power, maintenance affecting the rack, …)

Combine this with the fact that these are bare-metal LoadMasters (DELL R320 with iDRAC; see Remote Access to the KEMP R320 LoadMaster (DELL) via DRAC Adds Value) and we have out-of-band management even when we have network issues. The racks are provisioned with PDUs that are managed over Ethernet, so we can even cut the power remotely if needed to resolve issues.

Conclusion

The results are very good and we got “zero ping loss” failover between the LoadMaster nodes during testing.

We have a solid, redundant Application Delivery Controller deployment that does not break the switch-independent TOR setup that exists in all racks/rows. It’s active-passive at the controller level and active-passive at the network (bonding) level. If that is an issue, the TOR switches should be configured as MLAGs. That would enable LACP for the bonded interfaces. At the LoadMaster level these could then be configured as a cluster to get an active-active setup, if some of the restrictions this imposes are not a concern in your environment.

Important Note:

Some high-end switches, such as the Force10 series with VLT, support attaching single-homed devices (devices not attached to both members of a VLT). While VLT and MLAG are very similar, MLAGs come with their own needs & restrictions. Not all switches that support MLAG can handle single-homed devices. The obvious solution is not to attach single-homed devices, but that is not always a possibility with certain devices. That means other solutions are needed, which could lead to a significant rise in the number of switches required, defeating the economics of affordable redundant TOR networking (cost of switches, power, rack space, operations, …), or to leveraging MSTP and configuring a dedicated MSTP network for a VLAN, which also might not always be possible or feasible to solve the issue. Those single-homed devices might very well need to be in the same VLANs as the dual-homed ones. Stacking would also solve the above issue, as the MLAG restrictions do not apply. I do not like stacking, however, as it breaks the switch-independent redundant network design, even during planned maintenance, as a firmware upgrade brings down the entire stack.

One thing that is missing is the ability to fail over when the network fails; there is no concept of a “protected” network. This could help mitigate issues where a virtual service is down due to network problems: the LoadMaster could try to fail over to see if we have more success on the other node. For certain scenarios this could prevent long periods of downtime.