NIC Firmware/Driver Updates Reset Your RSS/VMQ optimizations

When optimizing your RSS/VMQ settings for maximum performance you’ll normally (I hope) do this in PowerShell. Save that script with some comments on why you configure it that way and make it part of your Hyper-V host deployment scripts

Why? Automation is king but you’ll need it again for sure. Why? Well there is this “tendency” that NIC firmware/driver updates reset your RSS/VMQ optimizations back to their defaults.That’s a bit of a bummer if you have to redo all the work instead of having a script ready to go. I have seen many a deployment where the configuration was missing after firmware/driver upgrades so please, check!

image

Figure: Where has my optimized configuration gone after a driver/firmware upgrade?

The good news is this isn’t a show stopper issue as things will keep working, but without your optimizations and with VMQ, depending on your NIC team setup for the vSwitch issues might occur. When doing NIC teaming for your virtual switch it’s important to get it right.  With switch dependent teaming (LACP/Static) the NICs in the team need to use overlapping processor sets (Min Queues). When doing switch independent teaming the NICs in the team need to use non-overlapping processor sets. So you need to configure each NIC in your team to use the different processors (Sum of Queues).

On top of that you might want to / should separate RSS/VMQ cores from each other. SMB Direct for CSV/LM will also help achieve this as there we leverage CPU offloading to the NIC.

Live Migration over SMB Direct leaves more CPU cycles for Virtual RSS (vRSS) in Windows Server 2012 R2

I recently (January 22nd 2014) gave a WebCast presentation for the Dutch Windows Management User Group (@WMUG_NL) in which I made the case for using SMB Direct with Live Migration to save CPU cycles other (VM) workloads. There are several areas where the CPU cycles are better spent but I used vRSS to show case one scenario.

We’re using a 2 node Windows Server 2012 R2 Hyper-V cluster on Dell PowerEdge R720 servers with Mellanox ConnectX-3 (CSV  &  live migration) and Intel X520-DA (Hyper-V switch), all 10Gbps.

This is what a CPU bottleneck looks like that can be solved by using vRSS in Windows Server 2012 R2.image

The host machines are Hyper Threading enabled. The virtual switch is attached to a switch independent NIC team with dynamic mode. In this setup it’s normal that the sending VM is leveraging both members while the receiving VM traffic is coming in over one member of the host team.

No let’s enable vRSS in the VM and see what this does for this picture.image

Pretty impressive isn’t it. DidierTest03 is the sending VM running on host A and DidierTest04 is the receiving VM that has vRSS enabled and is running on Host B. For vRSS you need both hosts and VMs to run Windows Server 2012 or Windows 8.1. You can see the load is spread across 7 vCPUs in the VM. DidierTest04 has 8 vCPUs. I configured vRSS in the VM to be able to use 7 vCPUs and leave vCPU 0, the default one, alone to handle those workloads.

image

Given multiple Logical CPUs & vCPUs we can get line speed with 10Gbps inside a virtual machine. This, ladies and gentlemen is a thing of beauty.

Now tell me, if you have business related needs for those CPU cycles why would you not offload the work that needs to be done for live migration to the NIC via SMB direct? This is about getting maximum VM density, performance & ROI form your infrastructure, whilst saving on servers, power and cooling. When you see the smile on your clients or bosses face, just say “you’re welcome” and smile back Open-mouthed smile.

Enabling Jumbo Frames Inside Virtual Machines Enhances Throughput & Reduces CPU Load

Let’s play a bit with a Windows Server 2012 R2 Hyper-V cluster with 2*10Gbps Intel X520 teamed switch independent and in dynamic mode, which is optimal for DVMQ. On these NICs we enabled Jumbo frames (and on the switches of course). That team is used to create a virtual switch for consumption by the virtual machines. The switches used are 2*DELL PowerConnect 8132F. So we have full fault tolerance. But the important thing to note is that this is commodity, quality hardware that we can leverage for great results.

Now we’ll compare 2 scenarios. In both test a sending VM will try to saturate a receiving VMs network bandwidth. Due to how this NIC teaming setup works, that’s about 10Gbps in a two member 2*10Gbps team. We work around this by leaving both VMs on the same host, so the traffic doesn’t need to pass across the wire. The VMs have VMQ & vRSS enabled with the host team members having VMQ enabled.

Jumbo frames disabled inside the VM (Both VMs on same host)

Without Jumbo Frames enabled in the Guest VM and all other things being equal the very best we can achieve non-sustained is 21Gbps (average +/- 17Gbps)  receiving traffic in a VM. Not bad, not bad at all.

image

In the picture you can see the host with DVMQ doing it’s job at the left while vRSS is at work in the VM. Pretty clear.

Jumbo frames enabled inside the VM

Now, let’s enable Jumbo frames in the VM.  Fire up PowerShell or use the GUI.

image

Don’t forget to do this on both the sending and the receiving VM Smile. Here we get +/- 30Gbps receiving traffic inside of the VM. A nice improvement isn’t it? Not just that but we consume less CPU resources as well! Sweet Smile

image

Useful Power at your finger tips or just showing off?

vRSS & DVMQ is one of may scalability & performance improvements in Windows 2012 R2. And yes, Jumbo Frames inside the VM do make a difference but in a 10Gbps environment it’s not  “in your face” that visible. The 10Gbps limit of a single NIC team member makes this your bottle neck. But it DOES help to reduce CPU cycles in that case. Just look at the two screen shots below.

No Jumbo frames in VM (sending & receiving VMs on different hosts)

image

We’re consuming 7% CPU resources on the host and 15% in the VM.

Jumbo Frames in VM (sending & receiving VMs on different hosts)

image

We’ve dropped down to 3% on the host and to 9% inside the VM. All bits help I say!

How far can we push this?

Again if you take the NIC Team bottleneck out of the way you can see some serious differences. Take a look at the screenshot below, that’s 36,2Gbps inside of a VM courtesy of vRSS during some other experiments. Tallyho!

image

So let’s face it, I guess we’ll need some faster memory (DDR4) and multiple 40Gbps/100Gbps Cards to see what the limits of Windows Server 2012 R2 are or to find out if we reached it. Right now the operating system is giving hardware people a run for their money.

Also note that if you have a number of VMs doing a lot of network IO you’ll be using quite a number of CPU cycles. While vRSS & DVMQ make this scale you might want to consider leveraging SMB Direct for the various tastes of live migration as this will definitely help you out on that front as the NIC will do the heavy lifting.

Reality Check

But perhaps we also need little reality check. While 100Gbps and DDR4 is very nice you might not need it for your current needs. When the environment is built right you’ll find that your apps are usually your limiting factor before the hardware, let alone Windows Server 2012 R2.  So why is knowing this important? Well I verify Microsoft claims so I can talk from experience and not just from what I read in a Microsoft presentation. Secondly you can trust that your investment in Windows Server 2012 R2 is going to carry you long and far. It’s future proofed and that’s god for you. Both when your when needs grow exponential and for the longevity of your environment. Third, we can leverage this to virtualize high through put environments and get the best possible results and ROI.

Also, please, please test & find out, verify what settings suit your environment best and just just blindly enable stuff. Good luck!

Windows Server 2012 R2 Virtual RSS (vRSS) In Action

The Need For Speed

My clients or employers are normally into lots of data, large files &  number crunching. That means they like CPU power, memory, bandwidth & IOPS. That doesn’t mean they want to spend fortunes but it does mean I got a standing order to get every last ounce of optimization and efficiency out of commodity hardware.

Today that means we prefer to work with DELL generation 12 servers like the R620 & R720 which are the best possible value for money you can get. They deliver all features in box at no extra cost / licensing crap and they can be fine tuned and optimized for top performance. Windows Server 2012 R2 can handle loads higher than the hardware today can throw at it so you want all you can get without breaking the bank. Add Intel X520/540, Mellanox (RoCE) or Cheslio (iWarp) 10Gbps cards and you’re ready to rumble with some nice PowerConnect 81XX Series or even the Force10 S4810 10Gbps switches.

CPU Power is not an issue, just bring money. You have 8-12 core CPUs. You can scale up to 4 socket systems & scale out as well. Licensing is probably your biggest concern here due to cost.

Memory?  DDR3 is readily available and compared to other costs cheap. DDR4 is on the way. For max performance you possibly won’t do Dynamic Memory to avoid a NUMA hit but otherwise it helps with efficiencies.

Network. We’ll you’ve seen/heard/read me on topics like 10Gbps, RDMA/SMB Direct, (Dynamic VMQ) but today we’ll look at virtual RSS. In Windows Server 2008 R2 VMQ helped us beyond the scalability issues of a single core on the host having to handle all interrupts for the traffic going the virtual machines. In Windows 2012 that got optimized with Dynamic VMQ. For workloads that need the absolute best performance & lowest latency we got SR-IOV support. That works fine but it has some potentially important draw backs on bot the manageability (NIC teaming needs to be done inside the guest) & security front (it bypasses the virtual switch, so no ACLs or NVGRE).

Today with Windows 2012 R2 we have a new optimization for network traffic in a virtual environment. It’s “Virtual RSS” or vRSS. I tell you it’s sweet and you’d do well to investigate this and enable it in your environment especially if, like we, you move lots of data around & virtualization is default. We don’t do physical unless for very strict reasons, i.e. it cannot be virtualized. Otherwise … no excuses, the economics and benefits are just to good not to do so. It’s not as supper low latency as SR-IOV but depending on your needs you might never notice and … it plays nicely with all other network features in Hyper-V.

Show me vRSS already!

Inside the guest we see vRSS in action as multiple cores are put in to action to handle the interrupts of incoming traffic. This takes away the single vCPU bottle neck.

image

And the result is a sweet 17.2Gbps from VM1 to VM2. For the sharp eyed ones amongst you, they are on the same host so yes in this case we get top bandwidth as the traffic doesn’t have to go across the wire over the 2*10Gbps NIC team but stays within the same host or better, vSwitch.

image

The GUI is very friendly and suggest I can go to 100Gbps networking hardware Smile Well, not yet I’m afraid, but I’ve taken note Winking smile.

Here’s what it looks like when the sending and receiving VM are on different hosts and the vSwitch is connected to a 2*10Gbps team, switch independent, dynamic mode. You “only” get 10Gbps as the team can send on all members but receive on only one. But that’s still fine.

image

So how this look like on the host?

image

Here you see Dynamic VMQ in action. To prevent one core of becoming overloaded the host puts more of them into service. This depends on the load and it’s dynamic. Hence Dynamic VMQ. Where VMQ was great it still a limiting factor as you used to be tied to one core / VM.

This means that our network traffic processing is no longer limited by the OS or better the use of a single core, both inside the VM and on the host. Our bottleneck now is the maximum throughput the NICs can deliver. In our 2 member NIC team that’s 20Gbps max under the right circumstances. Yup. Line speed. Need more? Throw 40Gbps NICs at the problem or even 100Gbps pretty soon. Windows Server 2012 R2 is ready for this.

Sharing the wealth

Now to make sure a bunch of these VMs on your cluster don’t starve the rest of the VM population with their greed for and lust after bandwidth you have the option of using Bandwidth management on the VM NIC.

image

This is one of the cases where this option can be very useful.

VM mobility without boundaries

Also consider this capability and this type of high workloads combined while leveraging SMB Direct. This offloads the processing of CSV traffic,  Live Migration, Shared Nothing Live Migration and under certain conditions Storage Live Migration to the NIC by leveraging RDMA.  In other words it doesn’t tax your host CPUs. This means you can have these kinds of network traffic loads going on and still live migrate at will. Scalable VM mobility anyone? You’ll understand what tremendous network loads the combination Windows Server 2012 R2 Hyper-V host & guests can handle.

It’s virtualization, not magic

OK, time for a little reality check, just in case it’s needed. Virtualization is technology, not magic. For all of you “thinking” they can push 20GBps into 20 VMs simultaneously on single host over just a single Team of 2 *10Gbps … ah well, get real Smile OK?

Teaser

I can talk about the benefits of vRSS all day and show you some more screenshots but I’m working on “vRSS, The Movie’”.  Perhaps even a sequel already.