RDMA Over RoCE With DCB Requires Tagged Non Default VLANs

It’s DCB That Requires This

For those of you who are experimenting with the RoCE variant of RDMA for SMB Direct in Windows Server 2012 (R2), make sure you have a VLAN tag in your configuration if this is more than a simple RDMA over two NICs. The moment you get DBC with PFC & ETS involved you’ll need non default tagged VLANs. Do note that PFC alone is good enough, ETS is strictly speaking not a requirement, but I’d consider doing it if you can.

With Enhanced Transmission Selection (ETS) the network traffic type is classified using the priority value in the VLAN tag of the Ethernet frame. The priority value is the Priority Code Point (PCP), which is described in the IEEE 802.1Q specification and uses a 3-bit field in the VLAN tag with eight possible priority values (0 to 7).

Priority-based Flow Control (PFC) allows to individually pause priorities of tagged traffic and helps to provide lossless or “no drop” behavior for a certain priority at the receiving port. As  above, each frame transmitted by a sending port is tagged with a priority value (0 to 7) in the VLAN tag. So for the traffic pause and resume functionality to work we need a VLAN tag to carry the priority value.

Does It Work Without?

But you’ll tell me that, as you may be lacking a DCB capable switch for lab purposes, you used a direct cable between your two RoCE NICs. And guess what RoCE, might have indeed worked for you without a VLAN tag. You can test & get a feel for what RoCE/RDMA can do for you with just the NICs. But as there is no switch involved you’re not using DCB for PFC/ETS and without that the need for the tagged VLAN isn’t there. Also see https://blog.workinghardinit.work/2013/05/03/smb-direct-roce-does-not-work-without-dcbpfc/.

So there you go. Design your RoCE/RDMA network based on DCB with PFC( and ETS) and not just on the tests with an direct cable or you might miss a few details that are quite important. Happy testing!

Verifying SMB 3.0 Multichannel/RDMA Is Working In Windows Server 2012 (R2)

So you have spend some money on RDMA cards (RoCE in this example), spent even more money on 10Gbps Switches with DCB capabilities and last but not least you have struggled for many hours to get PFC, ETS, … configured. So now you’d like to see that your hard work has paid of, you want to see that RDMA power that SMB 3.0 leverages in action. How?

You could just copy files and look at the speed but when you have sufficient bandwidth and the limiting factor is in disk IO for example how would you know? Well let’s have a look below.

You can take a look at performance monitor for RDMA specific counters like “RDMA Activity” and “SMB Direct Connection”.image

Whilst copying six 3.4GB ISO files over the RDMA connection we see a speed of 1.05 GB/s. Not to shabby.  But hay nothing a good 10Gbps with TCP/IP can handle under the right conditions.image

It’s the RDMA counters in Performance Monitor that show us the traffic that going via SMB Direct.image

Another give away that RDMA is in play comes from Task Manager, Performance counters for the RDMA NIC => 1.3Mbps send traffic can’t possibly give us 1.05GB/s in copy speed magically Smile

image

When you run netstat –xan (instead of the usual –an) you get to see the RDMA connection. The mode is “Kernel” instead of the usual “TCP” or “UDP” with –an showing the TCP/IP connections/Listerners.

 image

If you want to go all geeky there is an event log where you look at RDMA events amongst others. Jose Baretto discusses this in Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-3 using 10GbE/40GbE RoCE – Step by Step with instructions how to use it. You’ll need to go to Event Viewer.On the menu, select “View” then “Show Analytic and Debug Logs”
Expand the tree on the left: Applications and Services Log, Microsoft, Windows, SMB Client, ObjectStateDiagnostic. On the “Actions” pane on the right, select “Enable Log”
You then run your RDMA work. And then disable the log to view the events. Some filtering & PowerShell might come in handy to comb through them.image

Design Considerations For Converged Networking On A Budget With Switch Independent Teaming In Windows Server 2012 Hyper-V

Last Friday I was working on some Windows Server 2012 Hyper-V networking designs and investigating the benefits & drawbacks of each. Some other fellow MVPs were also working on designs in that area and some interesting questions & answers came up (thank you Hans Vredevoort for starting the discussion!)

You might have read that for low cost, high value 10Gbps networks solutions I find the switch independent scenarios very interesting as they keep complexity and costs low while optimizing value & flexibility in many scenarios. Talk about great ROI!

So now let’s apply this scenario to one of my (current) favorite converged networking designs for Windows Server 2012 Hyper-V. Two dual NIC LBFO teams. One to be used for virtual machine traffic and one for other network traffic such as Cluster/CSV/Management/Backup traffic, you could even add storage traffic to that. But for this particular argument that was provided by Fiber Channel HBAs. Also with teaming we forego RDMA/SR-IOV.

For the VM traffic the decision is rather easy. We go for Switch Independent with Hyper-V Port mode. Look at Windows Server 2012 NIC Teaming (LBFO) Deployment and Management to read why. The exceptions mentioned there do not come into play here and we are getting great virtual machine density this way. With lesser density 2-4 teamed 1Gbps ports will also do.

But what about the team we use for the other network traffic. Do we use Address hash or Hyper-V port mode. Or better put, do we use native teaming with tNICs as shown below where we can use DCB or Windows QoS?

image

Well one drawback here with Address Hash is that only one member will be used for incoming traffic with a switch independent setup. Qos with DCB and policies isn’t that easy for a system admin and the hardware is more expensive.

So could we use a virtual switch here as well with QoS defined on the Hyper-V switch?

image

Well as it turns out in this scenario we might be better off using a Hyper-V Switch with Hyper-V Port mode on this Switch independent team as well. This reaps some real nice benefits compared to using a native NIC team with address hash mode:

  • You have a nice load distribution of the different vNIC’s send/receive traffic over a single member of the NIC team per VM. This way we don’t get into a scenario where we only use one NIC of the team for incoming traffic. The result is a better balance between incoming and outgoing traffic as long an none of those exceeds the capability of one of the team members.
  • Easy to define QoS via the Hyper-V Switch even when you don’t have network gear that supports QoS via DCB etc.
  • Simplicity of switch configuration (complexity can be an enemy of high availability & your budget).
  • Compared to a single Team of dual 10Gbps ports you can get a lot higher number of VM density even they have rather intensive network traffic and the non VM traffic gets a lots of bandwidth as well.
  • Works with the cheaper line of 10Gbps switches
  • Great TCO & ROI

With a dual 10Gbps team you’re ready to roll. All software defined. Making the switches just easy to use providers of connectivity. For smaller environments this is all that’s needed. More complex configurations in the larger networks might be needed high up the stack but for the Hyper-V / cloud admin things can stay very easy and under their control. The network guys need only deal with their realm of responsibility and not deal with the demands for virtualization administration directly.

I’m not saying DCB, LACP, Switch Dependent is bad, far from. But the cost and complexity scares some people while they might not even need. With the concept above they could benefit tremendously from moving to 10Gbps in a really cheap and easy fashion. That’s hard (and silly) to ignore. Don’t over engineer it, don’t IBM it and don’t go for a server rack phD in complex configurations. Don’t think you need to use DCB, SR-IOV, etc. in every environment just because you can or because you want to look awesome. Unless you have a real need for the benefits those offer you can get simplicity, performance, redundancy and QoS in a very cost effective way. What’s not to like. If you worry about LACP etc. consider this, Switch independent mode allows for nearly no service down time firmware upgrades compared to stacking. It’s been working very well for us and avoids the expense & complexity of vPC, VLT and the likes of that. Life is good.

Windows Server 2012 Supports Data Center Bridging (DCB)

Data Center Bridging (DCB) is a collection of standards-based end-to-end networking technologies that allow Ethernet to act as the unified fabric for multiple types of traffic in the data center. You cannot put a bunch of traffic types / protocols on the same physical pipes if you have no way of guaranteeing that they will each get what they need when they need it based on priority and impact. Even with ludicrous over provisioning you could still run into issues and even if that’s not the case that’s a very expensive option. When you think about iSCSI, Remote Direct Memory Access (RDMA) and Fibre Channel over Ethernet (FCoE) you can see where the benefits are to be found. We just can keep adding network after network infrastructure for all these applications on a large scale.

  • Integrates with the standard Ethernet networks
  • Prevents congestion in NIC & network by reserving bandwidth for particular traffic types giving better performance for all
  • Windows 2012 provides support & control for DCB and allows to tags packets by traffic type
  • Provides lossless transport for mission critical workloads

You can see why this can be handy in a virtualized world evolving in to * cloud infrastructure. By enabling multiple traffic types to use an Ethernet fabric you can simplify & reduce the  network infrastructure (hardware & cabling).  In some environments this is a big deal. Imagine that a cloud provider does storage traffic over Ethernet on the same hardware infrastructure as the rest of the Ethernet traffic. You can get rid of the isolated storage-specific switches and HBAs reducing complexity, and operational costs. Potentially even equipment costs, I say potentially because I’ve seen the cost of some unified fabric switches and think your mileage may vary depending on the scale and nature of your operations.

Requirements for Data Center Bridging

DCB is based on 4 specifications by the DCB Task Group

  1. Enhanced Transmission Selection (IEEE 802.1Qaz)
  2. Priority Flow Control (IEEE 802.1Qbb)
  3. Datacenter Bridging Exchange protocol
  4. Congestion Notification (IEEE 802.1Qau)

3. & 4. are not strictly required but optional (and beneficial) if I understand things correctly. If you want to dive a little deeper have a look here at the DCB Capability Exchange Protocol Specification and have a chat with your network people on what you want to achieve.

You also need support for DCB in the switches and in the network adaptors. 

Finally don’t forget to run Windows Server 2012 as the operating systems Winking smile. You can find some more information on TechNet  Data Center Bridging (DCB) Overview but it is incomplete. More information is coming!

Understanding what it is and does

So, in the same metaphor of a traffic situation like we used with Data Center TCP  we can illustrate the situation & solution with traffic lanes for emergency services and the like. Instead of having your mission critical traffic stuck in grid lock like the fire department trucks below

image

You could assign an reserved lane, QOS, guaranteed minimal bandwidth, for that mission critical service.  Whilst you at it you might do the same for some less critical services that none the less provide a big benefit to the entire situation as well.

image