DELL released Replay Manager 8.0

On September 4th, 2019, Dell released Replay Manager 8.0 for Microsoft Servers, which brings us official Windows Server 2019 support. You can download it here: https://www.dell.com/support/home/us/en/04/product-support/product/storage-sc7020/drivers. The Dell Replay Manager Version 8.0 Administrator’s Guide and the release notes are here: https://www.dell.com/support/home/us/en/04/product-support/product/storage-sc7020/docs

Replay Manager 8.0.0.13 was released early September 2019

I have Replay Manager 8.0 up and running in the lab and in production. The upgrade was fast and easy, and everything kept working as expected. The good news is that the Replay Manager 8 service is compatible with the Replay Manager 7.8 manager and vice versa. This meant there was no rush to upgrade everything ASAP; we could do smoke testing at a relaxed pace before we upgraded all hosts.

Replay Manager 8.0 adds official support for Windows Server 2019 and Exchange Server 2019. I had already been testing Windows Server 2019 with Replay Manager 7.8 for many months, taking snapshots every 30 minutes with very few issues. But now we have official support. Replay Manager 8.0 also introduces support for SCOS 7.4.

No improvements with Hyper-V backups

Now, we don’t have SCOS 7.4 running yet. It will take another few weeks to reach generally available status. But for now, on both Windows Server 2016 and 2019 hosts, we noticed the following disappointing behaviour with Hyper-V workloads: Replay Manager 8 still acts as a Windows Server 2012 R2 requestor (backup software) and hence isn’t as fast and effective as it could be. I do not expect SCOS 7.4 to make a difference here. If you leverage the hardware VSS provider with backup software that does support the Windows Server 2016/2019 backup mechanisms for Hyper-V, this is not an issue. For that I mostly leverage Veeam Backup & Replication.

Unless I am missing something here, it is a missed opportunity that after so many years the Replay Manager requestor still does not support the native Windows Server 2016/2019 Hyper-V backup capabilities. And once again, I didn’t even mention the fact that individual Hyper-V VM backups in Replay Manager need modernization to deal with VM mobility. See https://blog.workinghardinit.work/2017/06/02/testing-compellent-replay-manager-7-8/. I also won’t mention Live Volumes as I did then. As I now leverage Replay Manager as a secondary backup method, not a primary one, I can live with this. But it could be so much better. I really need a chat with the PM for Replay Manager. Maybe at Dell Technologies World 2020, if I can find a sponsor for the long-haul flight.

Latency kills

Introduction

I was investigating a very problematic Windows Server 2016 Hyper-V cluster. That cluster was performing horribly. “Everything” was hanging, stalling or crashing, RHS.exe errors were flying around and WER dumps were being created by the dozen. Things were extremely slow, to the point that functionality was simply failing. The “fun” thing was that the cluster validation wizard, while slow, gave that cluster a big thumbs up and a supported status, as if all was well.

Prying around

Time to pry around a bit and see if we could find something wrong. We saw live migrations stall, fail, last forever in pending or get stuck at a certain percentage, sometimes finally succeeding with ridiculous blackout times. We could not open virtual machine properties, or only very slowly. The FCI GUI was highly unresponsive, but so were the Hyper-V Manager GUI and even PowerShell, which hung at merely loading the virtual machines or enumerating them with Get-VM. Everything was slow to the point that it timed out or crashed. Restarting the services (Cluster, Hyper-V) didn’t do anything, and restarting VMMS was super slow or just got stuck. It was a depressing sight, for which people tended to blame Hyper-V / Microsoft.
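
To put a number on that sluggishness, here is a minimal sketch of the kind of quick timing checks involved (it assumes the Hyper-V and FailoverClusters PowerShell modules are present on the host; nothing here is specific to this cluster):

#Time basic enumeration - on a healthy host these return in seconds, not minutes
Measure-Command { Get-VM } | Select-Object TotalSeconds
Measure-Command { Get-ClusterSharedVolume } | Select-Object TotalSeconds

#A quick look at cluster group and CSV state while you are at it
Get-ClusterGroup | Select-Object Name, OwnerNode, State
Get-ClusterSharedVolume | Select-Object Name, OwnerNode, State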

As the title gives away, it was latency. Not just ordinarily high latency, but really bad latency. The kind of latency that kills. Extreme latency produces symptoms that are similar to bugs or corrupt components of roles and features. We have a tendency to look at those first in the event logs, and then we look at the network and its usual suspects (VMQ, SET, DCB). But nothing pointed to an issue that I could find.
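
For what it is worth, a minimal sketch of such quick host-side checks; the 50ms cut-off is just an illustrative threshold, not an official limit:

#The usual network suspects: VMQ and RDMA state on the pNICs, plus any DCB/QoS policies
Get-NetAdapterVmq | Format-Table Name, Enabled
Get-NetAdapterRdma | Format-Table Name, Enabled
Get-NetQosPolicy

#A first feel for storage latency as the host sees it (CookedValue is in seconds, so 0.05 = 50ms)
Get-Counter '\PhysicalDisk(*)\Avg. Disk sec/Read', '\PhysicalDisk(*)\Avg. Disk sec/Write' -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples | Where-Object CookedValue -gt 0.05 } |
    Format-Table Path, CookedValue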

So, storage maybe? Well, we did find one Hyper-V host in the cluster with one HBA port producing too many errors, so we disabled that FC port for testing. No joy; after a clean reboot of all nodes the Hyper-V cluster remained problematic. So, on to the storage array itself.

Well, holy smoke! On the two volumes for CSVs in that cluster we saw latencies that were so bad I could not even believe a single VM would boot. It actually made my appreciation for Hyper-V and clustering grow, as they still managed to do at least a couple of things. With such latencies I would expect the services to just crash and call it a day.

[Figure: the horrific latency on one of the CSV LUNs.]

Looking at the logs, we saw that the latencies occurred on the FC HBAs of the controllers. Each one was above 50ms, peaking at 150-250ms, with one huge peak at almost 500ms. We saw this on all four HBAs.

[Figure: the latency on one of the four FC HBAs on one of the controllers. Not a good day. All HBAs showed similarly high latencies.]

The issues were not at the host level (host HBAs), nor at the IOPS/bandwidth level of the storage itself. The latency, for some reason, was spiking. Further investigation led to the conclusion that the issue was related to synchronous replication going totally wrong. Moving the replication mode to asynchronous fixed it. We’re now investigating why this happened and how to prevent it from happening again. But that’s another story.

[Figure: latency on one of the four FC HBAs on one of the controllers after we fixed the issue.]

Do not assume anything

So, there you go. Everything depends on everything, in some direct or indirect way. It’s all connected, and that, my friends, is why I’m a proponent of “service resilience engineering”, where the responsible team owns the entire stack. That is how you can act fast.

Windows Server 2016 RDMA and the Hyper-V vSwitch – Part I

Introduction

With Windows Server 2012 R2, using both RDMA and the Hyper-V vSwitch on the same host required separate physical network adapters (pNICs). There are two reasons for this.

  • First, a vSwitch is generally created on top of a native Windows NIC team, and such a NIC team does not expose RDMA capabilities.
  • Second, in Windows Server 2012 R2 you cannot expose RDMA capabilities via a vSwitch, even when you are using a non-teamed, RDMA-capable NIC.

As a result, the need for RDMA required more NICs on the Hyper-V hosts, and/or a fully converged setup had some serious drawbacks. As servers have been quite capable and our VMs serve ever more intensive workloads, this was not dramatic. Leveraging 2*10Gbps for a vSwitch and 2*10Gbps for redundant RDMA / SMB Direct traffic has long been one of my favorite designs. It leaves room for other traffic, such as backups, and it allows for high VM density. But with 40Gbps NICs that is overkill and a tad expensive in many scenarios, even when connecting to a SOFS share for Hyper-V storage, so 4*40Gbps on a Hyper-V host is not something I ever saw in real life.

Windows Server 2016 can expose RDMA capabilities via a vSwitch even without SET

What many people seem to have missed is that reason two is gone in Windows Server 2016 Hyper-V. Reason one still holds true, but it has been solved by Switch Embedded Teaming (SET). This means that you do not actually need SET to leverage RDMA with a vSwitch in Windows Server 2016 Hyper-V. You can do this as follows:

#Create a vSwitch
New-VMSwitch -Name "RDMACapable-vSwitch" -NetAdapterName "NODE-A-S4P1-SW12P05-SMB1"

#Now add a host vNIC for the SMB Direct traffic
Add-VMNetworkAdapter -SwitchName "RDMACapable-vSwitch" -Name SMB1 -ManagementOS

#Enable RDMA on it
Enable-NetAdapterRdma "vEthernet (SMB1)"

#Grab that vNIC on the management OS and set the VLAN - PFC requires tagged VLANs
$NicSMB1 = Get-VMNetworkAdapter -Name SMB1 -ManagementOS
Set-VMNetworkAdapterVlan -VMNetworkAdapter $NicSMB1 -Access -VlanId 110


Below is what this looks like. We have one vNIC on the management OS leveraging RDMA / SMB Direct, consuming all 10Gbps of the NIC we connected to the vSwitch. This is a nice lab demo, but you can see this is perhaps not the best idea in real life.

[Screenshot: the host vNIC leveraging RDMA / SMB Direct at the full 10Gbps of the pNIC behind the vSwitch.]
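
To confirm that SMB Direct is actually in use over that vNIC, a quick sketch using the in-box SMB cmdlets and counters (run it while copying data to or from an SMB 3 share):

#Check that SMB sees the host vNIC as RDMA capable
Get-SmbClientNetworkInterface | Where-Object RdmaCapable

#Confirm the active connections are actually using RDMA (SMB Direct)
Get-SmbMultichannelConnection

#Live RDMA traffic counters - these only show activity when SMB Direct is really being used
Get-Counter '\RDMA Activity(*)\RDMA Inbound Bytes/sec', '\RDMA Activity(*)\RDMA Outbound Bytes/sec'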

Other things to note

Do realize this still requires the pNIC to be RDMA capable. This is not some sort of soft RoCE or other software RDMA magic as of today. The pNIC also has to have RDMA enabled, or the virtual NIC won’t be able to leverage RDMA and will fall back to SMB Multichannel only instead of SMB Direct. Likewise, RDMA has to be enabled on the vNIC as well. So don’t forget: RDMA must be enabled on both the pNIC and the vNIC for this to work.
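
A quick way to check, and fix, both ends is sketched below; the adapter names match the example above:

#RDMA state on the physical NIC and on the host vNIC
Get-NetAdapterRdma -Name "NODE-A-S4P1-SW12P05-SMB1", "vEthernet (SMB1)" | Format-Table Name, Enabled

#Enable it wherever it is off
Enable-NetAdapterRdma -Name "NODE-A-S4P1-SW12P05-SMB1"
Enable-NetAdapterRdma -Name "vEthernet (SMB1)"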

DCB’s PFC/ETS requires a tagged VLAN to carry the priority, so don’t forget to tag the vNIC. There is actually no need to tag the pNIC, as long as the switch port has the tagged VLAN set, most likely as a trunk or in general mode. If you don’t tag consistently across the entire network stack, you’ll have network issues anyway and RDMA performance will be bad, if it works at all.
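
For completeness, a minimal host-side DCB sketch; priority 3 and the 50% ETS reservation are just common example values, adjust them to your own design and make sure the physical switches are configured to match:

#Tag SMB Direct (TCP port 445) traffic with priority 3
New-NetQosPolicy "SMBDirect" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

#Enable PFC for priority 3 only
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

#Apply DCB on the pNIC and reserve bandwidth for SMB Direct via ETS
Enable-NetAdapterQos -Name "NODE-A-S4P1-SW12P05-SMB1"
New-NetQosTrafficClass "SMBDirect" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS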

Finally, don’t forget that this example is not using VMM / Network Controller, and as such it uses Set-VMNetworkAdapterVlan and not Set-VMNetworkAdapterIsolation.

In real life, we need better and more than a single NIC vSwitch

The caveat here is that, while you have a converged setup, you have no redundancy for the vSwitch (there is no team). It also means that you’re limited to the throughput of a single NIC for that vSwitch. Depending on the needs of the solution, that might be perfectly fine. If it’s not, and in most real-world scenarios you’ll need redundancy, you have to use SET in a converged scenario. That’s what we’ll take a look at in part 2. Then there is the question of QoS, as you don’t want SMB Direct traffic to consume too much bandwidth at will. That’s yet another issue to discuss and address.
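
As a small preview of part 2, the SET variant of the switch creation looks roughly like this; the two pNIC names are placeholders for RDMA-capable adapters:

#Create a vSwitch with Switch Embedded Teaming across two RDMA capable pNICs
New-VMSwitch -Name "SETSwitch" -NetAdapterName "NODE-A-S4P1", "NODE-A-S4P2" -EnableEmbeddedTeaming $true

#Add two host vNICs for SMB Direct and enable RDMA on them
Add-VMNetworkAdapter -SwitchName "SETSwitch" -Name SMB1 -ManagementOS
Add-VMNetworkAdapter -SwitchName "SETSwitch" -Name SMB2 -ManagementOS
Enable-NetAdapterRdma "vEthernet (SMB1)", "vEthernet (SMB2)"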

Unable to correctly configure Time Service on non PDC Domain Controller

Introduction

Around New Year, between the 31st of December 2016 and the 1st of January 2017, some ISPs had issues with their time service: it jumped 24 hours ahead. This caused all kinds of issues with online services, ranging from non-working digital TV to problems with the time service within companies. That meant some intervention time and temporarily switching the external reliable NTP time sources to another provider that didn’t show this behavior. Some services required a server reboot to sort things out, but then things were operational again. However, it became clear we still had a lingering issue afterwards, as we were unable to correctly configure the Time Service on a non-PDC domain controller.

Unable to correctly configure Time Service on non PDC Domain Controller

A few days later we still had one domain, which happened to be 100% virtualized, with issues. As it turned out, the second domain controller, which did not hold the PDC role, wasn’t syncing with the PDC, no matter what we tried to get it to do so. If you want to find out how to do this properly for virtualized environments, I refer you to a blog post by Ben Armstrong, Time Synchronization in Hyper-V, and one by fellow MVP Kevin Green, Hyper V Time Synchronization on a Windows Based Network.
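
Part of that guidance for virtualized DCs revolves around the Hyper-V Time Synchronization integration service. As a minimal sketch, this is how you check it on the Hyper-V host and disable it if that is the route your design calls for (the VM name is a placeholder):

#See whether the Time Synchronization integration service is enabled for the DC's VM
Get-VMIntegrationService -VMName "DC02" -Name "Time Synchronization"

#Disable it if the DC should rely solely on the domain hierarchy / NTP
Disable-VMIntegrationService -VMName "DC02" -Name "Time Synchronization"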

But no matter what I did, the DC kept getting the wrong date. I could configure it to refer to the PDC as much as I wanted; nothing helped. It also kept saying the source for the time was the local CMOS clock (w32tm /query /source). I also kept getting an error that we’re normally able to fix by configuring the time service correctly.
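
For reference, this is roughly what that normal configuration looks like on a non-PDC DC that should simply follow the domain hierarchy (run from an elevated prompt):

#Point the DC at the domain hierarchy and restart the time service
w32tm /config /syncfromflags:domhier /update
net stop w32time
net start w32time

#Force a resync and verify the source is now the PDC, not the local CMOS clock
w32tm /resync /rediscover
w32tm /query /source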


Another troubleshooting path

The IT universe was not aligned to let me succeed. So that’s when you quit … for a coffee break. You relax a bit, look out of the window whilst sipping from your coffee. After that you dive back in.

I dove into the registry settings for the Windows Time service under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time, both on a functional DC in my lab and on the problematic DC in the production domain. I started comparing the settings and it all seemed to be in order, except for one serious issue with the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Security key on the problematic DC.

Trying to open that key greeted me with the following error:

Error Opening key

Security cannot be opened. An error is preventing this key from being opened. Details: The system cannot find the file specified.


That key was empty. Not good!


I exported the entire W32Time registry key and the Security key as a backup for good measure on the problematic DC, grabbed an export of the Security key from the working DC (any functional domain-joined server will do) and imported that into the problematic DC. The next step was to restart the time service, but that wasn’t enough, or I was too impatient. So finally I restarted the DC, and after 10 minutes I got the result I needed …
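
In command form that boils down to something like the following sketch; the export paths are just examples:

#On a healthy domain-joined server: export the Security subkey
reg export "HKLM\SYSTEM\CurrentControlSet\Services\W32Time\Security" C:\Temp\W32TimeSecurity.reg /y

#On the problematic DC: back up the full W32Time key first, then import the healthy Security key
reg export "HKLM\SYSTEM\CurrentControlSet\Services\W32Time" C:\Temp\W32Time-Backup.reg /y
reg import C:\Temp\W32TimeSecurity.reg

#Restart the time service (in this case a full reboot of the DC was what finally did it)
Restart-Service w32time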


Problem solved 🙂