Attending the Veeam Vanguard Summit 2019

Attending the Veeam Vanguard Summit 2019

I am dedicating the week of the 14th -18th of October to attending the Veeam Vanguard Summit 2019. As a Veeam Vanguard, I am quite lucky to be invited to this event. I will meet fellow technologists, architects, and Vanguards from all over the world. This is a unique opportunity to learn with and from each other.

Attending the Veeam Vanguard Summit 2019
Veeam Vanguard Summit 2019

From Veeam there will be Product Strategists, Technical Analysts, Technologists, Research & Development team members as well as social media and marketing experts.

There is quite an impressive assembly of intellect, expertise, and knowledge. It is great to be in such company as there is much to learn and understand in the business of data protection and management. I sometimes feel out of my depth and comfort zone at this event. That is normal, For one, Veeam solutions and efforts span the complete breadth, depth, and width of the IT industry. On top of that, I am there to learn and then it is actually great that I am not the smartest one in the room while discussing a given subject.

Anyway, soon I will travel to Prague and make the most of my time there. As always I am looking forward to seeing many buddies and friends. And I get to meet the new Veeam team member and the new Veeam vanguards. This week will be one filled with technology and the business of data management discussions from breakfast to dinner. I love it. Thank you Veeam for inviting me over, I appreciate the opportunity!

Licensed Replay Manager Node Reports being unlicensed

Licensed Replay Manager Node Reports being unlicensed

I was doing a hardware refresh on a bunch of Hyper-V clusters. This meant deploying many new DELL PowerEdge R740 servers. In this scenario, we leverage SC Series SC7020 AFA arrays. These come with Replay Manager software which we use for the hardware VSS provider. On one of the replaced nodes, we ran into an annoying issue. Annoying in the fact that the licensed Replay Manager Node reports being unlicensed in the node’s application event log. The application consistent replays do work on that node. But we always get the following error in the application event log:

Product is not licensed. Use Replay Manager Explorer ‘Configure Server’ or  PowerShell command ‘Add-RMLicenseInfo’ to activate product license.

Product is not licensed. Use Replay Manager Explorer 'Configure Server' or  PowerShell command 'Add-RMLicenseInfo' to activate product license.

On the Replay Manager Explorer, we just see that everything is fine and licensed. Via the GUI or via PowerShell we could not find a way to “re-license” an already installed server node.

What we tried but did not help

This is not a great situation the be in, therefore we need to fix it. First of all we removed the problematic node from Replay Manager explorer and tried to re-add it. That did not help to be able to relicense it. Uninstalling the service on the problematic node also did not work. Doing both didn’t fix it either. We need another approach.

The fix

The trick to fixing the licensed Replay Manager Node reports being unlicensed is as follows. Stop the “Dell Storage Replay Manager Service” service.

Delete (or rename if you want to be careful) the Compellent folder under C:\ProgramData

Restart the “Dell Storage Replay Manager Service” service. As a result you will see the folder and the files inside being regenerated. Wait until the temp files (ReplayManager.db-shm and ReplayManager.db-wal) of this process are gone.

Open up Replay Manager Explorer or relaunch it for good measure if still open. Connect to the problematic node. Navigate to “Configure Server” On the license tab it reports that it is unlicensed. Now enter the license code and request confirmation (Activate via Internet) or Activate via phone.

The node is now licensed again.

The node is licensed again. The system needs to be configured.

The image above shows the node is licensed again. You now need to configure the system again because that info is lost. For that, enter the username and password for your SC Arrays and add the correct node.

We now test creating a replay! Most importantly, we check the node’s application event log. The error Product is not licensed. Use Replay Manager Explorer ‘Configure Server’ or  PowerShell command ‘Add-RMLicenseInfo’ to activate product license. has gone!

We only see the 3 informational entries (prepared, committed, successful) associated with a successful and completed replay.

Above all, I hope this helps others who run into this.

DELL released Replay Manager 8.0

DELL Released Replay Manager 8.0

On September the 4th 2019 DELL released Replay Manager 8.0 for Microsoft Servers. This brings us official Windows Server 2019 support. You can download it here: The Dell Replay Manager Version 8.0 Administrator’s Guide and release notes are here:

Replay Manager was released early September 2019

I have Replay Manager 8.0 up and running in the lab and in production. The upgrade went fast and easy. And everything kept working as expected. The good news is that Replay Manager 8 service is compatible with the Replay Manager 7.8 Manager and vice versa. This means there was no rush to upgrade everything asap. We could do smoke testing at a releaxed pace before we upgraded all hosts.

Replay Manager 8.0 adds official support for Windows Server 2019 and Exchange Server 2019. I have tested Windows Server 2019 with Replay Manager 7.8 as well for many months. I was taking snapshots every 30 minutes for months with very few issues actually. But no we had oficial support. Replay Manager 8.0 also introduces support for SCOS 7.4.

No improvements with Hyper-V backups

Now, we don’t have SCOS 7.4 running yet. This will take another few weeks to go into general available status. But for now, with both Windows Server 2016 and 2019 host we noticed the following dissapointing behaviour with Hyper-V workloads. Replay Manager 8 still acts as a Windows Server 2012 R2 requestor (backup software) and hence isn’t as fast and effectice as it could be. I actually do not expect SCOS version 7.4 to make a difference in this. If you leverage the hardware VSS provider with backup software that does support Windows Server 2016/2019 backup mechanisms for Hyper-V this is not an issue. For that I mostly leverage Veeam Backup & Replication.

It is a missed opportunity unless I am missing something here that after so many years Replay Manager requestor still does not support Windows Server 2016/2019 native Hyper-V backups capabilities. And once again, I didn’t even mention the fact that the indivual Hyper-V VMs backups need modernization in Replay Manager to deal with VM mobility. See I also won’t mention Live Volumes as I did then. Now as I leverage Replay Manager as a secondary backup method, not a primary I can live with this. But it could be so much better. I really need a chat with the PM for Replay Manager. Maybe at Dell Technologies World 2020 if I can find a sponsor for the long haul flight.

Fixing slow RoCE RDMA performance with WinOF-2 to WinOF

Fixing slow RoCE RDMA performance with WinOF-2 to WinOF

In this post, you join me on my quest of fixing slow RoCE RDMA performance with traffic initiated on WinOF-2 system to or from a WinOF system. I was working on the migration to new hardware for Hyper-V clusters. This is in one of the locations where I transition from 10/40Gbps to 25/50/100Gbps. The older nodes had ConnectX-3 Pro cards for SMB 3 traffic. The new ones had ConnectX-4 Lx cards. The existing network fabric has been configured with DCB for RoCE and has been working well for quite a while. Everything was being tested before the actual migrations. That’s when we came across an issue.

All the RDMA traffic workes very well except when initiated from the new servers. So ConnectX-3 Pro to ConnectX-3 Pro works fine. So does ConnectX-4 to ConnectX-4. When initiating traffic (send/retrieve) from a ConnectX-3 Pro server to a ConnectX-4 host it is also working fine.

However, when I initiated traffic (send/retrieve) from a ConnectX-4 server to a ConnectX-3 host the performance was abysmally slow. 2.5-3.5 MB/s is really bad. This need investigation. It smelled a bit like an MTU size issue. As if one side was configured wrong.

The suspects

As things are bad only when traffic was initiated from a ConnectX-4 host I suspected a misconfiguration. But that all checked out. We could reproduce it on every ConnectX-4 to any ConnectX-3.

The network fabric is configured to allow jumbo frames end to end. All the Mellanox NICs have an MTU size of 9014 and the RoCE frame size is set to Automatic. This has worked fine for many years and is a validated setup.

If you read up on how MTU sizes are handled by Mellanox you would expect this to work out well.

InfiniBand protocol Maximum Transmission Unit (MTU) defines several fix size MTU: 256, 512, 1024, 2048 or 4096 bytes.

The driver selects “active” MTU that is the largest value from the list above that is smaller than Eth MTU in the system (and takes in the account RoCE transport headers and CRC fields). So for example with default Ethernet MTU (1500 bytes) RoCE will use 1024 and with 4200 it will use 4096 as an “active MTU”. The “active_mtu” values can be checked with “ibv_devinfo”.

RoCE protocol exchanges “active_mtu” values and negotiates it between both ends. The minimum MTU will be used.

So what is going on here? I started researching a bit more and I found this in the release notes of the WinOF-2 2.20 driver.

Description: In RoCE, the maximum MTU of WinOF-2 (4k) is greater than the maximum MTU of WinOF (2k). As a result, when working with MTU greater than 2k, WinOF and WinOF-2 cannot operate together.

Well the ConnectX-3 Pro uses WinOF and the ConnectX-4 uses WinOF-2 so this is what they call a hint. But still, if you look at the mechanism for negotiating the RoCE frame size this should negotiate fine in our setup, right?

The Fix

Well, we tried fixing the RoCE frame size to 2048 on both ConnectX-3 Pro and ConnectX-4 NICs. We left the NIC MTU at 9014. That did not help at all. What did help in the end wat set the NIC MTU sizes to 1514 (default) and set the Max RoCE frame size to automatic again? With this setting, all scenarios in all direction do work as expected.

MTU size of 1514 to the rescue

Personally, I think the negotiation of the RoCE frame size, when set to automatic should figure this out correctly with the NIC MTU size set to 9014 as well. But I am happy I found a workaround where we can leverage RDMA during the migration, After that is done we can set the NIC MTU sizes back to 9014 when there is no more need for ConnectX-4 / ConnectX3 RDMA traffic.


The above behavior was unexpected based on the documentation. I argue the RoCE frame size negotiation should work thing out correctly in each direction. Maybe a fix will appear in a new WinOF-2 driver. The good news is that after checking many permutations I found came up with a fix. Luckily, we can leave jumbo frames configured across the network fabric. Reverting the NIC interfaces MTU size to 1514 on both side and leaving the RoCE frame size on auto on both sides works fine. So that is all that’s needed for fixing slow RoCE RDMA performance with WinOF-2 to WinOF. This will do just fine during the co-existence period and we keep it in mind whenever we encounter this behavior again.