November 2019 updates caused Windows Server 2012 reboot loop

Introduction

Some laggards that still have some Windows Server 2012 virtual machines running got a bit of a nasty surprise last weekend. A number of them went into a boot loop. Apparently the November 2019 updates caused Windows Server 2012 reboot loop. Well, we have dealt with update issues before like here in Quick Fix Publish : VM won’t boot after October 2017 Updates for Windows Server 2016 and Windows 10 (KB4041691). No need to panic.

Symptoms

Not all Windows Server 2012 virtual machines were affected. We did not see any issues with Windows Server 2012 R2, 2016 or 2019. Well the symptom is a reboot loop and below I have a sequence of what it looked like visually.

November 2019 updates caused Windows Server 2012 reboot loop
Restarting
Looking good so far …
OK, stage 2 of 4 … stlil seems OK.
November 2019 updates caused Windows Server 2012 reboot loop
Could still be OK … but no, its starts again in an endless loop.

So what caused this Windows Server 2012 reboot loop?

Fix

We turned of the virtual machines on which the November 2019 updates caused Windows Server 2012 reboot loop. We started them up again in “Safe mode” which completed successfully. Finally, we then did a normal reboot and that completed as well. All updates had been applied bar one. That was the 2019-11 Servicing Stack Update for Windows Server 2012 (KB4523208).

We manually installed it via Windows update and that succeeded.

When reading the information about this update https://www.catalog.update.microsoft.com/ScopedViewInline.aspx?updateid=5fa2a68f-e7cd-43c7-a48a-5e080472cb77 its states it need to be installed exclusively.

Maybe that was the root cause. It got deployed via WSUS with the other updates for November 2019.

Anyway, all is well now. I remind you that we have at least 2 ways of restoring those virtual machines. In case we had not been able to fix them. Have known good backups and a way to restore them people.

Conclusion

We fixed the issue and patched those servers completely. So all is well now. Except for the fact that they now , once more, have been urged to get of this operating system asap. You don’t go more than N-2 behind. It incurs operational overhead and risks. They did not test updates against this old server VM and got bitten. Technology debt without a plan is never worth it.

Licensed Replay Manager Node Reports being unlicensed

Licensed Replay Manager Node Reports being unlicensed

I was doing a hardware refresh on a bunch of Hyper-V clusters. This meant deploying many new DELL PowerEdge R740 servers. In this scenario, we leverage SC Series SC7020 AFA arrays. These come with Replay Manager software which we use for the hardware VSS provider. On one of the replaced nodes, we ran into an annoying issue. Annoying in the fact that the licensed Replay Manager Node reports being unlicensed in the node’s application event log. The application consistent replays do work on that node. But we always get the following error in the application event log:

Product is not licensed. Use Replay Manager Explorer ‘Configure Server’ or  PowerShell command ‘Add-RMLicenseInfo’ to activate product license.

Product is not licensed. Use Replay Manager Explorer 'Configure Server' or  PowerShell command 'Add-RMLicenseInfo' to activate product license.

On the Replay Manager Explorer, we just see that everything is fine and licensed. Via the GUI or via PowerShell we could not find a way to “re-license” an already installed server node.

What we tried but did not help

This is not a great situation the be in, therefore we need to fix it. First of all we removed the problematic node from Replay Manager explorer and tried to re-add it. That did not help to be able to relicense it. Uninstalling the service on the problematic node also did not work. Doing both didn’t fix it either. We need another approach.

The fix

The trick to fixing the licensed Replay Manager Node reports being unlicensed is as follows. Stop the “Dell Storage Replay Manager Service” service.

Delete (or rename if you want to be careful) the Compellent folder under C:\ProgramData

Restart the “Dell Storage Replay Manager Service” service. As a result you will see the folder and the files inside being regenerated. Wait until the temp files (ReplayManager.db-shm and ReplayManager.db-wal) of this process are gone.

Open up Replay Manager Explorer or relaunch it for good measure if still open. Connect to the problematic node. Navigate to “Configure Server” On the license tab it reports that it is unlicensed. Now enter the license code and request confirmation (Activate via Internet) or Activate via phone.

The node is now licensed again.

The node is licensed again. The system needs to be configured.

The image above shows the node is licensed again. You now need to configure the system again because that info is lost. For that, enter the username and password for your SC Arrays and add the correct node.

We now test creating a replay! Most importantly, we check the node’s application event log. The error Product is not licensed. Use Replay Manager Explorer ‘Configure Server’ or  PowerShell command ‘Add-RMLicenseInfo’ to activate product license. has gone!

We only see the 3 informational entries (prepared, committed, successful) associated with a successful and completed replay.

Above all, I hope this helps others who run into this.

Force Mellanox ConnectX-4 Lx 25Gbps to 10Gbps speed

Introduction

As you might remember I wrote a blog post about SFP+ and SFP28 compatibility. In this i discuss future proofing your network investments and not having to upgrade everything all at once. One example is that buying 25Gbps NICs when your main network infrastructure is still on 10Gbps is not an issue. 25Gbps normally handles 10Gbps well so you don’t have do replace all parts in the fabric at the same time but you can start with either the network fabric or the server NICs. It’s a way of future proofing the investments you make today.

When installing Mellanox ConnectX-4 Lx 25Gbps NICs in a bunch of servers we hit an issue when connected them to the DELLEMC N4000 10Gbps switches. The intent is to replace these with 25/50/100Gbps in the future.

The links did not come up.

The links did not come up. The switch ports are normally forced 10 Gbps in our setups so we check that. The speed was indeed set fix to 10Gbps. When changing that to auto-negotiate the link would come up at 1Gbps.

Naturally you check everything from cabling to used transceivers (BCM84754 on the switches) but that all checked out. We also check the firmware on the switches to determine if they were up to date and perhaps a new version fixed a known issue related to this. But no hardware wise everything was up to date on the switches and on the NICs.

Note that these links worked fine when used with 10 Gbps cards like the ConnectX-3 Pro. The DELL branded transceivers on the switches were BCM84754 (Broadcom)

The fix: Force Mellanox ConnectX-4 Lx 25Gbps to 10Gbps speed

I do not need to tell you that when you want 10Gbps getting 1Gbps doesn’t fly well. The fix was easy enough. We put the switch ports back to 10Gbps fixed speed. Auto-negotiate doesn’t deliver. No worries we fix the ports anyway. We then used mlc5cmd.exe Mellanox tool to change the NIC ports from auto-negotiate to fixed.

On hosts with Mellanox Connect-X4 NICs you open an elevated command prompt.

Navigate to C:\Program Files\Mellanox\MLNX_WinOF2\Management Tools. Run the below command to check the current link speed.

mlx5cmd.exe -LinkSpeed -Name “MyNicName ” -Query

Note 10 and 25 Gbps are supported, so it’s autonegotiate.

We force the link speed to 10Gbps:

mlx5cmd.exe -LinkSpeed -Name “MyNicName ” -Set 10

Link speed is forced to 10Gbps

The link comes up at 10Gbps

Likewise you can force the link to 25Gbps. If you want to change it back to the default you can force the link speed to auto-negotiate again.

mlx5cmd.exe -LinkSpeed -Name “MyNicName” -Set 0

See https://community.mellanox.com/s/article/mlx5cmd-exe for more information on this tool.

Do note that the switch port also needs to be set to 10Gbps fixed. As you can see below the command will notify you when those are still on auto.

The change was done but still no uplink when the switch port isn’t fixed to 10Gbps.

Conclusion

So my statement hold true the path to 25/50/100Gbps is one you can do step by step with future proofing. You might run into some issues but these are normally fixable. I have shared with you how to fix failing or wrong speed negotiations on 25 Gbps RDMA NICs (Mellanox ConnectX-4 Lx) when connecting to 10Gbps ports. I’m pretty sure the same holds true for other models. I have also had cards where things work out of the box but don’t give up when you hit an issue. I hope this helps some of you out there.

Replay Manager Configure Server There was an error loading the configuration information.

Replay Manager Configure Server There was an error loading the configuration information

When Replacing a bunch of servers with new DELL R740s (Hyper-V clusters, File clusters, backup targets etc.) I ran into an issue with the DELL Replay Manager software. The servers leverage multiple DELL EMC Storage Center SANs. The have multiple ones for Scale-Out, Redundancy, Failover, Mutliple Datacenters, …

With some of the servers I noticed that the loading of the information was slow, while most others were just fine. But with 4 out of all servers the connection never actually happens. The connectivity was just fine, and test connectivity confirmed this. As this had zero impact on the actual replays that were scheduled this went unnoticed. But when you are adding and removing servers you might need to dive into Server Configuration and that were after a minute we got the below error thrown

Configure Server
There was an error loading the configuration information.
Error Message:
The request channel timed out while waiting for a reply after 00:01:00. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a longer timeout
.

Notice that the GUI says connecting to our demo server82… but unless you need info from the server you might still see the info it get’s from the Storage Center SAN itself.

This is quite annoying as we need to be in there. So how to fix this. I have some ideas as I know this error from .NET WCF but in this case I was looking for an easier way out especially when I don’t have all the information about this 3rd party application. The good news is that it is easily fixed.

Fixing this

Replay manager stores the replays and metadata info about those replays it creates on the SAN itself. That’s why you can still see those even when you actually ca’t connect to the server. The config of servers you add and use in Replay Manager is stored locally where the client lived. This files is portable, just copy it form your profile and had it to a colleague. No big deal.

Now the server configuration you do from the Replay Manager GUI tool itself is stored on each and any server where you have the Replay Manager service installed. You will find that file, ReplayManager.config.xml, under C:\ProgramData\Compellent\ReplayManager.

Make a copy to be sure and edit the original using a text editor that has elevated permissions so you can save your changes. In the example file of one server below note that server82 (green) has 2 old Compellent SC entries (yellow) that are no longer in service. One SAN it cannot find won’t exceed the time-out windows, but it does slow the GUI down significantly. 2 or more phantom old SAN slow things down looking for them and you get the time out error.

The fix is easy, cut the key values out of the file and save the file. You then restart the Replay manager service on that server via an elevated command prompt (or use the GUI):
net start ReplayManager
net stop ReplayManager

Restart the Replay manager service on the server you need to manager before connecting to the server again with the Replay manager client tool GUI.

When you now close and launch the Replay Manager GUI and connect to the server things will be a lot faster and certainly wont time out anymore.

Conclusion

Maintain your environment. Try to remove and decommissioned storage center SAN from your server configurations in Replay Manager before you take it off line an dispose of it.I f you forget you and run into slow loading Replay Manager GUI or hit a time out. Don’t panic. The Replay manager is actually quite solid and recoverable. We have shown you how to fix this by editing the ReplayManager.config.xml file on the server you need to connect to but can’t.You basically just remove the references to the no longer existing storage centers I hope it helps some of you out there if you run into this. Feel free to reach out in the comments if you have any questions.