Quick Assist: CredSSP encryption oracle remediation Error

In the past 12 hours I've seen the first mentions of people no longer being able to connect over RDP via an RD Gateway to their clients or servers, and I also got a call asking for help with such an issue. The moment I saw the error message it hit home that this was the known and documented issue with CredSSP encryption oracle remediation, which is both preventable and fixable.

The person trying to connect over RD Gateway gets the following message:
[Window Title]
Remote Desktop Connection
[Content]
An authentication error has occurred.
The function requested is not supported
Remote computer: target.domain.com
This could be due to CredSSP encryption oracle remediation.
For more information, see
https://go.microsoft.com/fwlink/?linkid=866660
[OK]


Follow that link and it will tell you all you need to know to fix it and how to avoid it.
A remote code execution vulnerability (CVE-2018-0886) exists in unpatched versions of CredSSP. This issue was addressed by correcting how CredSSP validates requests during the authentication process.

The initial March 13, 2018, release updates the CredSSP authentication protocol and the Remote Desktop clients for all affected platforms.
Mitigation consists of installing the update on all eligible client and server operating systems and then using included Group Policy settings or registry-based equivalents to manage the setting options on the client and server computers. We recommend that administrators apply the policy and set it to  “Force updated clients” or “Mitigated” on client and server computers as soon as possible.  These changes will require a reboot of the affected systems. Pay close attention to Group Policy or registry settings pairs that result in “Blocked” interactions between clients and servers in the compatibility table later in this article.

April 17, 2018:
The Remote Desktop Client (RDP) update in KB 4093120 will enhance the error message that is presented when an updated client fails to connect to a server that has not been updated.

May 8, 2018:
An update to change the default setting from Vulnerable to Mitigated (KB4103723 for W2K16 servers, KB4103727 for Windows 10 clients). Don't forget the vulnerability also exists for W2K12(R2) and lower, as well as the equivalent clients.

The key here is that the May updates change the default for the new policy setting from Vulnerable to Mitigated.

Microsoft is releasing new Windows security updates to address this CVE on May 8, 2018. The updates released in March did not enforce the new version of the Credential Security Support Provider protocol. These security updates do make the new version mandatory. For more information see “CredSSP updates for CVE-2018-0886” located at https://support.microsoft.com/en-us/help/4093492.

This can result in mismatches between systems at different patch levels, which is why it's now a more widespread issue. Looking at the table in the article and the documented errors, it's clear enough there was a mismatch. It was also clear how to fix it: patch all systems and make sure the settings are consistent. Use GPO or edit the registry settings to do so. Automation is key here. Uninstalling the patch works but is not a good idea. This vulnerability is serious.


Now, Microsoft did warn about this change. You can even read about it on the PFE blog https://blogs.technet.microsoft.com/askpfeplat/tag/encryption-oracle-remediation/. Nevertheless, many people seem to have been bitten by this one. I know it's hard to keep up with everything that moves at the speed of light in IT, but this is one I was on top of. That is because the fix is for a remote vulnerability in RDS. That's a big deal and not one I was willing to let slide. You need to roll out the updates, configure your policy and make sure you're secured. The alternative (rolling back the updates and allowing vulnerable connections) is not acceptable: you'd remain vulnerable to a known and fixable exploit. TAKE YOUR MEDICINE! Read the links above for detailed guidance on how to do this. Set your policy on both sides to Mitigated. You don't need to force updated clients to fix the issue this way, and you can patch your servers first, followed by your clients. Do note the tips given on doing this in the PFE blog:

Note: Ensure that you update the Group Policy Central Store (or, if you're not using a Central Store, use a device with the patch applied when editing Group Policy) with the latest CredSSP.admx and CredSSP.adml. These files contain the latest configuration options for these settings, as seen below.
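If you use a Central Store, refreshing those files from an already patched machine can look something like this minimal PowerShell sketch (the domain name and the en-US language folder are placeholders, adjust to your environment):

# Copy the updated CredSSP ADMX/ADML from a patched machine to the Central Store.
# Replace contoso.com with your own domain.
$centralStore = "\\contoso.com\SYSVOL\contoso.com\Policies\PolicyDefinitions"
Copy-Item -Path "C:\Windows\PolicyDefinitions\CredSsp.admx" -Destination $centralStore -Force
Copy-Item -Path "C:\Windows\PolicyDefinitions\en-US\CredSsp.adml" -Destination (Join-Path $centralStore "en-US") -Force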

Registry
Path: HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\System\CredSSP\Parameters
Value: AllowEncryptionOracle
Data type: DWORD
Reboot required: Yes

Here are the registry settings you need to set to make sure connectivity is restored:

Everything patched: 0 => when everything is patched, including 3rd party CredSSP clients, you can use "Force updated clients".
Server patched but not all clients: 1 => use "Mitigated"; you'll be as secure as possible without blocking people. Alternatively you can use 2 ("Vulnerable"), but that is riskier, so avoid it if you can.

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\CredSSP\Parameters]
“AllowEncryptionOracle”=dword:00000001
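If you prefer PowerShell over importing a .reg file, the equivalent looks roughly like this (1 corresponds to "Mitigated", 0 to "Force updated clients", 2 to "Vulnerable"; a reboot is still required):

# Create the CredSSP\Parameters key if needed and set AllowEncryptionOracle.
$path = "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\CredSSP\Parameters"
if (-not (Test-Path $path)) { New-Item -Path $path -Force | Out-Null }
Set-ItemProperty -Path $path -Name "AllowEncryptionOracle" -Value 1 -Type DWord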

So, check your clients and servers, both on-premises and in the cloud, to make sure you're protected and have as few RDS connectivity issues as possible. Don't forget about 3rd party clients that need updates too, if you have those! Don't panic and carry on.

Windows Admin Center Supports Windows Server 2016 HCI (S2D) RS1 with KB4093120

Excellent news on the manageability front for S2D. The April update for Windows Admin Center ('Project Honolulu'), in combination with KB4093120, adds support for Hyper-Converged Infrastructure deployments running on Windows Server 2016 (RS1). Do note that this is still in preview.

To get started you'll need to download the latest version of Windows Admin Center and install the April 17th, 2018 (2018-04) Cumulative Update for Windows Server 2016, KB4093120, on every node in a Storage Spaces Direct cluster you want to manage. The Hyper-Converged Infrastructure experience depends on new management APIs that are added in this update.
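If you want to verify the update is present everywhere before you connect Windows Admin Center, a quick check over the cluster nodes might look like this (the cluster name is a placeholder):

# Check every node of the S2D cluster for KB4093120.
$nodes = Get-ClusterNode -Cluster "S2DCluster" | Select-Object -ExpandProperty Name
foreach ($node in $nodes) {
    $kb = Get-HotFix -Id "KB4093120" -ComputerName $node -ErrorAction SilentlyContinue
    if ($kb) { "$node : KB4093120 installed on $($kb.InstalledOn)" }
    else     { "$node : KB4093120 NOT installed" }
}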

Don’t forget that there is a bundle of information on Windows Admin Center to be found in the documentation specifically around HCI.

ReFS Supported Deployment Scenarios Updated

Introduction

Some support statements for ReFS have been updated recently. These reflect well over a year of testing and feedback to Microsoft by me, fellow MVPs and others. For all practical purposes I'm talking about ReFSv3, which was introduced with Windows Server 2016. Read up on it because that's what I'm discussing here: Resilient File System (ReFS) overview

As many of you know, the ReFS supported storage deployment options have fluctuated a bit. At one point support was limited to Storage Spaces and standalone disks only. That meant no RAID controllers, no FC or iSCSI LUNs via a SAN, whether that was a high end one or an entry level one that you normally only use for backup purposes.

I was never really satisfied with the reasons why, and I kept being a passionate advocate for a decent explanation, as tying a file system with the capabilities and potential of ReFS to almost a single storage solution (S2D, and yes, that's a very good HCI offering) isn't going to help proliferate the goodness of ReFS around the globe.

I was not alone, and many others, amongst them fellow MVPs Anton Gostev (Senior Vice President, Product Management at Veeam and an industry heavyweight when it comes to credibility and technical skill), Carsten Rachfahl and Jan Kappen (both at Rachfahl IT-Solutions), were arguing the case for broader ReFS support. Last week we got the news that the ReFS deployment documentation had been revised. Guess what? Progress! A big thank you to Andrew Hansen for taking the time to hear us plead our case and listen to our testing results and passionate feedback. He picked up the ball, ran with it and delivered! Let's take a look.

ReFS Storage Deployment Options

Storage Spaces Direct

Deploying ReFS on Storage Spaces Direct is recommended for virtualized workloads or network-attached storage. This is well known and is used for Hyper-Converged Infrastructure and Converged (SOFS) solutions (Hyper-V, IIS, SQL, User Profile Disks and even archival or backup targets). You can deploy it with simple, mirrored (2-way or 3-way), parity or mirror-accelerated parity volumes.
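For reference, creating such a volume on S2D is typically done with New-Volume; a hedged example for a mirror-accelerated parity volume (the pool wildcard, tier names and sizes are common defaults and purely illustrative):

# Create a mirror-accelerated parity ReFS volume on Storage Spaces Direct.
New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "Volume01" -FileSystem CSVFS_ReFS -StorageTierFriendlyNames Performance, Capacity -StorageTierSizes 1TB, 9TB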

Storage Spaces

Storage Spaces supports local non-removable direct-attached disks via BusTypes SATA, SAS, NVMe, or attached via an HBA (aka a RAID controller in pass-through mode). You can deploy it with simple, mirrored (2-way or 3-way) or parity volumes. Do note that this can be both non-shared and shared Storage Spaces (shared SAS enclosures). This is the highly available Storage Spaces solution we had before Windows Server 2016 added S2D.

Basic disks

Deploying ReFS on basic disks is best suited for applications that implement their own software resiliency and availability solutions. Applications that introduce their own resiliency and availability software solutions can leverage integrity-streams, block-cloning, and the ability to scale and support large data sets. A poster child for this use case is an Exchange DAG.

Now it is important to note that basic disks with ReFS are supported with local non-removable direct-attached disks via BusTypes SATA, SAS, NVMe, or RAID. So yes, you can have RAID 1, 5, 6, 10 and make the storage redundant. Now, be smart, ReFS is great but it is not magic. If your workload requires redundancy and high availability you should provide it. This is no different from when you use NTFS. When you have shared PCI RAID controllers (which can be redundant, like in a DELL VRTX), these can be used as well to create highly available deployments with shared storage.
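Putting ReFS on such a basic (hardware RAID backed) disk is plain vanilla disk management; a minimal sketch (the disk number, drive letter and label are assumptions for the example):

# Initialize a raw disk, create a partition and format it with ReFS.
# A 64 KB allocation unit size is a common choice for backup target volumes.
Get-Disk -Number 2 |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -UseMaximumSize -DriveLetter R |
    Format-Volume -FileSystem ReFS -NewFileSystemLabel "BackupTarget" -AllocationUnitSize 65536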

SAN Storage

You can also use ReFS with a SAN over FC or iSCSI; normally those are always configured with some form of storage redundancy. You can consume the ReFS SAN storage on stand-alone, member or clustered servers for high availability, as long as you use that storage for supported use cases. For example, it is and remains not supported to put knowledge worker data on SOFS shares, no matter what the underlying storage for the ReFS or NTFS volumes is. For backups this can be leveraged to build some very capable solutions.

What were the concerns that made ReFS Support so limited at a given point in time?

Well, one of them was confusion and concern around how data gets flushed and persisted with non-Storage Spaces and simple disks. A valid concern, but one you have with any file system, so any storage array or controller needs to handle this well. As it turns out, any decent piece of storage hardware/controller that's on the Microsoft Hardware Compatibility List and is certified does its job well enough to guarantee this happens correctly. So, any certified OEM SAN, from entry level ones to high end enterprise grade gear, is supported. Just like any good (certified) RAID controller. Those are backed with battery-backed caches that can survive downtime for days to many weeks. You just pick the one that fits your needs, use case and budget from the options you have. That can be S2D, a SAN, a RAID controller, or even basic directly attached disks.

My take on things

Why do I like the new supported options? Well, because I have been testing them for backup targets, both highly available ones and non-highly available ones. I get the benefits of ReFS that can be leveraged by backup software (Veeam Backup & Replication 9.5 for example) with better performance and data protection on more types of storage than just S2D. I like to have options and choices when designing a solution.

It is important to note one thing: when you do not use ReFS in combination with Storage Spaces (S2D, shared Storage Spaces or "stand-alone" Storage Spaces) with some form of data redundancy (2-way or 3-way mirror, parity, mirror-accelerated parity), you will not have the built-in capability to repair data corruption that can occur while data sits on disk (bit rot) by leveraging the redundant copies in Storage Spaces. That only comes when ReFS is combined with redundant Storage Spaces, not with simple Storage Spaces or any other storage array, redundant or not. The combination of ReFS with redundant Storage Spaces offers this capability and it is one of its selling points.
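As a side note, you can inspect and toggle ReFS integrity streams per file or folder; a small sketch (the path is purely illustrative), keeping in mind that automatic repair from redundant copies only happens when the volume sits on mirrored or parity Storage Spaces:

# Show and enable integrity streams (checksums) on a folder on a ReFS volume.
Get-FileIntegrity -FileName "D:\Backups"
Set-FileIntegrity -FileName "D:\Backups" -Enable $true -Enforce $true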

Other than that, the above ReFS storage deployment options let you leverage the benefits ReFS has to offer and yes, for some use cases that will be preferred over NTFS. But don't think NTFS should now only be used for the OS and such. That's not the case. It is and remains very much the dominant file system for Windows. It's just that now we get to leverage the goodness of ReFS for suitable scenarios with a lot more storage deployment options. There is a reason for that. For example, if you are going to do Hyper-V with a SAN, the supported file system is NTFS, not ReFS. Mind you, ReFS works, but it's not supported. I have tested this and while it works, one of the concerns is the redirected IO traffic it incurs. With S2D the network fabric to deal with this is there by design: SMB Direct (RDMA) over 10 Gbps or better. With a SAN that's not necessarily so, and as a result the network leveraged by CSV traffic might take a beating. The network traffic patterns with ReFS versus NTFS on SAN-based CSVs also differ from what you are used to with NTFS when it comes to owner and non-owner nodes. While I can make things work, I must weigh the benefits against the risk of being unsupported. On a good SAN with ODX support that's not worth the risk. Might this ever change? Maybe, but for now that's it.
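If you want to see whether, and why, a CSV is running in redirected access mode, a quick check looks like this:

# Show the per-node CSV state, including redirected IO and the reason for it.
Get-ClusterSharedVolumeState | Select-Object Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason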

That said, when I design my ReFS LUNs and fabric well with a SAN and use them for a supported use case like backup targets, I am supported and I get to leverage the benefits of ReFS as it fits the use case very well (DPM, Veeam).

A side note on mirror accelerated parity

Mirror-accelerated parity is only supported with S2D. That's the one thing, in regard to backup and archive targets, that I want to keep testing (see Hyper-V Amigos Showcast Episode 12 – ReFS and Backup) and asking Microsoft to support at least on non-shared Storage Spaces. I know shared Storage Spaces are being deprecated, no worries. That would make for some great, budget-friendly archival and backup targets, because you get bit rot protection thanks to the combination of ReFS with redundant Storage Spaces. I even have some ideas on how to add tuning capabilities to the mirror / parity movement of data based on data age etc. I can dream, right?

Conclusion

To all the naysayers, the ones that bashed me when I discussed the options for and the potential of ReFSv3 outside of S2D: take note, this is where we are today.


And I like it. I like the options ReFSv3 offers with a variety of storage solutions to design and implement backup targets for many different needs and budgets. That's what I like, as I'm convinced that one-size-fits-all solutions are an illusion. Even at economies of scale and with commodity materials, understanding the context in which to design and implement a solution matters, as it allows you to choose the proper methods for the given needs when you genuinely understand the challenge.

If you need help with this, there are quite a number of highly skilled, experienced people with the right mindset to help you maximize your ROI and TCO in an effective and efficient way. Many of these are MVPs who have their own business or work for IT firms where customers are not milked like cattle but really do get high value services. Just reach out.

Latency kills

Introduction

I was investigating a very problematic Windows Server 2016 Hyper-V cluster. That cluster was just performing horribly. "Everything" was hanging, stalling or crashing, RHS.exe errors were flying around and WER dumps got created by the dozen. Things were extremely slow, up to the point functionality was just failing. The "fun" thing was that the cluster validation wizard, while slow, gave that cluster a big thumbs up and a supported status, as if all was well.

Prying around

Time to pry around a bit and see if we could find something wrong. We saw live migrations stall, fail, remain forever in pending or get stuck at a certain percentage, sometimes finally succeeding with ridiculous blackout times. We could not open virtual machine properties, or only very slowly. The Failover Cluster Manager GUI was highly unresponsive, but so was the Hyper-V Manager GUI and even PowerShell. Those were hanging at even loading the virtual machines or enumerating them with Get-VM. Everything was slow to the point it timed out or crashed. Restarting the services (Cluster, Hyper-V) didn't do anything, and restarting VMMS was super slow or just got stuck. It was a depressing sight for which people tended to blame Hyper-V / Microsoft.

As the title gives away, it was latency. Not just ordinarily high latency. Really bad latency. The kind of latency that kills. Extreme latency produces symptoms that are similar to bugs or corrupt components of roles and features. We have a tendency to look at those first in the event logs, and then we look at the network and its usual suspects (VMQ, SET, DCB). But nothing pointed to an issue that I could find.

So, storage maybe? Well, we did find one Hyper-V host in the cluster with one HBA port producing too many errors, so we disabled that FC port for testing. No joy; after a clean reboot of all nodes the Hyper-V cluster remained problematic. So, on to the storage array itself.
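As an aside, if you want a quick feel for the latency the hosts themselves perceive, sampling the physical disk counters will do; a hedged sketch (the 25 ms remark is a rule of thumb, not an official threshold):

# Sample average disk latency (seconds per read/write) on a Hyper-V host.
# Values consistently above ~0.025 (25 ms) on CSV disks are a red flag.
Get-Counter -Counter "\PhysicalDisk(*)\Avg. Disk sec/Read", "\PhysicalDisk(*)\Avg. Disk sec/Write" -SampleInterval 2 -MaxSamples 5 |
    ForEach-Object { $_.CounterSamples } |
    Sort-Object CookedValue -Descending |
    Select-Object -First 10 InstanceName, Path, CookedValue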

Well, holy smoke! On the two CSV volumes in that cluster we saw latencies so bad I could not even believe a single VM would boot. It actually made my appreciation for Hyper-V and clustering grow, as they managed to do at least a couple of things. With such latencies I would expect the services to just crash and call it a day.


The horrific latency on one of the CSV LUNs.

Looking at the logs we saw that the latencies occurred on the FC HBAs of the controllers. Each one was above 50 ms, peaking at 150-250 ms, with one huge peak at almost 500 ms. We saw this on all four HBAs.


The latency on one of the 4 FC HBAs on one of the controllers. Not a good day. All HBAs had high latencies like this.

The issues were not at the host level (host HBAs), nor even at the IOPS/bandwidth level of the storage itself. The latency for some reason was spiking. Further investigation led to the conclusion that the issue was related to synchronous replication going totally wrong. Moving the replication mode to asynchronous fixed that. We're now investigating why this happened and how to prevent it from happening again. But that's another story.


Latency on one of the 4 FC HBAs on one of the controllers after we fixed the issue.

Do not assume anything

So, there you go. Everything depends on everything else in some direct or indirect way. It's all connected and that, my friends, is why I'm a proponent of "service resilience engineering", where the responsible team owns the entire stack. That is how you can act fast.