When using file shares as backup targets you should leverage continuous available SMB 3 file shares

Introduction

When using file shares as backup targets you should leverage Continuous Available SMB 3 file shares. For now, at least. A while back Anton Gostev wrote a very interesting piece in his “The Word from Gostev”. It was about an issue that they saw with people using SMB 3 files shares as backup targets with Veeam Backup & Replication. To some it was a reason to cry wolf. But it’s a probably too little-known issue that can and a such might (will) occur. You need to be aware of it to make good decisions and give good advice.

I’m the business of building rock solid solutions that are highly available to continuous available. This means I’m always looking into the benefits and drawbacks of design choices. By that I mean I study, test and verify them as well. I don’t do “Paper Proof of Concepts”. Those are just border line fraud.

So, what’s going on and what can you do to mitigate the risk or avoid it all together?

Setting the scenario

Your backup software (in our case Veeam Backup & Recovery) running on Windows leverages an SMB 3 file share as a backup target. This could be a Windows Server file share but it doesn’t have to be. It could be a 3rd party appliance or storage array.

When using file shares as backup targets you should leverage Continuous Available SMB 3 file shares.

The SMB client

The client is the SMB 3 Client Microsoft delivers in the OS (version depends on the OS version). But this client is under control of Microsoft. Let’s face it the source in these scenarios is a Hyper-V host/cluster or a Windows SMB 3 Windows File share, clustered or not.

The SMB server

In regards to the target, i.e. the SMB Server you have a couple of possibilities. Microsoft or 3rd party.

If it’s a third-party SMB 3 implementation on Linux or an appliance. You might not even know what is used under the hood as an OS and 3rd party SMB 3 solution. It could be a storage vendors native SMB 3 implementation on their storage array or simple commodity NAS who bought a 3rd party solution to leverage. It might be high available or in many (most?) cases it is not. It’s hard to know if the 3rd party implements / leverages the full capabilities of the SMB 3 stack as Microsoft does or not. You light not know of there are any bugs in there or not.

You get the picture. If you bank on appliances, find out and test it (trust but verify). But let’s assume its capabilities are on par with what Windows offers and that means the subject being discussed goes for both 3rd party offerings and Windows Server.

When the target is Windows Server we are talking about SMB 3 File Shares that are either Continuous Available or not. For backup targets General Purpose File Shares will do. You could even opt to leverage SOFS (S2D for example). In this case you know what’s implemented in what version and you get bug fixes from MSFT.

When you have continuously available (CA) SMB 3 shares you should be able to sleep sound. SMB 3 has you covered. The risks we are discussing is related to non-CA SMB 3 file shares.

What could go wrong?

Let’s walk through this. When your backup software writes to an SMB 3 share it leverages the SMB 3 client & server in the SMB 3 stack. Unlike when Veeam uses its own data mover, all the cool data persistence stuff is handled by Windows transparently. The backup software literally hands of the job to Windows. Which is why you can also leverage SMB Multichannel and SMB direct with your backups if you so desire. Read Veeam Backup & Replication leverages SMB Multichannel and Veeam Backup & Replication Preferred Subnet & SMB Multichannel for more on this.

If you are writing to a non-CA SMB 3 share your backup software receives the messages the data has been written. Which actually means that the data is cached in the SMB Clients “queue” of data to write but which might not have been written to the storage yet.

For short interruptions this is survivable and for Office and the like this works well and delivers fast performance. If the connection is interrupted or the share is unavailable the queue keeps the data in memory for a while. So, if the connection restores the data can be written. The SMB 3 Client is smart.

However, this has its limits. The data cache in the queue doesn’t exist eternally. If the connectivity loss or file share availability take too long the data in the SMB 3 client cache is lost. But it was not written to storage! To add a little insult to injury the SBM client send back “we’re good” even when the share has been unreachable for a while.

For backups this isn’t optimal. Actually, the alarm bell should start ringing when it is about backups. Your backup software got a message the data has been written and doesn’t know any better. But is not on the backup target. This means the backup software will run into issues with corrupted backups sooner or later (next backup, restores, synthetic full backups, merges, whatever comes first).

Why did they make it this way?

This is OK default behavior. it works just fine for Office files / most knowledge worker client software that have temp files, auto recovery, and all such lovely capabilities and work is mostly individual and interactive. Those applications are resilient to this by nature. Mind you, all my SMB 3 file share deployments are clustered and highly available where appropriate. By “appropriate” I mean when we don’t have off line caching for those shares as a requirement as those too don’t mix well (https://blogs.technet.microsoft.com/filecab/2016/03/15/offline-files-and-continuous-availability-the-monstrous-union-you-should-not-consecrate/). But when you know what your doing it rocks. I can actually failover my file server roles all day long for patching, maintenance & fun when the clients do talk SMB 3. Oh, and it was a joy to move that data to new SANs under the hood. More on that perhaps in another post. But I digress.

You need adequate storage in all uses cases

This is a no brainer. Nothing will save you if the target storage isn’t up to the task. Not the Veeam data move or SMB3 shares with continuous availability. Let’s be very clear about this. Even at the cost-effective side of the equation the storage has to be of sufficient decent quality to prevent data loss. That means decent controllers with battery cached IO as safe guard etc. Whether that’s a SAN or a “simple” raid controller or pass through HBA’s for storage spaces, doesn’t matter. You have to have it. Putting your data on SATA drives without any save guard is sure way of risking data loss. That’s as simple as it gets. You don’t do that, unless you don’t care. And if you care, you would not be reading this!

Can this be fixed?

Well as a non-SMB 3 developer I would say we need an option added that the SMB 3 client can be configured to not report success until that data has been effectively written on the target, or at least has landed somewhere on quality, cache protected storage.

This option does not exist today. I do not work for Microsoft but I know some people there and I’m pretty sure they want to fix it. I’m just not sure how big of a priority it is at the moment. For me it’s important that when a backup application goes to a non-continuous available file share it can request that it will not cache and the SMB Server says “OK” got it, I will behave accordingly. Now the details in the implementation will be different but you get the message?

I would like to make the case that it should be a configurable option. It is not needed for all scenarios and it might (will) have an impact on performance. How big that would be I have no clue. I’m just a blogger who does IT as a job. I’m not a principal PM at Microsoft or so.

If you absolutely want to make sure, use clustered continuous available file shares. Works like a charm. Read this blog Continuous available general purpose file shares & ReFSv3 provide high available backup targets, there is even one of my not so professional videos show casing this.

It’s also important not to panic. Most of you might even never has heard or experienced this. But depending on the use case and the quality of the network and processes you might. In a backup scenario this is not something that makes for a happy day.

The cry wolf crowd

I’ll be blunt. WARNING. Take a hike if you have a smug “Windoze sucks” attitude. If you want to deal dope you shouldn’t be smoking too much of your own stuff, but primarily know it inside out. NFS in all its varied implementations has potential issues as well. So, I’d also do my due diligence with any solution you recommend. Trust but verify, remember?! Actually, an example of one such an issue was given for an appliance with NFS by Veeam. Guess what, every one has issues. Choose your poison, drink it and let other chose theirs. Condescending remarks just make you look bad every time. And guess what that impression tends to last. Now on the positive side, I hear that caching can be disabled on modern NFS client implementations. So, the potential issue is known and is is being addressed there as well.

Conclusion

Don’t panic. I just discussed a potential issue than can occur and that you should be aware off when deciding on a backup target. If you have rock solid networking and great server management processes you can go far without issues, but that’s not 100 % fail proof. As I’m in the business of building the best possible solutions it’s something you need to be aware off.

But know that they can occur, when and why so you can manage the risk optimally. Making Windows Server SMB 3 file shares Continuously Available will protect against this effectively. It does require failover clustering. But at least now you know why I say that when using file shares as backup targets you should leverage continuous available SMB 3 file shares

When you buy appliances or 3rd party SMB 3 solutions, this issue also exists but be extra diligent even with highly available shares. Make sure it works as it should!

I hope Microsoft resolves this issue as soon as possible. I’m sure they want to. They want their products to be the best and fix any possible concerns you might have.

Storage Migration Service or MSFT adds WorkingHardInIT like skills in Windows Server 2019

As readers of my blog will know very well is that one of my track records is that I’m fairly successful in keeping technology debt low. One of the workloads I have always moved successfully over the years are file servers. Right now, almost all, almost everywhere, are the goodness of SMB3 and are high available clustered file server roles. You can read about some of those efforts in some of my blogs:

Sometimes this cans quite a challenge but over the years I have gained the experience and expertise to deal with such projects. Not everyone is in that position. This leads to aging file services, technology debt, security risks, missed opportunities (SMB3 people!) and often even unsupported situations in regards to hardware and software. While since the release of Windows Server 2016 this image has shifted a bit, you can clearly see the pretty depressing state of file services. Windows rules the on premises server world but the OS versions are aging fast.

clip_image002

Image courtesy of SpiceWorks: Server Virtualization and OS Trends

The operating systems are ancient, old or aging and we all know what this means in regards to the SMB version in use.

Now I work hard, effective and efficiently but I cannot be everywhere. Luckily for you Microsoft has a great new capability coming up in Windows Server 2019, the Windows Server Storage Migration Service.

clip_image004

Figure2: the WorkingHardInIT feature officially known as the Windows Server Storage Migration Service. Image courtesy of Microsoft.

Thanks to Ned Pyle and his team’s efforts you can no aspire to me as successful as me when it comes to migrating file services to newer environments. It’s like having WorkingHardInIT in a Windows feature. Isn’t that cool?! If they sold that separately they would make a pretty penny but they luckily for you include it with their new server OS version.

Storage Migration Servers deals with many of the usual problems and intricacies of a storage / file service migration. It will handle things like share settings, security settings, network addresses and names, local accounts, encrypted data, files that are in use etc. To handle you project you have a GUI and PowerShell automation at your disposal. Windows Server 2019 is still being perfected and you can still provide feedback while testing this feature.

Things to note for now (bar the requirements for testing as described in Ned Pyle’s blog) are:

Supported source operating systems VM or hardware (to migrate from):

  • Windows Server 2003
  • Windows Server 2008
  • Windows Server 2008 R2
  • Windows Server 2012
  • Windows Server 2012 R2
  • Windows Server 2016
  • Windows Server 2019 Preview

Supported destination operating system VM or hardware (to migrate to):

  • Windows Server 2019 Preview*

* Technically your destination for migration can be any Windows Server OS, but we aren’t testing that currently. And when we release the faster and more efficient target proxy system in a few weeks it will only run on Windows Server 2019.

In my humble opinion that almost has to be Storage Replica technology being leveraged. Something that has been proven to be very much more efficient that copying files. Microsoft already promotes Storage Replica to other server or itself as a way of moving data to a new LUN.

Anyway, this is a cool feature that should grab your attention. Thank you Ned Pyle and team! And while you’re busy putting this great capability into Windows Server 2019 (Standard and Datacenter) consider doing the same for full featured Storage Replica in Windows Server 2019 Standard .

Windows Data Deduplication and Cluster Operating System Rolling Upgrades

Introduction

Have you considered when Windows data deduplication and cluster operating system rolling upgrades from Windows Server 2012 R2 to Windows Server 2016 Clusters are discussed we often hear people talk about Hyper-V or Scale Out File Server clusters, sometimes SQL but not very often for a General-Purpose File Share server with continuous availability. Which is the kind I’ve done quite a number of actually.

clip_image002

Being active in an industry that produces and consumes file data in large quantities and sizes we have been early implementers of Windows cluster for General-Purpose File Shares with continuous availability. This provides us the benefits of SMB 3, ODX for both the clients and IT Operations for workloads that are not suited for Scale Out File Server deployments.

As such we have dealt with a number of Windows 2012 R2 GPFS with GA cluster that we wanted to move to Windows Serer 2016. Partially to keep the environment up to date and partially because we want to leverage the new Windows deduplication capabilities that this OS version offers. The SOFS and Hyper clusters that I upgraded didn’t have data deduplication enabled.

The process to perform this upgrade is straight forward and has been documented well by others as well as by me in regards to issues we saw in the field . We even dove behind the scenes a bit in Cluster Operating System Rolling Upgrade Leaves Traces. I have also presented on this topic in public at conferences around Europe (Ireland, Germany and Belgium) as part of our community contributions. No surprises there.

Test your assumptions

This is a scenario you can perform without any downtime for your clients when all things go well. And normally it should. I have upgraded a couple of Scale Out File Server (SOFS) and General-Purpose File Server (GPFS) cluster with Continuous availability now and those went very well. Just make sure your cluster is perfectly healthy at the start.

Naturally there are some check you need to make that are outside of Microsoft scope:

I’m pretty sure you have good backups for your file data and you should check this works with Windows Server 2016 and how it reacts during the upgrade while the server is in mixed mode. Perhaps you will or won’t be able to run backups or restore data. Check and know this.

Verify your storage solution supports and words with Windows Server 2016. It sounds obvious but I have seen people forget such details.

Another point of attention is any Anti-Virus you might have running on the file server cluster nodes. Verify that this is fully supported on Windows Server 2016. On top of that validate that the Anti-Virus still works well with ODX so you don’t run into surprises there. Don’t assume anything.

Check if the server and it components (HBA, NICs, BIOS, …) its firmware and drivers support Windows Server 2016. Sure, the rolling upgrade allows for some testing before committing but that doesn’t mean you should go ahead blindly into the unknown.

Make sure your nodes are fully patched before and after the upgrade of a cluster node.

As the file server cluster is already leveraging SMB 3 with continuous availability al the prerequisites to make that work are already take care of. If you are upgrading a File server cluster without continuous availability and are planning to start using this, that’s another matter and you’ll need to address any issues. You can do this before or after moving to Windows Server 2016. This means you’d move to a solution before you upgrade or after you have performed the upgrade to Windows Server 2016.

You can take a look at my blogs on this subject from the Windows 2012 R2 time frame such as More Tips On Dealing With Removing Short File Names When Migrating To a SMB3 Transparent Failover File Server Cluster, Migrate an old file server to a transparent failover file server with continuous availability and SMB 3, ODX, Windows Server 2012 R2 & Windows 8.1 perform magic in file sharing for both corporate & branch offices

Data deduplication takes some extra consideration

I have blogged before on how Windows Server 2016 Data Deduplication performs and scales better than it did it Windows Server 2012 R2. This also means that it works at least partially different than it did Windows Server 2012 R2. You can see this in some of the updates that came out in regards to a data corruption bug with data deduplication which only affected Windows Server 2016.

clip_image003

Given this difference, what would happen if you fail over a LUN with deduplication enable from Windows Server 2012 to Windows Server 2016 and vice versa? That’s the question I had to consider when combining Windows data deduplication and cluster operating system rolling upgrades for the first time.

Windows Server 2016 is backward compatible and will work just fine with a LUN that from and Windows Server 2012 Server that has Windows data deduplication enabled. The reverse is not the case. Windows Server 2012 R2 is not forward compatible. When dealing with data deduplication in an Operating System Rolling Upgrade scenario I’m extra careful as I cannot guarantee any LUN movement scenario will go well. With a standalone server

Once I have failed over a LUN to Windows Server 2016 node in a mixed cluster I avoid moving it back to a Windows Server 2012 R2 node in that cluster. I only move them between Windows Server 2016 nodes when needed.

I move through the rolling upgrade as fast as I can to minimize the time frame in which a LUN with data deduplication could end up moving from a Windows Server 2016 o a Windows Server 2012 cluster node.

Should I need to reverse the Operating System Rolling Upgrade to end up with a Windows Server 2012 R2 cluster again I’ll make absolutely sure I can restore the data from LUNs with data deduplication from backup and/or a snapshot from a SAN or such. You cannot guarantee that this will work out fine. So be prepared.

For “standard” non deduplicated NTFS LUNs you can fail back if needed. When data deduplication is enabled you should try to avoid that and be prepared to restore data if needed.

Final advise is always the same

Even when you have tested your upgrade scenario and made sure your assumptions are correct you must have a way out. And as always, “One is none, two is one”.

As always during such endeavors you need to make sure that you have a roll back scenario in things do not work out. You must also have a fail back plan for when things turn really bad. For most scenarios has the ability to return to the original situation built in. But things can go wrong badly and Murphy’s Law does apply. So also have the backups and restore verified just in case.

The last thing you need after a failed upgrade is telling your customer or employer “it almost worked” but unfortunately, they’ve lost that 200TB of continuous available data. Better next time doesn’t really cut it.