Optimize the Veeam preferred networks backup initialization speed

When Veeam preferred networks cause slow backup initialization speeds

When using preferred networks in Veeam you choose to use another than the default host network for backups and restores. In this post, we’ll discuss how to optimize the Veeam preferred networks backup initialization speed because we aim for optimal performance. TL-DR: You need to provide connectivity to the preferred networks for the Veeam Backup & Replication server. It seems a common mistake I run into every now and then. Ultimately it makes people think Veeam is slow. No, it is just a configuration mistake.

Why use a preferred network?

Backups can fill up a 1Gbps pipe very fast. Many people still use 1Gbps networking as default connectivity to the hosts. Even when they leverage 10Gbps or better it is often in a converged network setup. This means that only part of the bandwidth goes to host connectivity. Few have 10Gbps for “just” host connectivity. This means it makes sense to select a different higher bandwidth network for backup and restore traffic.

Hence for high volume, high-performance backup and restores it is smart to look for a bigger pipe to leverage. Some environments have dedicated backup networks at 10Gbps or better. But we find way more high bandwidth networks for other purposes. In Hyper-V environments, you’ll have those for SMB networking like CSV, Live Migration variants and storage replication. Hyper-Converged Infrastructure deployments use these networks for storage as well. With S2D you’ll find more and more 25/50/100Gbps. All these can be leveraged as a preferred backup network in Veeam

Setting up a preferred network

Setting up a preferred network is easy. First of all, you figure out which network to use. You then add those to the preferred networks as follows:

In file menu select “Network Traffic Rules”

Optimize the Veeam preferred network backup initialization speed

Click “Add” and specify the source IP as well as the target IP range. You can op to encrypt the traffic and /or set a bandwidth limit.

Optimize the Veeam preferred network backup initialization speed

There is no need to have the preferred network registered in DNS. It will work fine without.

I hope it is clear that the source (Hyper-V Hosts), the target (backup repository or the extends in a Scale-Out Backup Repository) and any Off Host Proxies need connectivity to the preferred network(s). If you leverage WAN accelerators, Gateways Servers, log shipping servers than these also need access. Last but not least you should also make sure that the Veeam Backup Server (VBR) has access to the preferred networks. This is one that a lot of people seem to forget. May because it is most often a VM if it is not a shared role on the repository server or such and things do work without it.

When the VBR server has no access to the preferred networks things still work but initialization of the backup and restore jobs is a lot slower. Let’s test this.

Slow Initialization of backup and restore jobs

As a result of using preferred networks you might probably notice the following:

  • First of all, we notice a slow down in the overall initialization of the backup and restore job.
  • This manifests itself in a slow start of the actual VM backup/restore and reducing the number of simultaneous backups/restores of VMs within a job.

Without the VBR server having connectivity to the preferred networks

23:54 to complete the backup job (no connectivity to the preferred network)

Optimize the Veeam preferred networks backup initialization speed

With the VBR server having connectivity to the preferred networks. Notice how smooth and continuous the throughput is.

07:55 to complete the backup job (with connectivity to the preferred network) => 3 times as fast.

When you look into the Veeam backup logs for this job you will find at various stages attempts by the VBR server to connect to the preferred networks. If it can’t it has to wait until it times out. You see entries like:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.10.110.2:2509 (System.Net.Sockets.SocketException)

Optimize the Veeam preferred network backup initialization speed
Just a small part of all the NetSocket time out you will find for every single VM in the job. Here VBR is trying to connect to one of the extends in the SOBR.

This happens for every file in the backups (config files and disks) for every extend in the Scale-Out Backup Repository (per VM backup chain). This slows down the entire backup job tremendously.

Conclusion

I always make sure that the VBR servers in my environments have preferred network connectivity. Consequently, initialization is faster for both backups and restores. Test it out for yourself! It is the first thing I check when people complain of really slow backup. Do they have preferred networks set up? Check if the VBR server has connectivity to them!

Correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backups

Introduction

It’s not a secret that while guest clustering with VHDSets works very well. We’ve had some struggles in regards to host level backups however. Right now I leverage Veeam Agent for Windows (VAW) to do in guest backups. The most recent versions of VAW support Windows Failover Clustering. I’d love to leverage host level backups but I was struggling to make this reliable for quite a while. As it turned out recently there are some virtual machine permission issues involved we need to fix. Both Microsoft and Veeam have published guidance on this in a KB article. We automated correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup

The KB articles

Early August Microsoft published KB article with all the tips when thins fail Errors when backing up VMs that belong to a guest cluster in Windows. Veeam also recapitulated on the needed conditions and setting to leverage guest clustering and performing host level backups. The Veeam article is Backing up Hyper-V guest cluster based on VHD set. Read these articles carefully and make sure all you need to do has been done.

For some reason another prerequisite is not mentioned in these articles. It is however discussed in ConfigStoreRootPath cluster parameter is not defined and here https://docs.microsoft.com/en-us/powershell/module/hyper-v/set-vmhostcluster?view=win10-ps You will need to set this to make proper Hyper-V collections needed for recovery checkpoints on VHD Sets. It is a very unknown setting with very little documentation.

But the big news here is fixing a permissions related issue!

The latest addition in the list of attention points is a permission issue. These permissions are not correct by default for the guest cluster VMs shared files. This leads to the hard to pin point error.

Error Event 19100 Hyper-V-VMMS 19100 ‘BackupVM’ background disk merge failed to complete: General access denied error (0x80070005). To fix this issue, the folder that holds the VHDS files and their snapshot files must be modified to give the VMMS process additional permissions. To do this, follow these steps for correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup.

Determine the GUIDS of all VMs that use the folder. To do this, start PowerShell as administrator, and then run the following command:

get-vm | fl name, id
Output example:
Name : BackupVM
Id : d3599536-222a-4d6e-bb10-a6019c3f2b9b

Name : BackupVM2
Id : a0af7903-94b4-4a2c-b3b3-16050d5f80f

For each VM GUID, assign the VMMS process full control by running the following command:
icacls <Folder with VHDS> /grant “NT VIRTUAL MACHINE\<VM GUID>”:(OI)F

Example:
icacls “c:\ClusterStorage\Volume1\SharedClusterDisk” /grant “NT VIRTUAL MACHINE\a0af7903-94b4-4a2c-b3b3-16050d5f80f2”:(OI)F
icacls “c:\ClusterStorage\Volume1\SharedClusterDisk” /grant “NT VIRTUAL MACHINE\d3599536-222a-4d6e-bb10-a6019c3f2b9b”:(OI)F

My little PowerShell script

As the above is tedious manual labor with a lot of copy pasting. This is time consuming and tedious at best. With larger guest clusters the probability of mistakes increases. To fix this we write a PowerShell script to handle this for us.

#Didier Van Hoye
#Twitter: @WorkingHardInIT 
#Blog: https://blog.Workinghardinit.work
#Correct shared VHD Set disk permissions for all nodes in guests cluster

$GuestCluster = "DemoGuestCluster"
$HostCluster = "LAB-CLUSTER"

$PathToGuestClusterSharedDisks = "C:\ClusterStorage\NTFS-03\GuestClustersSharedDisks"


$GuestClusterNodes = Get-ClusterNode -Cluster $GuestCluster

ForEach ($GuestClusterNode in $GuestClusterNodes)
{

#Passing the cluster name to -computername only works in W2K16 and up.
#As this is about VHDS you need to be running 2016, so no worries here.
$GuestClusterNodeGuid = (Get-VM -Name $GuestClusterNode.Name -ComputerName $HostCluster).id

Write-Host $GuestClusterNodeGuid "belongs to" $GuestClusterNode.Name

$IcalsExecute = """$PathToGuestClusterSharedDisks""" + " /grant " + """NT VIRTUAL MACHINE\"+ $GuestClusterNodeGuid + """:(OI)F"
write-Host "Executing " $IcalsExecute
CMD.EXE /C "icacls $IcalsExecute"

} 

Below is an example of the output of this script. It provides some feedback on what is happening.

Correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup

Correcting the permissions on the folder with VHDS files & checkpoints for host level Hyper-V guest cluster backup

PowerShell for the win. This saves you some searching and typing and potentially making some mistakes along the way. Have fun. More testing is underway to make sure things are now predictable and stable. We’ll share our findings with you.

Storage-level corruption guard

One of the many gems in Veeam Backup & Replication v9 is the introduction of storage-level corruption guard for primary backup jobs. This was already a feature for backup copy jobs. But now we have the option of periodically scanning or backup files for storage issues.It works like this: if any corrupt data blocks are found the correct ones are retrieved from the primary storage and auto healed. Ever bigger disks, vast amounts of storage and huge amounts of data mean more chances of bit rot. It’s an industry wide issue. Microsoft tries to address this with ReFS and storage space for example where you also see an auto healing mechanism based on retrieving the needed data from the redundant copies.

We find this option on the maintenance tab of the advanced setting for the storage settings of a backup job, where you can enable it and set a schedule.

image

The idea behind this is that this is more efficient than doing periodical active full backups to protect against data corruption. You can reduce them in frequency or, perhaps better, get rid of those altogether.

Veeam describes Storage-level corruption guard as follows:

image

Can it replace any form of full backup completely? I don’t think so. The optimal use case seems to lie in the combination of storage-level corruption guard with periodic synthetic backups. Here’s why. When the bit rot is in older data that can no longer be found in the production storage, it could fail at doing something about it, as the correct data is no longer to be found there. So we’ll have to weigh the frequency of these corruption guard scans to determine what reduction if making full backups is wise for our environment and needs. The most interesting scenario to deal with this seems to be the one where we indeed can eliminate periodic full backups all together. To mitigate the potential issue of not being able to recover, which we described above, we’d still create synthetic full backups periodically in combination with the Storage-level corruption guard option enabled. Doing this gives us the following benefits:

  • We protect our backup against corruption, bit rot etc.
  • We avoid making periodic full backups which are the most expensive in storage space, I/O and time.
  • We avoid having no useful backup let in the scenario where Storage-level corruption guard needs to retrieve data from the primary storage that is no longer there.

To me this seems to be a very interesting scenario. To optimize backup times and economies. In the end it’s all about weighing risks versus cost and effort. Storage-level corruption guard gives us yet another tool to strike a better balance between those two. I have enabled it on a number of the jobs to see how it does in real life. So far things have been working out well.

Dell Storage Replay Manager 7.6.0.47 for Compellent 6.5

Recently as a DELL Compellent customer version 7.6.0.47 became available to us. I download it and found some welcome new capabilities in the release notes.

  • Support for vSphere 6
  • 2024 bit public key support for SSL/TLS
  • The ability to retry failed jobs (Microsoft Extensions Only)
  • The ability to modify a backup set (Microsoft Extensions Only)

The ability to retry failed jobs is handy. There might be a conflicting backup running via a 3rd party tool leveraging the hardware VSS provider. So the ability to retry can mitigate this. As we do multiple replays per day and have them scheduled recurrently we already mitigated the negative effects of this, but this only gibes us more options to deal with such situations. It’s good.

image

The ability to modify a backup set is one I love. It was just so annoying not to be able to do this before. A change in the environment meant having to create a new backup set. That also meant keeping around the old job for as long as you wanted to retain the replays associated with that job. Not the most optimal way of handling change I’d say, so this made me happy when I saw it.

image

Now I’d like DELL to invest a bit more in make restore of volume based replays of virtual machines easier. I actually like the volume based ones with Hyper-V as it’s one snapshot per CSV for all VMs and it doesn’t require all the VMs to reside on the host where we originally defined the backup set. Optimally you do run all the VMs on the node that own the CSV but otherwise it has less restrictions. I my humble opinion anything that restricts VM mobility is bad and goes against the grain of virtualization and dynamic optimization. I wonder if this has more to do with older CVS/Hyper-V versions, current limitations in Windows Server Hyper-V or CVS or a combination. This makes for a nice discussion, so if anyone from MSFT & the DELL Storage team responsible for Repay Manager wants to have one, just let me know Smile 

Last but not least I’d love DELL to communicate in Q4 of 2015 on how they will integrate their data protection offering in Compellent/Replay manager with Windows Server 2016 Backup changes and enhancements. That’s quite a change that’s happing for Hyper-V and it would be good for all to know what’s being done to leverage that. Another thing that is high on my priority for success is to enable leveraging replays with Live Volumes. For me that’s the biggest drawback to Live Volumes: having to chose between high/continuous availability and application consistent replays for data protection and other use cases).

I have some more things on my wish list but these are out of scope in regards to the subject of this blog post.