Replacing a failed disk in a stand-alone Storage Spaces with Mirror Accelerated Parity

Introduction

I use Storage Spaces in various environments for several use cases, even with clients (see Move Storage Spaces from Windows 8.1 to Windows 10). In this blog post, we’ll walk through replacing a failed disk in a stand-alone Storage Spaces with Mirror Accelerated Parity. I have a number of DELL R740XD stand-alone servers with a ton of storage that I use as backup targets. See A compact, high capacity, high throughput, and low latency backup target for Veeam Backup & Replication v10 for a nice article on a high-performance design with such servers. They deliver the repositories for the extents in Veeam Backup & Replication Scale-out Backup Repositories. They have NVMe drives for the performance tier in Storage Spaces with Mirror Accelerated Parity.

Even with the best hardware and vendor, a disk can fail, and yes, it happened to one of our NVMe drives. Reseating the disk did not help and we were one disk short in Device Manager. So yes, the disk was dead (or worse, the bus where it was seated, but that is less likely).


So let’s take a look by running

Get-PhysicalDisk 

I immediately see that we have an NVMe disk that has lost communication. It is gone and no longer displays a disk number. It seems to be broken.
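To make the failed disk stand out, it can help to limit the output to the columns that matter here. This is a minimal sketch and not part of the original procedure. All property names are standard Get-PhysicalDisk output properties; DeviceId is the column where the disk number normally shows up.

```powershell
# List all physical disks with only the columns that help spot a failed one.
# A disk that lost communication typically shows an empty DeviceId and a
# non-OK OperationalStatus.
Get-PhysicalDisk |
    Sort-Object -Property BusType, FriendlyName |
    Format-Table -AutoSize -Property DeviceId, FriendlyName, SerialNumber, BusType, MediaType, OperationalStatus, HealthStatus, Usage, CanPool
```

The serial number is handy to have when you talk to the hardware vendor about a replacement.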


That means we need to get rid of it in the storage spaces pool so we can replace it.

Getting rid of the failed disk properly

I put the disk that lost communication into a variable

$ProblemDisk = Get-PhysicalDisk | Where-Object OperationalStatus -like '*lost*'

We then retire the problematic disk

$ProblemDisk | Set-PhysicalDisk -Usage Retired

We then run Get-PhysicalDisk again and yes, we see the disk was retired.
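Before removing a retired disk, a common extra step is to check the state of the virtual disks and let Storage Spaces rebuild the data on the remaining disks. This step is not part of the walkthrough in this post, so treat the following as a hedged sketch of what that check could look like, using the standard Get-VirtualDisk, Repair-VirtualDisk and Get-StorageJob cmdlets.

```powershell
# Check whether any virtual disk is degraded now that the disk is retired.
Get-VirtualDisk |
    Format-Table -AutoSize -Property FriendlyName, OperationalStatus, HealthStatus

# Start a repair, as a background job, for every virtual disk that is not healthy.
Get-VirtualDisk |
    Where-Object HealthStatus -ne 'Healthy' |
    Repair-VirtualDisk -AsJob

# List the storage jobs that are running as a result.
Get-StorageJob
```

If every virtual disk already reports Healthy, the repair pipeline simply does nothing.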


Now grab that retired disk and save it to a variable by running

$RetiredDisk = Get-PhysicalDisk | Where-Object Usage -like '*Retired*'

Now remove the retired disk from the storage pool by running

Get-StoragePool -FriendlyName BackupStoragePool | Remove-PhysicalDisk -PhysicalDisk $RetiredDisk

Let this complete and check again with Get-PhysicalDisk. You will see the problematic disk is gone. Note that there are only 7 NVMe disks left.


The failed disk does not show up as an unrecognized disk that is somehow still visible to the OS, so we cannot try to reset it to get it back into action. We need to replace it, so we request a replacement disk from DELL support and swap them out.

Putting the new disk into service

Now that we have our new disk, we want to add it to the storage pool. You should now see the new disk in Disk Management as online and not initialized. It also shows up when you select Add Physical Disk in the storage pool in Server Manager.

But we were doing so well in PowerShell so let’s continue there. We will add the new disk to the storage pool. Run

$DiskToAddToPool = Get-PhysicalDisk | Where-Object CanPool -eq $true

Get-StoragePool -FriendlyName BackupStoragePool | Add-PhysicalDisk -PhysicalDisk $DiskToAddToPool

When you run Get-PhysicalDisk again, you will see that there are no disks left that can be pooled, meaning they are all in the storage pool. And we have 8 NVMe disks again. Good!
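If you prefer a quick count over reading the full table, the check above can be sketched as follows. It assumes the pool name BackupStoragePool used throughout this post and only uses standard Storage module cmdlets and properties.

```powershell
# Count the disks in the pool per bus type (the NVMe disks are grouped under NVMe).
Get-StoragePool -FriendlyName BackupStoragePool |
    Get-PhysicalDisk |
    Group-Object -Property BusType |
    Format-Table -AutoSize -Property Name, Count

# This should return nothing when every disk has been added to a pool.
Get-PhysicalDisk -CanPool $true
```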

Now run

Optimize-StoragePool -FriendlyName BackupStoragePool

And let it run. You can check up on its progress via this little script.

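while ($true) {
    # Show the current storage jobs, then wait 10 seconds before polling again.
    # This loop runs forever; stop it with Ctrl+C.
    Get-StorageJob
    Write-Host 'Wait'
    Start-Sleep -Seconds 10
}

Keeping an eye on the storage pool optimization process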
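The loop above runs until you stop it by hand. As a variation, here is a sketch that stops polling by itself. It assumes the optimization job has already started and shows up with the JobState value Running. If the job is still queued when you start the script, the loop ends right away.

```powershell
# Poll every 10 seconds and stop once no storage job reports Running anymore.
do {
    $runningJobs = Get-StorageJob | Where-Object JobState -eq 'Running'
    $runningJobs |
        Format-Table -AutoSize -Property Name, JobState, PercentComplete |
        Out-Host
    if ($runningJobs) { Start-Sleep -Seconds 10 }
} while ($runningJobs)

Write-Host 'No running storage jobs left.'
```

Run it in a second PowerShell session while Optimize-StoragePool is busy in the first one.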

That’s it. All is well again and rebalanced. The optimization also ensures that the storage capacity contributed by the replaced disk will be available in the performance tier when I want to create an extra virtual disk. This is Storage Spaces at its best, giving me the opportunity to leverage NVMe with other disks while maximizing the benefits of ReFS.

For more info on stand-alone Storage Spaces and PowerShell, see Deploy Storage Spaces on a stand-alone server.

Conclusion

As you have seen, replacing a failed disk in a stand-alone Storage Spaces with Mirror Accelerated Parity is not too hard to do. You just need to wrap your head around how Storage Spaces works and investigate the commands a little. For that I recommend practicing on a virtual machine.

SFP+ and SFP28 compatibility

Introduction

25Gbps (SFP28) is en route to displace 10Gbps (SFP+) from its leading role as the workhorse in the datacenter. That means that 10Gbps is slowly but surely becoming “the LOM option”, taking over the role and place 1Gbps has held for many years. Where extension slots are concerned, we see 25Gbps cards rise tremendously in popularity. The same is happening on the switches, where 25-100Gbps ports are readily available. As this transition takes place and we start working on acquiring 25Gbps or faster gear, the question about SFP+ and SFP28 compatibility arises for anyone who’s involved in planning this.


Who needs 25Gbps?

When I got really deep into 10Gbps about 7 years ago, I was considered a bit crazy and accused of over-delivering. That was until they saw the speed of a live migration. From Windows Server 2012 onward, that was driven home even more with shared nothing and storage live migration, SMB 3 Multichannel and SMB Direct.

On top of that, Storage Spaces and SOFS came onto the storage scene in the Microsoft Windows Server ecosystem. This led us to S2D and Storage Replica in Windows Server 2016 and later. This meant that the need for more bandwidth, higher throughput and low latency became ever more obvious. Microsoft has a rather extensive collection of features & capabilities that leverage SMB 3 and as such can leverage RDMA.

In this time frame we also saw the strong rise of All Flash Array solutions with SSD and NVMe. Today we even see storage class memory come into the picture. All this means even bigger needs for high throughput at low latency, so the trend for ever faster Ethernet is not over yet.

What does this mean?

That means that 10Gbps is slowly but surely becoming the LOM option and is passing on to the role 1Gbps has held for many years. In our extension slots we see 25-100Gbps cards rise in popularity. The same is happening on the switches where we see 25, 50, 100Gbps or even higher. I’m not sure if 50Gbps is ever going to be as popular but 25Gbps is for sure. In any case I am not crazy but I do know how to avoid tech debt and get as much long term use out of hardware as possible.

When it comes to the optic components, SFP+ is commonly used for 10Gbps. This provides a path to 40Gbps and 100Gbps via QSFP. For 25Gbps we have SFP28 (1 channel or lane for 25Gbps). This gives us a path to 50Gbps (2*25Gbps, two lanes) and to 100Gbps (4*25Gbps, four lanes) via QSFP28. In the end this is a lot more economical. But let’s look at SFP+ and SFP28 compatibility now.
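The lane arithmetic behind those numbers is simple enough to write down. The snippet below is purely illustrative: the table is my own summary, and the QSFP+ entry (4 lanes of 10Gbps) is added for comparison and is not spelled out above.

```powershell
# Per-lane speed in Gbps and lane count for each form factor (illustrative table).
$formFactors = [ordered]@{
    'SFP+'   = @{ LaneGbps = 10; Lanes = 1 }
    'QSFP+'  = @{ LaneGbps = 10; Lanes = 4 }
    'SFP28'  = @{ LaneGbps = 25; Lanes = 1 }
    'QSFP28' = @{ LaneGbps = 25; Lanes = 4 }
}

# Total bandwidth is simply the number of lanes times the speed per lane.
$totals = @{}
foreach ($name in $formFactors.Keys) {
    $spec = $formFactors[$name]
    $totals[$name] = $spec.Lanes * $spec.LaneGbps
    '{0,-7} {1} x {2}Gbps = {3}Gbps' -f $name, $spec.Lanes, $spec.LaneGbps, $totals[$name]
}
```

With the same four lanes, moving from 10Gbps to 25Gbps per lane takes a port from 40Gbps to 100Gbps, which is where the economics come from.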

SFP+ and SFP28 compatibility

When it comes to SFP+ and SFP28 compatibility we’re golden. SFP+ and SFP28 share the same form factor & are “compatible”. The moment I learned that SFP28 shares the same form factor with SFP+, I was hopeful that they would only differ in speed. And indeed, that hope became a sigh of relief when I experimentally demonstrated to myself the following things I had read:

  1. I can plug an SFP28 module into an SFP+ port
  2. I can plug an SFP+ module into an SFP28 port
  3. Connectivity is established at the lowest common denominator, which is 10Gbps
  4. The connectivity is functional but you don’t gain the benefits SFP28 brings to the table.
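On a Windows server you can verify the negotiated speed from the host side. This is a minimal sketch using the standard Get-NetAdapter cmdlet. A 25Gbps NIC connected to a 10Gbps switch port should report a LinkSpeed of 10 Gbps.

```powershell
# Show the negotiated link speed of every physical network adapter.
Get-NetAdapter -Physical |
    Sort-Object -Property Name |
    Format-Table -AutoSize -Property Name, InterfaceDescription, Status, LinkSpeed
```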

Compatibility for migrations & future proofing

For a migration path that is phased over time this is great news as you don’t need to have everything in place right away from day one. I can order 25Gbps NICs in my servers now, knowing that they will work with my existing 10Gbps network. They’ll be ready to roll when I get my switches replaced 6 months or a year later. Older servers with 10Gbps SFP+ that are still in production when the new network gear arrives can keep working on new SFP28 network gear.

  • SFP+: 10Gbps
  • SFP28: 25Gbps, but it can go up to 28, so the name is SFP28, not SFP25. Note that SFP28 can handle 25Gbps, 10Gbps and even 1Gbps.
  • QSFP28: 100Gbps, which can be split into 4*25Gbps or 2*50Gbps, gives you flexibility and port density.
  • 25Gbps / SFP28 is the new workhorse to deliver more bandwidth, better error control, less crosstalk and an economically sound upgrade path.

Do note that SFP+ modules will work in SFP28 ports and vice versa but you have to be a bit careful:

  • Fix the port speed when you’re not running at the default speed.
  • On SFP28 modules you might need to disable options such as forward error correction.
  • Make sure a 10Gbps switch is OK with 25Gbps cables; it might not be.
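How you fix the speed or disable forward error correction depends on the NIC driver and the switch vendor, so there is no single command for it. On the Windows side, a reasonable first step is to list what the driver exposes. The sketch below assumes a hypothetical adapter name NIC1, and the filter words are guesses at how a driver might label these settings. Your driver may use different names or not expose them at all.

```powershell
# Hypothetical adapter name, replace it with one listed by Get-NetAdapter.
$adapterName = 'NIC1'

# List driver settings whose display name hints at link speed or error correction.
# ValidDisplayValues shows which values the driver accepts for each setting.
Get-NetAdapterAdvancedProperty -Name $adapterName |
    Where-Object DisplayName -match 'speed|fec|error correction' |
    Format-List -Property DisplayName, DisplayValue, ValidDisplayValues
```

The switch side is configured with the switch vendor's own tooling, which is out of scope here.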

If you have all your gear from a vendor specializing in RDMA technology like Mellanox, the gear detects all this and takes care of everything for you. Between vendors and with 3rd party cables, pay extra attention to verifying all will be well.

SFP+ and SFP28 compatibility is also important for future proofing upgrade paths. When you buy and introduce new network gear it is nice to know what will work with what you already have and what will work with what you might or will have in the future. Some people will get all new network switches in at once while others might have to wait for a while before new servers with SFP28 arrive. Older servers might be around and will not force you to keep older switches around just for them.

SFP28 / QSFP28 provides flexibility

Compatibility is also important for purchase decisions as you don’t need to match 25Gbps NIC ports to 25Gbps switch ports. You can use QSFP28 cables and split them to 4 * 25Gbps SFP28.


The same goes for 50Gbps, which is 100Gbps QSFP28 to 2 * 50Gbps QSFP28.


This means you can have switch port density and future proofing if you so desire. Some vendors offer modular switches where you can mix port types (for example, the Dell EMC Networking S6100-ON).

Conclusion

More bandwidth at less cost is a no-brainer. It also makes your bean counters happy as this is achieved with fewer switches and cables. That also translates to less space in a datacenter, less power consumption and less cooling. And the less material you have, the less it costs in operational expenses (management and maintenance). This is only offset partially by our ever-growing need for more bandwidth. As converged networking matures and becomes better, that also helps with the cost, even where economies of scale don’t matter that much. The transition to 25Gbps and higher is facilitated by SFP+ and SFP28 compatibility and that is good news for all involved.

ReFS Supported Deployment Scenarios Updated

Introduction

Some support statements for ReFS have been updated recently. These reflect well over a year of me, fellow MVPs and others testing and providing feedback to Microsoft. For all practical purposes I’m talking about ReFSv3, which was introduced with Windows Server 2016. Read up on this because that’s what I’m discussing here: Resilient File System (ReFS) overview

As many of you know, the supported storage deployment options for ReFS have “fluctuated” a bit. At one point ReFS was limited to Storage Spaces and standalone disks only. That meant no RAID controllers and no FC or iSCSI LUNs via a SAN, whether that was a high-end one or an entry-level one that you normally only use for backup purposes.

I was never really satisfied with the reasons why and I kept being a passionate advocate for a decent explanation, as tying a file system with the capabilities and potential of ReFS to almost a single storage solution (S2D, and yes that’s a very good HCI offering) isn’t going to help proliferate the goodness of ReFS around the globe.

I was not alone. Many others, amongst them fellow MVPs Anton Gostev (Senior Vice President, Product Management at Veeam and an industry heavyweight when it comes to credibility and technical skill), Carsten Rachfahl and Jan Kappen (both at Rachfahl IT-Solutions), were arguing the case for broader ReFS support. Last week we got the news that the ReFS deployment documentation had been revised. Guess what? Progress! A big thank you to Andrew Hansen for taking the time to hear us plead our case, listen to our testing results and passionate feedback. He picked up the ball, ran with it and delivered! Let’s take a look.

ReFS Storage Deployment Options

Storage Spaces Direct

Deploying ReFS on Storage Spaces Direct is recommended for virtualized workloads or network-attached storage. This is well known and is used for a Hyper Converged Infrastructure and Converged (SOFS) solution (Hyper-V, IIS, SQL, User Profile Disks and even archival or backup targets). You can deploy it with simple, mirrored (2-way or 3-way), parity or Mirror accelerated parity volumes.

Storage Spaces

Storage Spaces supports local non-removable direct-attached disks via BusTypes SATA, SAS, NVMe, or attached via HBA (aka RAID controller in pass-through mode). You can deploy it with simple, mirrored (2-way or 3-way) or parity volumes. Do note that this can be non-shared as well as shared Storage Spaces (shared SAS enclosures). The latter is the highly available solution with Storage Spaces we had before Windows Server 2016 added S2D.

Basic disks

Deploying ReFS on basic disks is best suited for applications that implement their own software resiliency and availability solutions. Applications that introduce their own resiliency and availability software solutions can leverage integrity-streams, block-cloning, and the ability to scale and support large data sets. A poster child for this use case is an Exchange DAG.

Now it is important to note that basic disks with ReFS are supported with local non-removable direct-attached disks via BusTypes SATA, SAS, NVMe, or RAID. So yes, you can have RAID 1, 5, 6 or 10 and make the storage redundant. Now, be smart, ReFS is great but it is not magic. If your workload requires redundancy and high availability you should provide it. This is no different when you use NTFS. When you have shared PCI RAID controllers (which can be redundant like in a DELL VRTX) these can be used as well to create high availability deployments with shared storage.
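To see which disks in a server fall under those bus types, you can filter the output of Get-Disk. This is a rough sketch only: it checks the bus type, not whether a disk is removable, and it is no substitute for reading the support statement.

```powershell
# Bus types named in the support statement for ReFS on basic disks.
$supportedBusTypes = 'SATA', 'SAS', 'NVMe', 'RAID'

# List the disks the OS sees on one of those bus types.
Get-Disk |
    Where-Object { $_.BusType -in $supportedBusTypes } |
    Format-Table -AutoSize -Property Number, FriendlyName, BusType, PartitionStyle, Size
```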

SAN Storage

You can also use ReFS with a SAN over FC or iSCSI. Normally those are always configured with some form of storage redundancy. You can consume the ReFS SAN storage on stand-alone, member or clustered servers for high availability, as long as you use that storage for supported use cases. For example, it is and remains not supported to put knowledge worker data on SOFS shares, no matter what the underlying storage for ReFS or NTFS volumes is. For backups this can be leveraged to build some very capable solutions.

What were the concerns that made ReFS Support so limited at a given point in time?

Well, one of them was confusion and concerns around how data gets flushed and persisted with non-Storage Spaces and simple disks. A valid concern, but one you have with any file system, so any storage array or controller needs to handle this well. As it turns out, any decent piece of storage hardware/controller that’s on the Microsoft Hardware Compatibility List and is certified does its job well enough to guarantee this happens correctly. So, any certified OEM SAN, from entry-level ones to high-end enterprise-grade gear, is supported. Just like any good (certified) RAID controller. Those are backed with battery-backed caches that can survive downtime for days to many weeks. You just pick the one that fits your needs, use case and budget from the options you have. That can be S2D, a SAN, a RAID controller, or even basic directly attached disks.

My take on things

Why do I like the new supported options? Well, because I have been testing them for backup targets, both highly available ones and non-highly available ones. I can have the benefits of ReFS that can be leveraged by backup software (Veeam Backup & Replication 9.5 for example) and have better performance and data protection with more types of storage than just S2D. I like to have options and choices when designing a solution.

It is important to note one thing when you do not use ReFS in combination with Storage Spaces (S2D, shared Storage Spaces or “stand alone” Storage Spaces) with any form of data redundancy (2-way or 3-way mirror, parity, mirror accelerated parity). You will not have the built-in capability to repair data corruption that can occur while data sits on disk (bit rot) by leveraging the redundant copies in Storage Spaces. That only comes when ReFS is combined with redundant Storage Spaces. Not with simple Storage Spaces or any other storage array, redundant or not. The combination of ReFS with Storage Spaces offers this capability and is one of its selling points.
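If you want to check whether integrity streams are enabled for a given file on a ReFS volume, the Storage module has a cmdlet for that. The drive letter R and the file path below are hypothetical placeholders. Keep in mind that this only shows the setting; whether corruption can actually be repaired still depends on redundant Storage Spaces underneath, as explained above.

```powershell
# Hypothetical ReFS volume and backup file, adjust both to your environment.
$driveLetter = 'R'
$backupFile  = 'R:\Backups\example.vbk'

# Confirm the volume is formatted with ReFS and check its health.
Get-Volume -DriveLetter $driveLetter |
    Format-Table -AutoSize -Property DriveLetter, FileSystem, HealthStatus

# Show whether integrity streams are enabled and enforced for the file.
Get-FileIntegrity -FileName $backupFile
```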

Other than that, the above ReFS storage deployment options let you leverage the benefits ReFS has to offer and yes, for some use cases that will be preferred over NTFS. But don’t think NTFS should now only be used for the OS and such. That’s not the case. It is and remains very much the dominant file system for Windows. It’s just that now we get to leverage the goodness of ReFS for suitable scenarios with a lot more storage deployment options. This has a reason. For example, if you are going to do Hyper-V with a SAN the supported file system is NTFS, not ReFS. Mind you, ReFS works but it’s not supported. I have tested this and while it works, one of the concerns is the redirected IO traffic this incurs. With S2D the network fabric to deal with this is there by design: SMB Direct (RDMA) over 10Gbps or better. With a SAN that’s not necessarily so and as a result the network leveraged by CSV traffic might take a beating. The network traffic behavioral patterns with ReFS on SAN based CSV are also different from what you are used to with NTFS when it comes to owner and non-owner nodes. While I can make things work, I must consider the benefits versus the risk of being unsupported. On a good SAN with ODX support that’s not worth the risk. Might this ever change? Maybe, but for now that’s it.

That said, when I design my ReFS LUNs and fabric well with a SAN and use them for a supported use case like backup targets, I am supported and I get to leverage the benefits of ReFS as it fits the use case very well (DPM, Veeam).

A side note on mirror accelerated parity

Mirror accelerated parity is only supported with S2D. In regards to backup and archive targets, that’s the one thing I want to keep testing (see Hyper-V Amigos Showcast Episode 12 – ReFS and Backup) and keep asking Microsoft to support, at least on non-shared Storage Spaces. I know shared Storage Spaces is being deprecated, no worries. That would make for some great, budget-friendly archival and backup targets due to the fact that you get bit rot protection from the combination of ReFS with redundant Storage Spaces. I even have some ideas on how to add tuning capabilities to the mirror / parity movement of data based on data age etc. I can dream, right?

Conclusion

To all the naysayers, the ones that bashed me when I discussed options for and the potential for ReFSv3 outside of S2D, take note, this is where we are today.


And I like it. I like the options ReFSv3 offers with a variety of storage solutions to design and implement backup targets for many different needs and budgets. That’s what I like, as I’m convinced that one-size-fits-all solutions are an illusion. Even at economies of scale and with commodity materials, understanding the context in which to design and implement a solution matters, as it allows you to choose the proper methods for the given needs when you genuinely understand the challenge.

If you need help with this, there are quite a number of highly skilled, experienced people with the right mindset to help you optimize your ROI and TCO in an effective and efficient way. Many of these are MVPs and have their own business or work for IT firms where customers are not milked like cattle but really do get high value services. Just reach out.

Does the DELL VRTX Support Storage Spaces anno 2018?

Someone asked on my blog if the DELL VRTX supported Storage Spaces. It’s 2018 and when I wrote about the VRTX it was mainly as a Cluster in a Box (CiB) solution. This is based on a shared SAS RAID controller. The addition of a second controller improved the redundancy (past the write-through requirement as we had in 2014), even though I would really like to see a native in-box redundant network solution here as well. Whether this is suitable for your needs is something only you can determine.

But as far as support for Microsoft Shared Storage Spaces or Storage Spaces goes, that isn’t there and I would advise against it. A storage controller configuration (pass-through) for the DELL Technologies VRTX series that supported any form of Storage Spaces never came. With 2 nodes and the VRTX supporting two storage controllers this would theoretically be possible, but with 3 or 4 nodes (the VRTX supports up to 4 nodes) that’s another challenge.

While I have liked the idea and even suggested it as a possible path, it has never materialized. If S2D, especially in combination with ReFSv3 or beyond, becomes immensely popular, they might consider it. But for now it’s not something I see happening, and they might very well choose other offerings to serve that demand anyway, ones with a better design for the separate pass-through capable storage controllers.

As a cluster in a box solution the VRTX does hold merit. As said, I’d love to see a few improvements made to make it fully redundant, all in-box. A ruggedized version for industrial or highly mobile environments could make an unbeatable offering.

DISCLAIMER: I don’t work for DELL, I don’t get paid by DELL, I don’t speak for DELL. This is my current independent opinion.