Dell Compellent SCOS 6.7 ODX Bug Heads Up

UPDATE 3: Bad and disappointing news. After update 2 we’ve seen Dell change the CSTA (CoPilot Services Technical Alert) on the customer website to “will be fixed” in a future version. Now, according to the latest comment on this blog post, that would be in Q1 2017. Basically this is unacceptable, and it’s a shame to see a SAN that was one of the best when it comes to Hyper-V support in Windows Server 2012 / 2012 R2 decline in this way. If 7.x is required for Windows Server 2016 support this is pretty bad, as it means early adopters are stuck or we’ll have to find and recommend another solution. This is not a good day for Dell storage.

UPDATE 2: As you can read in the comments below people are still having issues. Do NOT just update without checking everything.

UPDATE: This issue has been resolved in Storage Center 6.7.10 and 7.X

If you have 6.7.x below 6.7.10 it’s time to think about moving to 6.7.10!
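A quick way to flag arrays that sit in the affected range is a plain version comparison. This is a minimal sketch in Python that assumes nothing more than the boundary stated above (6.7.x below 6.7.10); the function names are made up for illustration, and given updates 2 and 3 a "not affected" answer is no substitute for checking with CoPilot and testing yourself.

```python
FIXED_IN_67 = (6, 7, 10)  # first 6.7 release reported as fixed in this post


def parse_version(text):
    """Turn a dotted version such as '6.7.5' into a tuple of ints."""
    parts = text.strip().split(".")
    if not all(p.isdigit() for p in parts):
        raise ValueError(f"not a dotted numeric version: {text!r}")
    return tuple(int(p) for p in parts)


def is_affected(version_text):
    """True for SCOS 6.7.x releases below 6.7.10 (hypothetical helper)."""
    v = parse_version(version_text)
    # Pad short versions so that '6.7' compares as 6.7.0.
    v = v + (0,) * (3 - len(v))
    return v[:2] == (6, 7) and v[:3] < FIXED_IN_67


for release in ("6.6.4", "6.7.5", "6.7.10", "7.1.2"):
    print(release, "affected" if is_affected(release) else "not in the affected range")
```

Comparing tuples of integers rather than strings matters here: as text, "6.7.10" sorts before "6.7.5".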

No vendor is exempt from errors, issues, mistakes and trouble with advanced features, and unfortunately Dell Compellent has issues with Windows Server 2012 (R2) ODX in the current release of SCOS 6.7. Bar a performance issue in a 6.4 version, they have had a very good track record with regard to ODX, UNMAP, … so far. But no matter how good you are, bad things can happen.

I’ve had two people who were bitten by it contact me. The issue is described below.

In SCOS 6.7 an issue has been determined when the ODX driver in Windows Server 2012 requests an Extended Copy between a source volume which is unknown to the Storage Center and a volume which is presented from the Storage Center. When this occurs the Storage Center does not respond with the correct ODX failure code. This results in the Windows Server 2012 not correctly recognizing that the source volume is unknown to the Storage Center. Without the failure code Windows will continually retry the same request which will fail. Due to the large number of failed requests, MPIO will mark the path as down. Performing ODX operations between Storage Center volumes will work and is not exposed to this issue.

You might think that this is not a problem because you only use Compellent storage, but think again. Local disks on the hosts where data is stored temporarily, and external storage you use to transport data in and out of your datacenter or to copy backups to, are all use cases we can encounter. When ODX is enabled, which it is by default on Windows Server 2012 (R2), the file system will try to use it and, when that fails, fall back to normal (non-ODX) operations. Normally all of this is transparent to the users. With this bug, however, MPIO will mark the Compellent path as down. Ouch. I will not risk that. Any IO between a non-Compellent LUN and a Compellent LUN might cause this to happen.

The only workaround for now is to disable ODX on all your hosts. To me that’s unacceptable and I will not be upgrading to 6.7 for now. We rely on ODX to gain performance benefits at both the physical and virtual layer. We even have our SMB 3 capable clients in the branch offices leverage ODX to avoid costly data copies to our clustered Transparent Failover File Servers.
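For reference, Microsoft documents the host-side ODX switch as the FilterSupportedFeaturesMode registry value under HKLM\SYSTEM\CurrentControlSet\Control\FileSystem, where 1 disables ODX and 0 enables it. That detail comes from Microsoft's ODX documentation, not from the Dell alert. The Python sketch below only builds the matching reg.exe command line; the helper name is hypothetical, and you would run the result elevated on each Windows host after testing it in your own environment.

```python
# Registry location documented by Microsoft for toggling ODX on a host.
ODX_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\FileSystem"
ODX_VALUE = "FilterSupportedFeaturesMode"


def odx_reg_command(disable=True):
    """Build the reg.exe argument list that disables (1) or enables (0) ODX.

    Hypothetical helper: it only constructs the command, it does not run it.
    """
    data = "1" if disable else "0"
    return ["reg", "add", ODX_KEY, "/v", ODX_VALUE,
            "/t", "REG_DWORD", "/d", data, "/f"]


# Show the command line that would disable ODX on a host.
print(" ".join(odx_reg_command(disable=True)))
```

Keeping this as a command builder makes it easy to review what will be pushed to every host, and to generate the matching re-enable command (disable=False) once a fixed SCOS release is in place.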

When a version arrives that fixes the issue, I’ll test it even more elaborately than before. We’ve come to watch for performance issues or data corruption with many vendors, models and releases, but this MPIO issue is a new one for me.

New KB Article 2494016 Related to Windows Server 2008 R2 SP1 Hyper-V: Stop error 0x0000007a When Using CSV in Redirected Access

Well, not a day after my blog post “Extra Info on Clustering & Hyper-V with Dynamic Memory When You Start With Windows Server 2008 R2 SP1” on important hotfixes for Hyper-V clustering with Windows Server 2008 R2 SP1, Microsoft released a new hotfix for the issue below. I’ll add it to that post to keep it up to date.

Stop error 0x0000007a occurs on a virtual machine that is running on a Windows Server 2008 R2-based failover cluster with a cluster shared volume, and the state of the CSV is switched to redirected access

The KB article with instructions on how to get the hot fix is here: http://support.microsoft.com/kb/2494016/en-us?sd=rss&spid=14134

The scenario is detailed as follows:

Consider the following scenario:

  • You enable the cluster shared volume (CSV) feature on a Windows Server 2008 R2-based failover cluster.
  • You create a virtual machine on the CSV on a cluster node.
  • You start the virtual machine on the cluster node.
  • You move the CSV owner to another cluster node, and you change the state of CSV to redirected access.
  • The connection that is used for redirected access is switched to another connection when one of the following scenarios occurs:
    • The cable for local area network (LAN) is disconnected.
    • The related network adapter is disabled.
    • The connection is switched by using Failover Cluster Manager.

In this scenario, you receive a Stop error message that resembles the following in the virtual machine:

STOP 0x0000007a ( parameter1 , parameter2 , parameter3 , parameter4 )
KERNEL_DATA_INPAGE_ERROR

Note

  • The parameters in this Stop error message vary, depending on the configuration of the computer.
  • Not all "0x0000007a" Stop error messages are caused by this issue.
  • You may also receive other Stop error messages when this issue occurs. For example, you may receive a "0x0000004F" Stop error message.
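If you collect Stop messages from virtual machines, a small filter can at least shortlist the ones worth comparing against this KB. The Python sketch below assumes the message looks like the sample above; the helper names are made up, and a match on 0x7a or 0x4F proves nothing on its own since, as noted, not all such Stop errors are caused by this issue.

```python
import re

# Matches e.g. "STOP 0x0000007a ( p1 , p2 , p3 , p4 )".
STOP_LINE = re.compile(r"STOP\s+(0x[0-9a-fA-F]+)\s*\(([^)]*)\)", re.IGNORECASE)

# Stop codes mentioned in the KB text quoted above.
CODES_NAMED_IN_KB = {0x7A, 0x4F}


def parse_stop_line(line):
    """Return (code, [parameters]) for a Stop message, or None if none is found."""
    match = STOP_LINE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [p.strip() for p in match.group(2).split(",") if p.strip()]
    return code, params


def worth_checking(line):
    """True when the Stop code is one of those named in the KB (a hint, not proof)."""
    parsed = parse_stop_line(line)
    return parsed is not None and parsed[0] in CODES_NAMED_IN_KB


sample = "STOP 0x0000007a ( parameter1 , parameter2 , parameter3 , parameter4 )"
print(parse_stop_line(sample), worth_checking(sample))
```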

Extra Info on Clustering & Hyper-V with Dynamic Memory When You Start With Windows Server 2008 R2 SP1:

Here’s a quick “heads up” if you’re starting to use or thinking about using Windows Server 2008 R2 SP1 for your Hyper-V clusters. The most common issues I’ve seen in the wild are:

  1. https://blog.workinghardinit.work/2011/04/01/kb2230887-hotfix-for-dynamic-memory-with-windows-2008-standard-web-edition-does-not-apply-to-without-hyper-v-editions/ This one is being worked on and the hotfix will be re-released to support the “Without Hyper-V” SKU of Windows Server 2008 SP2.  It’s a simple oversight but one that can be important when your Hyper-V clusters are filled with that SKU.
  2. We also got bitten by this one: Déjà vu Bug: The network connection of a running Hyper-V virtual machine may be lost under heavy outgoing network traffic on a computer that is running Windows Server 2008 R2 SP1. Luckily the hotfix was already available.
  3. And then there is one to heed and to read up on in the TechNet forum: Cluster Validation Bug In Windows 2008 R2 SP1 – Disk has a Persistent Reservation on it. They are also working on a fix. I’ve written a blog post on this and I suggest you read it and also take note of the discussion in the TechNet forum.

    UPDATE: The hotfix for issue 3 has become available today, April 26th 2011 as announced on the TechNet forum here:

    A hotfix is now available that addresses the Win2008 R2 service pack 1 issue with Validate on a 3+ node cluster. This is KB 2531907. The KB article and download link will be published shortly, in the mean time you can obtain this hotfix immediately free of charge by calling Microsoft support and referencing KB 2531907.   Update 27/05/2011 Here is the link: http://support.microsoft.com/kb/2531907/en-us?sd=rss&spid=14134

Another one that I haven’t seen in the wild is:

Windows Server 2008 R2 installation may hang if more than 64 logical processors are active. There is a workaround and a hotfix for this one.

Issue: When you try to install Windows Server 2008 R2 on a computer that has more than 64 logical processors, Windows Setup may stop responding in one of the following operations:

  • Initialization of Windows Setup
  • One of the two restarts that are required to complete Setup

Cause: This issue occurs because of an error in the Network Driver Interface Specification (NDIS) driver.
When a computer has more than 64 logical processors, the NDIS driver does not correctly handle some operations. Therefore, the computer encounters stop responding issues and other system failures.

I don’t have any nodes under my care that have more than 64 logical processors, so I guess that’s why I haven’t seen it. But with ever more cores available it’s bound to happen in the near future.

Update 2: To keep me busy, this KB article was released within 24 hours of me posting this blog. It covers a BSOD with CSV and redirected access, for which a hotfix is available:

Stop error 0x0000007a occurs on a virtual machine that is running on a Windows Server 2008 R2-based failover cluster with a cluster shared volume, and the state of the CSV is switched to redirected access

The KB article with instructions on how to get the hot fix is here: http://support.microsoft.com/kb/2494016/en-us?sd=rss&spid=14134

Cluster Validation Bug In Windows 2008 R2 SP1 – Disk has a Persistent Reservation on it

Pretty soon after the RTM release of Windows 2008 R2 SP1 we were discussing a bug on the TechNet forum (Hyper-V Cluster issues after applying Win2008 R2 SP1 on a 3 node Cluster!). If you have a Windows 2008 R2 SP1 cluster with more than 2 nodes you get the following warning:

List Potential Cluster Disks

Disk with identifier 2sef8cdf has a Persistent Reservation on it. The disk might be part of some other cluster. Removing the disk from validation set
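If you keep validation output as text, picking out the disks named in this warning is a one-regex job. The Python sketch below assumes the warning wording shown above and a report you have already exported to plain text yourself; the function name is hypothetical.

```python
import re

# Wording taken from the validation warning quoted above.
PR_WARNING = re.compile(
    r"Disk with identifier (\S+) has a Persistent Reservation on it",
    re.IGNORECASE,
)


def disks_with_reservation(report_text):
    """Return the disk identifiers named in Persistent Reservation warnings.

    Identifiers are returned in order of first appearance, without duplicates.
    """
    seen = []
    for match in PR_WARNING.finditer(report_text):
        identifier = match.group(1)
        if identifier not in seen:
            seen.append(identifier)
    return seen


report = (
    "List Potential Cluster Disks\n"
    "Disk with identifier 2sef8cdf has a Persistent Reservation on it. "
    "The disk might be part of some other cluster. "
    "Removing the disk from validation set\n"
)
print(disks_with_reservation(report))
```

A list like this only tells you which disks validation complained about. Whether the reservation is stale or a symptom of the bug is a separate question.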

“Normally” you would expect such a warning if the LUN ever belonged to another cluster and needs the old reservation cleared. To do that you would use the following command on the node that throws the warning (in this example the disk is disk 2 in Disk Management/diskpart), after making sure the disk is not in use anywhere else on the SAN:

"cluster node clusternode1 /clearpr:2"

However, this is not the cause here, nor was it for most others in this discussion. And I’m pretty sure no SAN software or MPIO software is putting a reservation on there either, so what is this? A bug? Well yes, it has been confirmed by Microsoft support that it is indeed a bug and that a fix will be made available by April 18th 2011.

This was not a show stopper bug, but it could be one if you needed to add a host to a cluster and confirm all is well and supported. However if you’re certain you’ve done everything right you can choose not to run cluster validation.

I will update this blog with more information when the fix becomes available.

UPDATE:  The hotfix has become available today, April 26th 2011 as announced on the TechNet forum here:

A hotfix is now available that addresses the Win2008 R2 service pack 1 issue with Validate on a 3+ node cluster.  This is KB 2531907.  The KB article and download link will be published shortly, in the mean time you can obtain this hotfix immediately free of charge by calling Microsoft support and referencing KB 2531907. Update 27/05/2011 Here is the link: http://support.microsoft.com/kb/2531907/en-us?sd=rss&spid=14134