Resiliency is one of the words most often used to describe a virtualization cluster that has its high availability and distributed resource features enabled. The underlying Windows Server Failover Cluster technology that hosts the Hyper-V role in a highly available configuration gives administrators built-in resiliency for their Hyper-V environments.

Windows Server 2016 adds new features that bolster the resilience of Windows Server Failover Clusters, specifically as it relates to Hyper-V. These features allow the cluster to better handle failures other than outright hardware failures, including what Microsoft calls transient failures. Two of the new features, VM Compute Resiliency and VM Storage Resiliency, improve the handling of transient failures in compute and storage respectively.

  • What are transient failures?
  • How do the new resiliency features weigh into handling transient failures?

Let’s take a look at Windows Server 2016 VM Compute and Storage Resiliency and how these new features bolster Hyper-V resiliency.

What are Transient Failures?

Microsoft has noted that transient failures are more common than outright hard failures in Hyper-V environments backed by Windows Server Failover Clusters. In addition, administrators or automatic cluster processes that react aggressively to these transient failures can cause more downtime than they prevent. Transient failures generally occur in intra-cluster communication in the compute cluster.

Microsoft outlines three common transient scenarios for which the new VM Compute Resiliency feature is well suited:

  • Node disconnected – A node in the Windows Server Failover Cluster attempts to connect to all active nodes and fails to connect to any node in the active cluster membership
  • Cluster Service crash – The Cluster Service on a particular Windows Server Failover Cluster node is down or otherwise not responding. The node is not communicating with any other node in the cluster
  • Asymmetric disconnect – A node in the Windows Server Failover Cluster attempts to connect to all the active nodes in the cluster. It can talk to at least one node, but not all active nodes

If the disconnects above are transient, lasting only a short period of time, reacting drastically to them may result in more downtime than necessary. For this reason, Microsoft has introduced three new states in Windows Server 2016 that reflect a new workflow for handling cluster failures as they relate to Hyper-V virtual machines.

New Windows Server 2016 Hyper-V Failover Clustering States and Workflow

Microsoft has introduced three new Failover Clustering states that are designed to better handle these transient failures as they relate to Windows Server 2016 Hyper-V virtual machines. They include:

  • Unmonitored – This is a Hyper-V virtual machine that is no longer being monitored by the cluster service
  • Isolated – The cluster node is no longer an active cluster member, but it is hosting active Hyper-V virtual machines
  • Quarantined – In this state, the cluster node is no longer allowed to join the cluster
    • The default period of time for quarantine is 2 hours
    • This keeps a node from flapping and negatively impacting the Hyper-V cluster
    • A node is quarantined if it ungracefully leaves the cluster three times within an hour

By utilizing the new Hyper-V virtual machine states in Windows Server 2016, Microsoft has designed a new workflow for VM resiliency that allows the compute cluster to handle these transient failures. The workflow proceeds in the following order once a transient failure has been detected:

  1. The Hyper-V node is placed in an Isolated state and removed from active cluster membership
  2. The Hyper-V virtual machines on that node are shown in an Unmonitored state
  3. VM storage is affected in the following way:
    • Virtual machines on SMB storage remain online and accessible
    • Virtual machines backed by block storage are affected, since an isolated node is no longer allowed to access any Cluster Shared Volumes. These virtual machines are placed in a Paused Critical state
  4. If the isolated node is still experiencing the transient failure when the 4-minute window expires, its virtual machines are failed over to a healthy cluster node and the node is placed in a Down state
  5. If a node is isolated three times within an hour, it is placed in a Quarantined state for 2 hours by default and, again, its virtual machines are failed over to a healthy cluster node
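While this workflow plays out, it can help to watch which state each node is in. The fragment below is a minimal sketch rather than an official procedure. It assumes the FailoverClusters module is installed and that node objects expose a StatusInformation property reporting Isolated or Quarantined, as they appear to on Windows Server 2016 and later.

```powershell
# Assumption: run on a cluster node or a management host with the
# Failover Clustering PowerShell tools installed.
Import-Module FailoverClusters

# List every node with its membership state (Up, Down, Paused) and,
# assuming StatusInformation is available, its resiliency status.
Get-ClusterNode |
    Format-Table -Property Name, State, StatusInformation -AutoSize
```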

The difference from Windows Server 2012 R2 is that in 2012 R2, after 10 missed heartbeats (at the default interval), the node is removed from the cluster and its VMs are failed over, with no 4-minute grace period before action is taken. As explained above, Windows Server 2016 adds the four-minute interval and the rest of the workflow to allow a more measured approach to handling the transient failures that may occur from time to time in intra-cluster communication.

VM Compute Resiliency can be configured quickly and easily with PowerShell, using the following cluster properties and cmdlet:

  • (Get-Cluster).ResiliencyLevel = <value>, where <value> can equal either 1 or 2. When set to “1”, a node is allowed into the Isolated state only if it reported a known reason for the disconnect (such as a Cluster Service crash or an asymmetric disconnect); otherwise the node is failed immediately and its VMs are failed over. When set to “2”, which is the default in Windows Server 2016, the Hyper-V node always goes into an Isolated state for a period of time before VMs are failed over
  • (Get-Cluster).ResiliencyDefaultPeriod = <value>, where <value> is 240 by default. This represents the 240-second grace period during which the node can return to a healthy state. Setting this to 0 reverts to Windows Server 2012 R2 functionality
  • (Get-Cluster).QuarantineThreshold = <value>, where <value> is the number of failures within an hour before a node is quarantined (3 by default)
  • (Get-Cluster).QuarantineDuration = <value>, where <value> is the duration, in seconds, for which the node is not allowed to rejoin the cluster (7200 by default, which is 2 hours)
  • Start-ClusterNode -ClearQuarantine – Allows manually clearing the quarantine condition on a quarantined Hyper-V cluster node
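Put together, the settings above might be reviewed and applied as in the following sketch. It is a configuration fragment, not a recommendation: the values simply restate the defaults described in this article, and the node name HV-NODE2 is hypothetical.

```powershell
# Assumption: run with cluster administrator rights on a cluster node
# or a management host that has the FailoverClusters module.
Import-Module FailoverClusters

# Review the current compute resiliency settings before changing anything.
Get-Cluster |
    Format-List -Property ResiliencyLevel, ResiliencyDefaultPeriod,
                          QuarantineThreshold, QuarantineDuration

# Apply settings cluster-wide (illustrative values equal to the defaults).
$cluster = Get-Cluster
$cluster.ResiliencyLevel         = 2     # always isolate before failing over
$cluster.ResiliencyDefaultPeriod = 240   # grace period in seconds
$cluster.QuarantineThreshold     = 3     # failures per hour before quarantine
$cluster.QuarantineDuration      = 7200  # quarantine length in seconds

# Once the underlying problem is fixed, release a quarantined node early.
# 'HV-NODE2' is a hypothetical node name.
Start-ClusterNode -Name 'HV-NODE2' -ClearQuarantine
```

Reading the values first makes it easy to record the original configuration, which is useful if a change needs to be rolled back.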

Hyper-V Virtual Machine Storage Resiliency

Another critical component of dealing with transient failures with intra-cluster communication is virtual machine storage resiliency. As mentioned with the VM compute resiliency workflow, storage is adversely affected, especially when utilizing block storage attached as Cluster Shared Volumes to back Hyper-V virtual machines. However, any other issue that results in a virtual machine losing connectivity to storage can certainly be transient in nature.

Virtual Machine Storage Resiliency proactively detects storage failures and reacts to reduce the impact on the Hyper-V virtual machine and minimize the chance of corruption. When a failed read or write to a virtual hard disk is detected, the Hyper-V virtual machine is placed in a critical pause state.

In the critical pause state, the virtual machine is paused, or “frozen”, so no additional I/O operations are issued. When the storage is responsive again, the virtual machine is placed back in the running state and resumes reading and writing. The advantage of the critical pause state is that it captures and retains the exact session state of the virtual machine. For short transient storage disconnects or periods of unresponsiveness in particular, the VM resumes its sessions with clients once the issue has passed, so the impact on clients is minimal. If the configured timeout period elapses and the VM is still unable to connect to storage, it is powered off.

Hyper-V VM Storage Resiliency Workflow (image courtesy of Microsoft)

Virtual Machine Storage Resiliency is enabled by default; however, it is configurable with PowerShell:

  • Set-VM -AutomaticCriticalErrorAction <value>, where <value> is either None or Pause. The default value here is Pause
  • Set-VM -AutomaticCriticalErrorActionTimeout <value>, where <value> is in minutes. The default is 30 minutes
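As a brief illustration of these two parameters, the fragment below inspects a VM's storage resiliency settings and then lengthens the timeout. It is a sketch under stated assumptions: VM01 is a hypothetical VM name, the 60-minute timeout is an arbitrary example, and the commands are assumed to run on the Hyper-V host that owns the VM.

```powershell
# 'VM01' is a hypothetical virtual machine name.
$vmName = 'VM01'

# Show the current storage resiliency settings for the VM.
Get-VM -Name $vmName |
    Format-List -Property Name, AutomaticCriticalErrorAction,
                          AutomaticCriticalErrorActionTimeout

# Keep the critical pause behavior, but wait up to 60 minutes
# (example value) for storage to return before the VM is powered off.
Set-VM -Name $vmName -AutomaticCriticalErrorAction Pause `
       -AutomaticCriticalErrorActionTimeout 60
```

A longer timeout trades a longer possible pause for a lower chance of the VM being powered off during an extended storage outage.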

Concluding Thoughts

The new features in Windows Server 2016 that bolster virtual machine compute and storage resiliency make the default handling of transient failures much less disruptive than in previous Windows Server versions. With built-in stages of remediation for transient failures, Windows Server 2016 Failover Clusters hosting Hyper-V are much better equipped to deal with these disconnect scenarios, whether they affect the cluster service, storage, or both. As with all configuration areas of Windows Server Hyper-V, PowerShell makes it simple to change the default configuration with a few one-liners. Hyper-V administrators who take advantage of the new VM Compute and Storage Resiliency features in Windows Server 2016, along with the native Hyper-V high availability features, have a powerful platform for hosting highly available and resilient business-critical virtual machines.
