What is VMware vSphere Fault Tolerance and How Does it Work?


In the last VMware for Beginners post, we discussed vSphere Proactive HA. In this final topic of our look at High Availability in vSphere, we will learn about vSphere Fault Tolerance (FT).


vSphere Fault Tolerance is a big subject, and it would take more than two blog posts to cover every configuration, feature, and use case.

To keep it simple, these two blog posts focus only on what it is, how it works, and how to configure a simple FT-protected VM and test a failover.

What will we discuss in this vSphere Fault Tolerance post?

What is vSphere Fault Tolerance?
How does vSphere Fault Tolerance work?
vSphere Fault Tolerance restrictions
vSphere Fault Tolerance requirements

What is vSphere Fault Tolerance

vSphere Fault Tolerance is a feature of VMware’s vSphere virtualization platform that provides continuous availability for virtual machines (VMs). It creates a secondary copy, or “shadow instance,” of a running VM that is synchronized with the primary instance in real-time.

The secondary instance is kept in lockstep with the primary instance using a technology called vLockstep, which mirrors all of the actions taken on the primary VM to the secondary VM. If the primary VM fails for any reason, the secondary VM seamlessly takes over without disrupting the applications or services running on it.

This provides a higher level of availability than traditional failover solutions, which typically require some amount of downtime during the failover process. With vSphere Fault Tolerance, there is no need for manual intervention or restarts, ensuring that critical applications and services remain available to end-users at all times.

With vSphere Fault Tolerance, you keep a live replica of your virtual machine, and of the applications running in it, and get zero downtime with full high availability.

By eliminating even the smallest disruptions caused by server hardware failures, vSphere Fault Tolerance helps business-critical applications stay highly available. In the event of a server failure, VMware Fault Tolerance provides instantaneous, non-disruptive failover, protecting organizations from even the smallest interruption or data loss when downtime can cost thousands of dollars.

VMware Fault Tolerance also provides continuous availability for critical applications. When hardware fails, applications continue to run without interruptions, user disconnections, or data loss due to automatic failure detection and seamless failover. Even homegrown and custom applications can be protected by VMware Fault Tolerance, ensuring continuous availability.

How does vSphere Fault Tolerance work

vSphere Fault Tolerance works by creating and maintaining a synchronized copy of a running virtual machine (VM) on a secondary host. This secondary VM, or “shadow instance,” is kept continuously synchronized with the primary VM using a technology called vLockstep.

When vSphere Fault Tolerance is enabled for a VM, the primary VM and its shadow instance are kept on separate hosts in the vSphere cluster. The shadow instance is continuously synchronized with the primary VM in real-time, mirroring all of its CPU and memory operations.

If the primary VM fails because of a problem on its host, such as a hardware failure, the shadow instance takes over seamlessly, without any disruption to the applications or services running on it. This is because the shadow instance has the same state as the primary VM, including its CPU and memory contents, and can immediately continue processing from where the primary VM left off. Note that a crash inside the guest operating system is mirrored to the shadow instance as well, so Fault Tolerance protects against host-level failures rather than guest OS or application faults.

The takeover process is automatic and transparent to end-users, without manual intervention or restarts. Once the secondary VM takes over, it becomes the new primary VM, and a new shadow instance is created on another host in the cluster to ensure continuous availability.

vSphere Fault Tolerance provides a higher level of availability than traditional failover solutions, which may incur some downtime during the failover process. By keeping a continuously synchronized copy of the primary VM, vSphere Fault Tolerance ensures that critical applications and services remain available to end-users at all times.

The following image shows how vSphere Fault Tolerance works in your infrastructure.

The next image shows an example of a vSphere Fault Tolerance failover. When an ESXi host has a problem, or the Primary VM stops working, vSphere Fault Tolerance automatically brings the Secondary VM online, promotes it to Primary VM, and creates a new Secondary VM on the next available ESXi host.

The vSphere Fault Tolerance workflow then rebuilds the mirror, so the cluster again has a Primary VM and a fresh Secondary VM.
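The failover sequence described above can be sketched as a small toy model. This is only an illustration of the promotion and rebuild logic, not how vSphere implements it; the `FtPair` class, the `fail_over` function, and the host names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FtPair:
    """Toy model of an FT-protected VM: which host runs each copy."""
    primary_host: str
    secondary_host: str


def fail_over(pair: FtPair, failed_host: str, cluster_hosts: list[str]) -> FtPair:
    """Return the new placement after `failed_host` goes down.

    Hypothetical helper for illustration only; not a VMware API.
    """
    healthy = [h for h in cluster_hosts if h != failed_host]

    if failed_host == pair.primary_host:
        # The Secondary VM goes live and is promoted to Primary.
        new_primary = pair.secondary_host
    elif failed_host == pair.secondary_host:
        # The Primary keeps running; only the Secondary must be rebuilt.
        new_primary = pair.primary_host
    else:
        # The failure did not touch this FT pair.
        return pair

    # A new Secondary is created on the next available host,
    # which must differ from the host running the Primary.
    candidates = [h for h in healthy if h != new_primary]
    if not candidates:
        raise RuntimeError("no spare host available to rebuild the Secondary VM")
    return FtPair(primary_host=new_primary, secondary_host=candidates[0])


hosts = ["esxi-01", "esxi-02", "esxi-03"]
pair = FtPair(primary_host="esxi-01", secondary_host="esxi-02")
after = fail_over(pair, failed_host="esxi-01", cluster_hosts=hosts)
print(after)  # FtPair(primary_host='esxi-02', secondary_host='esxi-03')
```

In this sketch, a two-host cluster would raise an error after a failure, which mirrors why you want at least one spare host available to restore protection.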

vSphere Fault Tolerance restrictions

While vSphere Fault Tolerance provides a high level of availability for virtual machines (VMs), there are several restrictions that you should be aware of before implementing this feature:

Limited to 8 vCPUs: Fault Tolerance is limited to virtual machines with up to 2 vCPUs or up to 8 vCPUs, depending on the license. This means that if your VM has more vCPUs than your license allows, you’ll need to reduce the vCPU count to use Fault Tolerance
The maximum number of Fault Tolerant VMs allowed on a host in the cluster is 4. Both Primary VMs and Secondary VMs count towards this limit. However, you can use larger numbers if the workload performs well in FT VMs
To configure vSphere Fault Tolerance, your system must meet specific requirements. This includes having sufficient CPU resources, meeting virtual machine limits, and ensuring the correct licensing. When setting up vSphere Fault Tolerance, you should also consider other factors, such as the type of workloads, the size of the VMs, and the overall performance and scalability of the environment
Limited hardware compatibility: Fault Tolerance requires specific hardware configurations to function correctly. You should consult the VMware Compatibility Guide to ensure your hardware is compatible with this feature
Limited to certain types of VMs: Fault Tolerance is not available for all types of virtual machines. VMs with the following configurations or devices cannot be protected with Fault Tolerance:
- Virtual machines with more than 16 virtual disks, or with virtual disks larger than 2 TB
- Virtual machines with more than 128 GB of memory
- Virtual machines with more than 8 virtual CPUs (vCPUs)
- Virtual machines with physical RDM (Raw Device Mapping) disks, that is, RDMs in physical compatibility mode
- Virtual machines with CPU affinity configured
- Virtual machines with specific virtual devices, such as USB devices, parallel ports, and SATA controllers

The following table lists the vSphere Fault Tolerance maximum limits per license.

vSphere Fault Tolerance maximum limits

Limit	vSphere Standard	vSphere Enterprise Plus	vSphere+
vCPUs per FT VM	2	8	8
Virtual disks per FT VM	8	16	16
Maximum disk size	2 TB	2 TB	2 TB
RAM per FT VM	128 GB	128 GB	128 GB
FT VMs per host	4	4	4
FT vCPUs per host	8	8	8
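To make the per-VM limits easier to apply, here is a minimal sketch that checks a VM description against the values in the table above. The `VmSpec` class, the `ft_violations` function, and the edition labels are hypothetical names made up for this example; they are not part of any VMware tool or API.

```python
from dataclasses import dataclass

# Per-VM limits copied from the table above, keyed by a made-up edition label.
FT_LIMITS = {
    "standard":        {"vcpus": 2, "disks": 8,  "disk_tb": 2, "ram_gb": 128},
    "enterprise_plus": {"vcpus": 8, "disks": 16, "disk_tb": 2, "ram_gb": 128},
    "vsphere_plus":    {"vcpus": 8, "disks": 16, "disk_tb": 2, "ram_gb": 128},
}


@dataclass
class VmSpec:
    """Hypothetical description of a VM; not a vSphere object."""
    name: str
    vcpus: int
    ram_gb: int
    disk_sizes_tb: list[float]


def ft_violations(vm: VmSpec, edition: str) -> list[str]:
    """List every per-VM FT limit that `vm` exceeds under `edition`."""
    limits = FT_LIMITS[edition]
    problems = []
    if vm.vcpus > limits["vcpus"]:
        problems.append(f"{vm.vcpus} vCPUs exceeds the limit of {limits['vcpus']}")
    if vm.ram_gb > limits["ram_gb"]:
        problems.append(f"{vm.ram_gb} GB RAM exceeds the limit of {limits['ram_gb']} GB")
    if len(vm.disk_sizes_tb) > limits["disks"]:
        problems.append(f"{len(vm.disk_sizes_tb)} disks exceeds the limit of {limits['disks']}")
    for size in vm.disk_sizes_tb:
        if size > limits["disk_tb"]:
            problems.append(f"a {size} TB disk exceeds the limit of {limits['disk_tb']} TB")
    return problems


vm = VmSpec(name="db01", vcpus=4, ram_gb=64, disk_sizes_tb=[0.5, 1.0])
print(ft_violations(vm, "standard"))         # ['4 vCPUs exceeds the limit of 2']
print(ft_violations(vm, "enterprise_plus"))  # []
```

The same 4 vCPU VM fails the check under vSphere Standard but passes under Enterprise Plus, which is the licensing difference the table describes.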

vSphere Fault Tolerance Requirements

To use vSphere Fault Tolerance, you must ensure that your environment meets the following requirements:

Compatible hardware: Fault Tolerance requires specific hardware configurations to function correctly. You should consult the VMware Compatibility Guide to ensure your hardware is compatible with this feature
At the host level, the CPUs in the host machines must be compatible with vSphere vMotion and must also support Hardware MMU virtualization (Intel EPT or AMD RVI)
Virtual machine requirements: Fault Tolerance is limited to virtual machines with up to 2 vCPUs or up to 8 vCPUs, depending on the license
Network requirements: Fault Tolerance generates additional network traffic to keep the primary and secondary VMs in sync. You should ensure that your network infrastructure can handle this extra traffic and that the hosts involved are connected to the same network switch
Storage requirements: Fault Tolerance requires additional storage resources to store the shadow instance of the VM. You should ensure that you have enough storage capacity to accommodate the additional overhead
The hosts must have an FT-compatible storage device, such as a shared or replicated storage system, so that the FT-enabled VMs can be replicated across hosts
Host requirements: Fault Tolerance requires that the primary and secondary VMs be located on separate hosts in the vSphere cluster. Additionally, your hosts must be running in a vSphere HA cluster and have access to shared storage

The following CPUs are supported.

Intel Sandy Bridge or later. Avoton is not supported
AMD Bulldozer or later

Ensuring that your environment meets these requirements allows you to successfully implement vSphere Fault Tolerance and provide continuous availability for your virtual machines.
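As a rough illustration, the host-level requirements listed above can be turned into a simple pre-check. This is a sketch under stated assumptions: the `Host` class, its fields, and `ft_placement_problems` are hypothetical and only model the requirements as plain data; a real check would query vCenter instead.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical summary of an ESXi host; not a vSphere object."""
    name: str
    in_ha_cluster: bool
    has_hardware_mmu: bool  # Intel EPT or AMD RVI
    datastores: set[str] = field(default_factory=set)


def ft_placement_problems(primary: Host, secondary: Host, vm_datastore: str) -> list[str]:
    """Return the requirements from the list above that this host pair fails."""
    problems = []
    if primary.name == secondary.name:
        problems.append("primary and secondary must run on separate hosts")
    for host in (primary, secondary):
        if not host.in_ha_cluster:
            problems.append(f"{host.name} is not in a vSphere HA cluster")
        if not host.has_hardware_mmu:
            problems.append(
                f"{host.name} lacks hardware MMU virtualization (Intel EPT or AMD RVI)"
            )
        if vm_datastore not in host.datastores:
            problems.append(f"{host.name} cannot access datastore {vm_datastore}")
    return problems


a = Host("esxi-01", in_ha_cluster=True, has_hardware_mmu=True, datastores={"shared-ds"})
b = Host("esxi-02", in_ha_cluster=True, has_hardware_mmu=False, datastores={"shared-ds"})
print(ft_placement_problems(a, b, "shared-ds"))
```

An empty list means the pair passes these particular checks; it does not replace consulting the VMware Compatibility Guide for your hardware.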

With the vSphere Fault Tolerance requirements, we finish this first part of VMware for Beginners – vSphere Fault Tolerance.

VMware for Beginners – What is VMware vSphere Fault Tolerance and How Does it Work?: Part 14(a)