Objective 7.2 – Enable a Fault Tolerant Virtual Machine
Written by Matthijs van den Berg   
Tuesday, 10 November 2009 01:31

Knowledge


  • Identify FT restrictions
    There are ‘some’ restrictions to the use of VMware Fault Tolerance. For the ESX host system:
    • You can only apply it to VMs with one processor
    • You need a working cluster (HA and DRS) with shared storage, as an FT VM resides on only one datastore
    • You need a HV-Compatible CPU
    • You need exactly the same CPU on both ESX hosts
    • All hosts must use the same version of ESX
    • It does not protect you from a faulty storage array (or a full VMFS volume)
    • You need a dedicated NIC for FT logging (Gbit or better)
    • Advanced or higher license
    • 5% to 20% overhead!
    And for the VM you would like to apply FT to:
    • No thin provisioned disk (auto upgraded to thick disk)
    • No Storage VMotion
    • A small performance penalty (keeping the VM in lockstep)
    • No non-replayable devices (USB, Sound, Physical CD-ROM, etc.)
    • No Para Virtualisation enabled
    • No Para Virtualized SCSI (PVSCSI)
    • Will use its full memory reservation!
    • No VMXNET 3 NIC support
    • VM OS must be supported (operating system support may vary based on the processor your ESX host uses)
    • No support for Snapshots
    • No MSCS support (remove before enabling FT)
    • No support for CPU / Mem hot plug
    • Extended Page Tables (EPT)/Rapid Virtualization Indexing (RVI) is automatically disabled
    Read these and more limitations in a blog posting here.
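
The VM-side restrictions above lend themselves to a simple pre-flight checklist. The following is only a minimal sketch: `VmConfig` and `ft_blockers` are hypothetical names invented for this example, the values are filled in by hand, and nothing here talks to vCenter or any VMware API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VmConfig:
    """Hypothetical, hand-filled description of a VM (not a VMware API type)."""
    name: str
    vcpus: int = 1
    has_snapshots: bool = False
    paravirtualization: bool = False
    pvscsi: bool = False
    vmxnet3: bool = False
    mscs_member: bool = False
    cpu_mem_hot_plug: bool = False
    non_replayable_devices: List[str] = field(default_factory=list)


def ft_blockers(vm: VmConfig) -> List[str]:
    """Return the reasons (taken from the list above) why FT cannot be used as-is."""
    checks = [
        (vm.vcpus != 1, f"has {vm.vcpus} vCPUs, FT needs exactly one"),
        (vm.has_snapshots, "has snapshots"),
        (vm.paravirtualization, "has paravirtualization enabled"),
        (vm.pvscsi, "uses a PVSCSI adapter"),
        (vm.vmxnet3, "uses a VMXNET 3 NIC"),
        (vm.mscs_member, "is part of an MSCS cluster"),
        (vm.cpu_mem_hot_plug, "has CPU / memory hot plug enabled"),
    ]
    blockers = [reason for failed, reason in checks if failed]
    if vm.non_replayable_devices:
        devices = ", ".join(vm.non_replayable_devices)
        blockers.append(f"has non-replayable devices: {devices}")
    return blockers


# Example with made-up VMs
for vm in (VmConfig("web01"), VmConfig("db01", vcpus=2, non_replayable_devices=["USB"])):
    problems = ft_blockers(vm)
    print(vm.name, "OK" if not problems else problems)
```

Thin provisioned disks are left out of the check on purpose, since the list above notes they are upgraded to thick disks automatically.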
  • Evaluate FT use cases
    Despite the large number of restrictions, FT is a strong feature of ESX / vSphere to protect your VM against unplanned ESX host failure. But remember that this is the only thing this feature protects you from: failures of the ESX host itself, such as network, server or disk I/O failures. When the central storage or the guest OS fails, FT is not going to help you! A cluster like MSCS would be a better solution to counter a guest OS failure, but it is an expensive solution that cannot be used for all applications. It is exactly those applications that are best suited for FT, provided they fit within the restrictions FT has.
  • Set up an FT network
    An FT VM uses the vLockstep technique to keep the VMs on both ESX hosts in sync.
    FT generates two types of network traffic:
    • Migration traffic to create the secondary virtual machine
    • FT logging traffic
    Migration traffic happens over the NIC designated for VMotion and it causes network bandwidth usage to spike for a short time. Separate and dedicated NICs are recommended for FT logging traffic and VMotion traffic, especially when multiple FT virtual machines reside on the same host. Sharing the same NIC for both FT logging and VMotion can affect the performance of FT virtual machines whenever a secondary is created for another FT pair or a VMotion operation is performed for any other reason.

    VMware vSwitch networking allows you to send VMotion and FT traffic to separate NICs while also using them as redundant links for NIC failover.

    Adding multiple uplinks to the virtual switch does not automatically result in distribution of FT logging traffic. If there are multiple FT pairs, traffic can be distributed with an IP-hash based load balancing policy and by spreading the secondary virtual machines across different hosts. Remember that IP-hash based load balancing requires switch configuration.

    To calculate the required bandwidth use the formula:
FT logging bandwidth ~= [ (Average disk read throughput in Mbytes/sec * 8) + 
Average network receives (Mbits/sec) ] * 1.2
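
As a quick worked example of the formula above, here is a minimal Python sketch. The function name and the sample throughput figures are made up for illustration; substitute the averages you measure for your own VM.

```python
def ft_logging_bandwidth_mbits(avg_disk_read_mbytes_per_s,
                               avg_network_rx_mbits_per_s,
                               headroom=1.2):
    """Estimate FT logging bandwidth in Mbit/s using the formula above."""
    if avg_disk_read_mbytes_per_s < 0 or avg_network_rx_mbits_per_s < 0:
        raise ValueError("throughput values must be non-negative")
    # Disk reads are given in Mbytes/sec, so convert to Mbits/sec first.
    disk_read_mbits = avg_disk_read_mbytes_per_s * 8
    # The factor 1.2 from the formula adds 20% headroom on top.
    return (disk_read_mbits + avg_network_rx_mbits_per_s) * headroom


# Made-up example: 20 Mbytes/sec of disk reads, 100 Mbits/sec of network receives
estimate = ft_logging_bandwidth_mbits(20, 100)
print(f"Estimated FT logging bandwidth: {estimate:.0f} Mbit/s")
```

With these sample numbers the estimate is (20 * 8 + 100) * 1.2, roughly 312 Mbit/s, so a single Gbit NIC would have room for about three such VMs.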
  • To set up a network for FT logging
    • Select an ESX host
    • Click the “Configuration” tab
    • Select “Networking” from the right menu
    • Select an existing vSwitch with sufficient free bandwidth or create a new vSwitch with dedicated NICs
    • Click “Add” on the “Ports” tab
    • Select the “VMkernel” radio button
    • Enter a name and optionally a VLAN
    • Check the box “Use this port group for Fault Tolerance logging”
    • Finish the wizard
  • Verify requirements of operating environment
    There are some requirements and recommendations for the guest OS of an FT VM:
    • The guest OS must be supported by VMware for FT
    • Make sure that the guest OS uses an NTP server for time sync
    • FT does not perform well when used on a VM with a large amount of I/O
  • Enable FT for a virtual machine
    There are two types of FT operations that can be performed on a virtual machine: Turning FT on or off, and enabling or disabling FT. The performance implications of these operations are as follows:
    • “Turn On FT” prepares the virtual machine for FT.
      • Unsupported devices are removed
      • Ballooning is disabled
      • The swap file is deleted
      • Hardware MMU is disabled (shutdown required)
    • The “Enable FT” operation enables Fault Tolerance by live-migrating the virtual machine to another host to create a secondary virtual machine.
      When the “Turn On FT” operation succeeds for a virtual machine that is already powered on, it automatically creates a new secondary virtual machine, so it has the same effect as “Enable FT”. This uses a substantial amount of resources, so keep turning FT on and off to a minimum. When not enough resources are available, the process is terminated.
  • Test an FT configuration
    Thank you Henrique: There are two built-in methods to test FT. Right-click the VM and there'll be a Fault Tolerance option.
    • Test Failover: Primary and Secondary VMs switch roles
    • Test Restart Secondary: After restarting it, you can check its consistency compared to the original
  • Upgrade ESX hosts containing FT virtual machines
    This sounds easy but must be handled with care! VMware FT requires the two ESX hosts to have the exact same patch level! So when updating you need to make sure that either both hosts are at the same level or FT is disabled while they differ. VMware recommends the following two update scenarios:
    • Disable FT (longer downtime)
      The first scenario is easy: disable FT (do not turn it off, as that takes longer!), upgrade the hosts and enable FT again. Disabling only takes seconds; enabling can take several minutes.
    • Disable FT with multiple hosts (shorter downtime)
      The second scenario requires at least four hosts. In short:
      • Update the two hosts not in use by FT VMs and check that their patch levels are exactly the same
      • Disable FT (turning it off would take longer)
      • VMotion the FT machine to an updated ESX host
      • Enable FT. A replica is automatically created on the ESX host with the same patch level.
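
The "exactly the same patch level" check in the steps above can be sketched as a small grouping exercise. This is a minimal illustration only: the function names, host names and build numbers are made up, and the inventory is typed in by hand rather than read from vCenter.

```python
from collections import defaultdict


def hosts_by_patch_level(hosts):
    """Group host names by their (version, build) pair."""
    groups = defaultdict(list)
    for name, version, build in hosts:
        groups[(version, build)].append(name)
    return dict(groups)


def ft_capable_groups(hosts):
    """Keep only patch levels shared by at least two hosts (needed for an FT pair)."""
    return {level: names
            for level, names in hosts_by_patch_level(hosts).items()
            if len(names) >= 2}


# Made-up inventory in the middle of a rolling upgrade: (host, version, build)
inventory = [
    ("esx01", "4.0", 1000),
    ("esx02", "4.0", 1000),
    ("esx03", "4.0", 1001),
    ("esx04", "4.0", 1001),
]
for level, names in ft_capable_groups(inventory).items():
    print(level, "->", names)
```

A host that is the only one at its patch level drops out of the result, which is the situation in which FT cannot be enabled on it.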
