Objective 8.2 – Perform Basic Troubleshooting for VMware FT and Third-Party Clusters
Written by Matthijs van den Berg   
Monday, 07 December 2009 22:02

Knowledge

  • Analyze and evaluate VM population for maintenance mode considerations
    As explained in chapter 7.2, take special care before placing a host in maintenance mode when FT is in use. Because an FT VM needs two active hosts instead of one, make sure that at least two hosts that support FT, with exactly the same configuration and version number, remain available. If you cannot meet these criteria, disable FT first (remember that disabling is faster than turning it off).
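    The check described above can be sketched in a few lines of Python. This is only an illustration: the `Host` record, its fields and the `can_enter_maintenance` function are hypothetical names, and the inventory has to be filled in by hand or from your own tooling. "Same configuration and version number" is reduced here to comparing one build string.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    # Hypothetical inventory record, not a VMware API object.
    name: str
    build: str            # version/build string as you recorded it
    ft_capable: bool
    in_maintenance: bool = False


def can_enter_maintenance(hosts, candidate):
    """Return True if FT can keep running once `candidate` is in maintenance mode.

    An FT VM needs two active hosts, so at least two FT-capable hosts
    with the same build string must remain available.
    """
    remaining = [
        h for h in hosts
        if h.name != candidate and h.ft_capable and not h.in_maintenance
    ]
    builds = Counter(h.build for h in remaining)
    return any(count >= 2 for count in builds.values())
```

    If the function returns False for the host you want to service, that is the situation in which you disable FT before entering maintenance mode.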
  • Understand manual Third-Party failover/failback processes
    When you use other fail-over techniques, VMware provides nothing more than hardware virtualization (assuming that FT is not used). There are many third-party techniques that provide fail-over at the guest (VM) level. Every tool has its own requirements and fail-over procedures.

    A small piece of extra info:
    Perhaps this question is about failing over ESX hosts from one site to another. You can create scripts that search the replicated LUNs and import the VMs found into your vCenter / vSphere environment. This involves the following manual actions:
    • Replicate the LUNs that hold the VMs
    • Make sure there is enough capacity to start all VMs on one site
    • Create a shell script to search LUNs / VMFS volumes and import the VMs
    • Create a script to start the VMs in a particular order with pauses between the start-up.
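    The last two steps can be sketched as follows. The list above suggests a shell script; Python is used here only to keep the example short. The default path `/vmfs/volumes`, the function names and the `start_vm` callback are assumptions for illustration. The code that actually registers and powers on a VM depends on your environment and is deliberately left out.

```python
import os
import time


def find_vmx_files(root="/vmfs/volumes"):
    """Walk the mounted volumes and return every .vmx path found, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".vmx"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def start_in_order(groups, start_vm, pause_seconds=60, sleep=time.sleep):
    """Start VMs group by group, pausing between groups.

    groups        list of lists of .vmx paths, in the desired boot order
    start_vm      callable that registers and powers on one VM
                  (site-specific, hypothetical, not provided here)
    pause_seconds pause between two groups
    """
    started = []
    for index, group in enumerate(groups):
        for vmx in group:
            start_vm(vmx)
            started.append(vmx)
        # No pause is needed after the final group.
        if index < len(groups) - 1:
            sleep(pause_seconds)
    return started
```

    A typical order would put infrastructure VMs (for example domain controllers and databases) in the first group and application servers in later groups. Passing `sleep` as a parameter makes it possible to dry-run the order without waiting.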
  • Troubleshoot Fault Tolerance partial or unexpected fail-overs
    You can use the information provided in the appendix Fault Tolerance Error Messages to help you troubleshoot Fault Tolerance. The topic contains a list of error messages that you might encounter when you attempt to use the feature and, where applicable, advice on how to resolve each error.
    • Storage Related Errors
      When you lose storage, partially lose it, or have slow storage, your VM might experience errors. To resolve those errors, first solve the underlying storage problem.
    • Network related errors
      When the logging NIC is not functioning, the FT VM might fail over. To solve this, dedicate separate NICs to FT logging and vMotion.
    • Network related errors – bandwidth issues
      When there is not enough bandwidth to send all transactions to the fail-over hosts, your FT VMs fail. To solve this, reduce the number of FT VMs, increase the bandwidth or, the easiest solution, distribute the FT VMs and their copies more evenly over all hosts.
    • vMotion failure
      When you vMotion a busy VM, the vMotion might fail, which might cause an FT VM to fail over. To avoid this, VMware recommends that you vMotion VMs only when they are not that busy… so night work ;-)
    • File system – too much IO
      When a file system handles too much IO, an FT VM might fail over. Because the failed-over VM uses the same storage array, the fail-over does not help; the only solution is to reduce the number of IOs on the VMFS volume. To see whether a VMFS volume is overutilized, check for warnings about SCSI reservations in the VMkernel log.
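      Checking the VMkernel log for SCSI reservation warnings can be scripted. The sketch below is a minimal example under two assumptions that you should verify on your own hosts: the log is readable at `/var/log/vmkernel`, and the relevant lines contain the word "reservation". The function names are made up for this example.

```python
import re

# Assumption: reservation warnings contain the word "reservation".
RESERVATION_PATTERN = re.compile(r"reservation", re.IGNORECASE)


def reservation_warnings(lines):
    """Return the lines that mention a SCSI reservation, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if RESERVATION_PATTERN.search(line)]


def scan_log(path="/var/log/vmkernel"):
    """Read a log file and return its reservation-related lines."""
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with open(path, errors="replace") as handle:
        return reservation_warnings(handle)
```

      A long list of hits for the same volume is a hint that the volume handles too much IO and that some VMs should move to another VMFS volume.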

Tools

 
