VCAP-DCA Study notes – 4.4 vCentre Server Heartbeat

If you work your way through the vCSHB Reference Guide you’ll have covered every objective in the VCAP-DCA blueprint, so that’s where I’d recommend you start. If you have time, view the VMworld sessions for a bit of background and reinforcement. I went into a bit more detail on this objective as it’s something I wanted to evaluate for my company, so there are some ‘real world’ issues covered which I doubt you’ll need for the exam.

Knowledge
  • Identify the five protection levels for vCenter Server Heartbeat
  • Identify the three server protection options for vCenter Server Heartbeat
  • Identify supported cloning options
Skills and Abilities
  • Install and configure vCenter Server Heartbeat
  • Determine use cases for and execute a manual switchover
  • Recover from a failover
  • Monitor vCenter Server Heartbeat and communication status
  • Configure heartbeat settings
  • Configure shutdown options
  • Configure application protection
  • Add/Edit Services
  • Add/Edit Tasks
  • Edit/Test Rules
  • Install/Edit Plug-ins
  • Add/Remove Inclusion/Exclusion Filters
  • Perform Full System and Full Registry checks
  • Configure/Test Alerts
  • Troubleshoot common vCenter Server Heartbeat error conditions
Tools & learning resources

Basics and architecture

vCenter Server Heartbeat (vCSHB) is a business continuity product which aims to increase availability for vCentre, increasingly a crucial piece of the infrastructure puzzle. You can download a 60 day evaluation copy from VMware to play with in your lab. Under the hood this is a customised version of Neverfail, an availability product that’s been around for years. It works by having two copies of vCentre with one active and one ‘passive’ and then monitoring (at various levels) to ensure the primary is working as expected.

Previously some people have run vCentre in a Microsoft cluster but this is no longer (was it ever?) supported (VMware KB1014414). If vCentre is virtual you can benefit from VMware HA but that only covers ESX host failures.

The three protection options
  • vCentre and a local SQL database
  • vCentre only (used when vCenter DB is on a separate server to vCentre)
  • SQL only (used when vCentre DB is on a separate server to vCentre)

NOTE: vCSHB doesn’t support Oracle databases. If the Oracle database resides on a separate server you can use vCSHB to protect vCentre and use Oracle resilience features for the DB.

    NOTE: vCSHB doesn’t support clustered SQL. If you’re using application resilience you have no need of vCSHB (you can use app features such as log shipping for DR).

    Aside from protecting vCentre it also works with the various products that integrate with vCentre (assuming they’re on the same server as vCentre);

  • SRM
  • Orchestrator
  • vCentre Linked Mode (all ADAM data is replicated)
  • Update Manager
Protection levels
  • Server – protects from hardware and guest OS failure
  • Application – monitors Windows services
  • Network – pings network locations (default gateway, DNS server and Global Catalog) every 10 seconds by default to determine isolation.
  • Performance – monitors various performance metrics for predefined thresholds
  • Data – can protect a local or remote SQL instance.
    NOTE: This doesn’t protect against data corruption – that’s still replicated (corrupted) to the mirror site but most replication technologies (except Oracle’s Dataguard) accept the same limitations.

Two modes;

  • High availability (LAN)
  • Disaster Recovery (WAN)
    • Integrates with DNS to handle the IP address change from active to passive. See VMware KB1008571 (Microsoft DNS) or VMware KB1008605 (BIND).
    • Uses compression and automatically optimises bandwidth (1MB minimum recommended for vCSHB)
    • May require static routes. VMware KB1023026
    • Requires a file filter exclusion when using Orchestrator. See vCSHB Reference Guide.
    • Updates ESX hosts (/etc/opt/vmware/vpxa/vpxa.cfg) to update the vCenter IP
Supported cloning operations

The process used to create the identical second vCentre server varies depending on your use of virtual or physical servers. In each case the server specs should match (CPU, memory, OS and service pack, network connections, server name, SID, domain membership etc). There are three possibilities;

  • V2V (both virtual). Use vCentre cloning to create the secondary server. Recommended to use separate ESX hosts and separate vSwitches for resilience (but neither are enforced).
  • P2V (Primary physical, secondary virtual). Use P2V tools such as VMware Converter.
  • P2P (both physical). Use the vCSHB installation routine to clone the physical server (uses NTBackup) or you can use third party tools such as Platespin etc. NOTE: Drive letters and ACPI compliance must match for this configuration.

Switchover = a managed transition from active to passive. Typically used when upgrading vCSHB (to newer hardware for example).

Failover = a mitigation action due to an unplanned outage.

Real world issues

(not relevant to the VCAP-DCA exam but worth noting)

  • Licensing vCSHB seems rather confusing. If vCentre is on a separate server to your vCentre DB then do you need a separate vCSHB licence to cover the database server? If the database server is only standby does it need to be fully licensed? Do you need a second vCentre licence plus vCSHB or does the vCSHB purchase cover it? These questions have been asked on the vCSHB forums but the answers seem to vary.
  • As vCSHB doesn’t keep everything on the two servers in sync, some operations (such as applying Windows patches, AV updates etc) have to be duplicated on both servers. This adds to the maintenance and complexity of running vCentre. See VMware KB1010803 for details of how to apply Microsoft patches.
  • What if your SQL server is on a separate server to vCentre, but hosts multiple databases? Does vCSHB protect them all? In WAN mode this could massively increase bandwidth requirements, for example, as well as causing licensing issues.
  • The vCSHB admin guide states that all protected applications should be installed before installing vCSHB. However that’s contradicted by guidance in VMware KB1014266 which states that SRM should be installed after vCSHB.
  • Services such as Update Manager and Orchestrator are only protected if they run on the same server as vCentre. For many enterprises (who I think are the target market for vCSHB given its cost) this is unlikely to be the case.
  • If you’re using vCentre Linked Mode and you want to either join or leave while protected by vCSHB you have to disable vCSHB protection, join/leave, then re-enable vCSHB protection. Full details in VMware KB1022869.

Installation and configuration

Preinstall steps

Before running the installation make sure you’ve got the following information;

  • What level of protection (vCentre only etc)? Dictated by existing vCentre architecture.
  • Where your second vCentre server will reside (LAN/WAN mode)
  • IP addresses for VMware Channel interfaces
  • Check the vCSHB QuickStart guide to ensure you match all prerequisites (disk space, network connections, etc). Ideally do these checks before cloning or you risk doing mitigation work twice (I didn’t have enough disk space and then had to increase a system disk on both the primary and clone. Doh!)

NOTE: Exclude vCSHB directories from file level AV scanning (see the vCSHB Reference guide, installation chapter). The exclusions should be made on both active and passive servers.

NOTE: The primary network card MUST be the first listed in the binding order (Network, Advanced Settings)

NOTE: The secondary server will have the same name, IP and DNS settings as the primary. This means if you bring it onto the network you’ll get IP conflicts, name conflicts and possibly DNS issues. The vCSHB Reference Guide (though NOT the QuickStart guide) advises the following steps (which worked for me) on the secondary server;

  • Disable (or disconnect) the primary network card
  • Set the IP address on the VMware Channel to something different to the VMware Channel IP used on the primary
  • Ensure ‘Register this connection with DNS’ is unchecked on the VMware Channel interface (otherwise your DNS entry for vCentre will be wrong)
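
As a sanity check on the addressing described above, here's a minimal Python sketch I might run against my planned settings before cloning. It isn't part of vCSHB; the dictionary keys (`public_ip`, `channel_ip`) and the example addresses are my own invention, and it assumes a LAN setup where the clone keeps the primary's public IP.

```python
import ipaddress


def check_channel_config(primary, secondary):
    """Return a list of problems found in a primary/secondary address plan.

    Each argument is a dict with the hypothetical keys 'public_ip' and
    'channel_ip', holding plain strings such as '192.168.50.1'.
    """
    problems = []
    # ip_address() raises ValueError on malformed input, which is useful here.
    p_pub = ipaddress.ip_address(primary["public_ip"])
    s_pub = ipaddress.ip_address(secondary["public_ip"])
    p_chan = ipaddress.ip_address(primary["channel_ip"])
    s_chan = ipaddress.ip_address(secondary["channel_ip"])

    # LAN assumption: the clone shares the primary's public address.
    if p_pub != s_pub:
        problems.append("public IPs differ; the clone should share the primary's IP")
    # The VMware Channel address must be unique on each server.
    if p_chan == s_chan:
        problems.append("channel IPs are identical; the secondary needs its own")
    if p_chan == p_pub or s_chan == s_pub:
        problems.append("channel IP reuses the public IP")
    return problems
```

An empty list means the plan passes these three checks; it says nothing about binding order or the DNS registration setting, which still need checking by hand.
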
Installation

Installation is largely a ‘next, next, Finish’ type install (and is covered step by step in the vCSHB Reference guide and the VMworld lab referenced at the bottom of this post). Overview;

  • Run install on primary vCentre server
  • Run again on the secondary (cloned) vCentre server
  • Run separately on the vCentre database server (if it’s separate from vCentre) and again on the vCentre database secondary server.

NOTE: For Windows 2008 there is an additional post-installation step to run the vCSHB Setup Completion program. There’s a shortcut on the desktop – double click and follow the prompts.

Default ports used:

  • 57348 – used for vCSHB heartbeat over VMware channel network
  • 52267 – client tools connect to this port (Manage Server icon)
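
If you suspect a firewall is getting in the way, a quick reachability test on these ports can save some head scratching. The sketch below is plain Python, nothing vCSHB-specific, and assumes both ports are TCP; the host address in the usage note is a made-up example.

```python
import socket

HEARTBEAT_PORT = 57348  # default vCSHB heartbeat port (VMware Channel)
CLIENT_PORT = 52267     # default port used by the Manage Server client


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `port_open("192.168.50.2", HEARTBEAT_PORT)` run from the primary would test the channel to a secondary at that (hypothetical) address.
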

There are two Windows services installed. Check these when diagnosing issues;

  • Neverfail SCOPE Data collector service (automatic)
  • Neverfail Server R2 (automatic)
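
To check a service from a script rather than the Services console, something like the following Python sketch would do. It wraps the standard Windows `sc query` command and pulls out the state. Note that `sc query` wants the service key name rather than the display names listed above; I haven't recorded the key names, so the function takes one as a parameter (`sc getkeyname "<display name>"` should reveal it).

```python
import subprocess


def parse_sc_state(sc_output):
    """Extract the state word (e.g. 'RUNNING') from `sc query` output."""
    for line in sc_output.splitlines():
        if line.strip().startswith("STATE"):
            # The line looks like: "STATE              : 4  RUNNING"
            return line.split(":", 1)[1].split()[1]
    return None


def service_state(service_name):
    """Run `sc query <service_name>` (Windows only) and return its state.

    service_name is the service key name, supplied by the caller.
    Returns None if no STATE line is found (e.g. unknown service).
    """
    result = subprocess.run(
        ["sc", "query", service_name], capture_output=True, text=True
    )
    return parse_sc_state(result.stdout)
```
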

A packet filter is installed and applied to all Primary network interfaces (but not the VMware Channel). You can check which interfaces the filter is active on by looking in Device Manager (show hidden devices). It will be installed on both the primary and secondary;

image

NOTE: The install guide states that the primary configuration data (which is copied to a share on the secondary) can amount to gigabytes of data. This is only the case when choosing a P2P configuration where the application data is also backed up. For a V2V setup the files are tiny – mine were 49KB.

Post installation

After installation the vCSHB servers are replicating but haven’t been given credentials to access vCentre. This causes an error about licensing (shown below). Go to the Applications tab, plug-ins, select and right click on the VirtualCentreNFPlugin.dll, choose Edit. Enter credentials with read access to vCentre.

image

Upgrading/uninstalling (not part of VCAP-DCA syllabus)

It’s possible to upgrade vCSHB without interruption; follow VMware KB1014435.

When uninstalling you simply run the uninstall routine on the primary and optionally on the secondary. You can delete the secondary if it’s a VM. One issue to watch for is that both servers have the same NetBIOS name – either remove one server from the network or use the uninstall routine’s option to rename one server. Full instructions can be found in VMware KB1022877.

Common Operations

There are three utilities for managing vCSHB;
  • Manage Server. This is the main console used for day to day administration. This is the only tool available if installed on a separate client. Most admin operations can be performed on the primary OR secondary (changes are replicated to the other server).
  • Configuration Wizard. Used when changing server roles, IP addresses or network interfaces. Not needed for day to day administration. Only available on the servers.
  • The tray utility. Provides ‘at a glance’ status along with right mouse button shortcuts. Only available on the servers.

Use the Server tab to complete the following tasks;

  • Monitor server, replication, and heartbeat status
  • Configure application startup and shutdown behaviour (Configure button)
  • Perform switchovers from active to passive and vice versa
  • Enable split brain avoidance (typically used with WAN setups). Monitoring tab, Configure.

Use the Network tab to complete the following tasks;

  • Configure network ping settings. Particularly useful in WAN deployments where the ping targets may be different at the second site.
    image
  • Configure auto switchover if the client network fails (this defaults to 10 pings but is off by default). You might do this if you want HA to protect vCentre, in which case vCentre would take longer than ten pings to reboot after an ESX host failure.
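
To make the 'ten pings' reasoning concrete, here's a rough Python model of a missed-ping threshold. It is only my illustration, not how vCSHB is implemented: I'm assuming the count is of consecutive misses and that the 10 second default ping interval applies, which would give roughly 100 seconds before a switchover triggers.

```python
class PingMonitor:
    """Track consecutive missed pings and flag when a threshold is reached."""

    def __init__(self, max_missed=10):
        self.max_missed = max_missed
        self.missed = 0

    def record(self, reply_received):
        """Record one ping result; return True if the threshold is reached."""
        if reply_received:
            self.missed = 0  # assumption: any reply resets the count
        else:
            self.missed += 1
        return self.missed >= self.max_missed


def seconds_until_trigger(max_missed=10, interval_s=10):
    """Approximate outage length tolerated before the threshold is hit."""
    return max_missed * interval_s
```

Comparing `seconds_until_trigger()` with how long your vCentre VM takes to restart under HA gives a feel for whether the two mechanisms would trip over each other.
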

Use the Applications tab to complete the following tasks;

  • Add/Edit/Remove protected applications. You choose to either monitor a service or manage it (which actively starts and stops the service) along with three failure actions to take;
    image
  • Add/Edit tasks. The available tasks are largely preconfigured and the user can only amend the timing. One example task is updating DNS when used in WAN mode. Another is Protected Service Discovery.
  • Add/Edit Rules. Defined by the active plugins and used to protect performance using predefined metrics. A user can only enable/disable individual checks or configure timeouts and actions.
    image
  • Add/Edit Plugins. Not much to do here. There is a System plugin, vCentre plugin, and SQL plugin (if protecting SQL) by default and editing only offers authentication for vCentre and the option to protect index catalogue files for SQL.

Use the Data tab to complete the following tasks;

  • Perform a full registry synchronisation – simply click the relevant button
  • Perform a full system synchronisation – simply click the relevant button
  • Add/Remove Filters. These let you include custom folders (a collection of PowerCLI scripts for instance) which are then replicated to the secondary server.
  • Check replication queue lengths. This should be done prior to a switchover (best practice, in Reference Guide) or to understand WAN bandwidth requirements.
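
For the WAN bandwidth point, a back-of-envelope calculation is often enough. This Python sketch estimates how long a given replication queue would take to drain over a link; the function name and the `compression_ratio` parameter are my own, and the real figure will depend on how well vCSHB's compression does with your data.

```python
def drain_time_seconds(queue_bytes, link_mbit_per_s, compression_ratio=1.0):
    """Estimate the seconds needed to send a replication queue over a link.

    queue_bytes       -- queue size as read from the Data tab, in bytes
    link_mbit_per_s   -- usable link speed in megabits per second
    compression_ratio -- e.g. 2.0 if the data shrinks to half its size
    """
    if link_mbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    bytes_per_s = link_mbit_per_s * 1_000_000 / 8
    return (queue_bytes / compression_ratio) / bytes_per_s
```

As an example, a 125 MB queue over a 10 Mbit/s link works out at 100 seconds with no compression, so a queue that regularly sits at that size suggests a switchover would not be instant.
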

Use the Logs tab to complete the following tasks;

  • Configure details of your SMTP server so email alerts are sent. Can also specify frequency and recipients based on alert level (red, yellow, green). Accepts mail servers which require authentication (unlike vCentre).
  • Set up alerting. This can be email alerts or custom commands. To test simply click the ‘Test Email Alerting’ button.
  • Review application logs
Recovering from a failover
  1. Check vCSHB log files to determine status of both servers.
  2. Identify cause of failover. Until all issues are resolved you should NOT try to restart vCSHB.
  3. Ensure the secondary server is now active and working correctly (use systray icon or Manage Console utility)
  4. Resolve issues with primary
  5. (Optional) Switchover so the primary is active again.
Split Brain and vCSHB

As with many high availability solutions, split brain situations can happen. This can occur due to loss of the VMware Channel, power loss, or possibly misconfiguration of vCSHB. Both servers assume they are the active (or passive) server. Data can be lost in this scenario.

There is a split brain ‘avoidance’ option for vCSHB which lets it use the ‘primary’ network interfaces to test connectivity even if the VMware Channel fails. Enable the option ‘Prevent failover if channel heartbeat is lost but Active server is still visible to other servers (recommended)’ under the Server:Monitoring tab. This requires an additional IP address to be configured on the primary network interfaces. See the vCSHB Reference Guide p118 for full details.

To recover vCSHB from a split brain scenario the recommendation is to identify which server is most up to date (by looking at file datestamps) and reconfigure vCSHB to reset the roles. The full procedure is in VMware KB1014405.
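
Checking datestamps by eye gets tedious on a large data directory, so here's a small Python sketch of the idea. It is not taken from the KB article: it simply walks a folder and reports the newest modification time, and which folders you point it at (and how you reach each server's copy) is up to you.

```python
import os


def newest_mtime(root):
    """Return the latest file modification time (epoch seconds) under root."""
    latest = 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                mtime = os.path.getmtime(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            latest = max(latest, mtime)
    return latest


def most_recent(paths_by_server):
    """Given {server_label: directory}, return the label with the newest file."""
    return max(paths_by_server, key=lambda label: newest_mtime(paths_by_server[label]))
```

If the two copies can't be reached from one machine (likely, given both servers share a name), run `newest_mtime` on each server separately and compare the results by hand.
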

Part three of Mike Laverick’s vCSHB series also specifically covers split brain. That article covers a known gotcha when using a remote vCentre database with vCSHB (covered by VMware KB1027289).

Troubleshooting

Check chapter 13 in the vCSHB Reference guide and Appendix A for a list of potential installation errors. Some useful VMware KB articles;

VMware KB1008391 – log entries which may appear in the Application Event logs (in vCSHB console)

VMware KB1008124 – Retrieving the VMware vCenter Server Heartbeat Logs and other useful information for support purposes.

VMware KB1008572 – Troubleshooting vCSHB synchronisation errors. This links to other useful troubleshooting articles.
