Using vCenter Operations v5 – What’s new (2/3)

In part one of Using vCenter Operations I covered what the product does along with the different versions available and deployment considerations. In this post I’ll delve into what’s new and improved and in the final part I’ll cover capacity features, product pricing, and my overall conclusions. I had intended to cover the configuration management and application dependency features too but it’s such a big product I’ll have to write another blogpost or I’ll never finish!

Introductory learning materials

UPDATE APRIL 2012 – VMware have just launched 2.5 hrs of free training for vCOps.

Deep dive learning materials;

What’s new and improved in vCOps

Monitoring is a core feature and for some people the only one they’re concerned about. As the size of your infrastructure grows and becomes more complex the need for a tool to combine compute, network, and storage in real time also grows. Here are my key takeaways;

  • There’s a new dashboard screen which shows health (immediate issues), risks (upcoming issues) and efficiency (opportunity for improvements) in a single screen. The dashboard can provide a high level view of your infrastructure and works nicely on a plasma screen as your ‘traffic light’ view of the virtual world (and physical if you go with Enterprise+). The dashboard can also be targeted at the datacenter, cluster, host or VM level which I found very useful although you can only customise the dashboard in Enterprise versions. There is still the Operations view (the main view in vCOPS v1) which now also includes datastores. This view scales extremely well – even if you have thousands of VMs and datastores across multiple vCenters they can all be displayed on a single screen.
    NOTE: If you find some or all of your datastores show up as grey with no data (as mine did) there is a hotfix available via VMware support.
  • Email notifications are now included in all editions of vCOps. One of the main issues I had with the first release was the lack of notifications via email or any other mechanism. I was a bit dismayed when I read David Davis’ review which states that email alerting isn’t included in Standard Edition. The feature matrix does imply that ‘Alerting and Reporting’ are only in the Enterprise edition but I believe that’s referring to Capacity alerting and reporting. Several people on the forums were also uncertain but if in doubt go to the source – Kit Colbert confirmed to me directly! You can tailor the alerts based on category (workload, anomalies, faults, capacity, stress etc), criticality, and furthermore by object (vCenter, datacenter, cluster, host, VM). VMware KB2012021 covers configuring alerts. Despite this flexibility I found myself wanting more granularity – I tried configuring an email alert when network usage on any host exceeded the dynamic thresholds but you can only target a given category (capacity or stress in this case) and object (hosts) – you can’t refine that by attributes such as CPU, memory, I/O etc (or if you can I didn’t find it).
    NOTE: SNMP notifications are also included which will allow integration with other management solutions.
  • vCOps provides proactive Smart Alerts which highlight where you need to focus your attention, helping you avoid analysis paralysis. You can refine which metrics generate Smart Alerts (workload, anomalies, faults, capacity etc) – by default everything except ‘waste’ and ‘density’ is included. One issue I ran into quite quickly was that alerts can be hard to clear although improvements are planned for a future patch release. I found the alerts to be a big improvement – simply add these to your daily checklists to keep your infrastructure healthy. The alerts are easy to filter and combined with email notifications allow vCOps to be proactive in a way the first version didn’t.
  • vCOps v5 now offers availability monitoring. With v1 if a VM dropped off the network for example it wasn’t flagged. This is now much improved via anomalies and faults (which are included in most views and through Smart Alerts). Anomalies are self explanatory and simply represent a degree of change in your environment. If your anomaly count is higher than usual you’ll want to drill down into the reasons why, although it could simply be that usage is lower than expected (if you’re doing system maintenance and have powered down VMs for example). Faults are things like a vCenter service failing, a physical NIC failing, or maybe a FC path failure. Simply view the Alerts tab and filter it to show Faults. Many of these events can be monitored easily with vCenter alarms but it’s more comprehensive and integrated via vCOps (which monitors itself via Administrative alerts – Quis custodiet ipsos custodes!).
  • The dynamic thresholds are one of vCOps’ main attractions as they mean no user configuration is required, and they’re more accurate and able to respond to a rapidly changing infrastructure. It takes under a week to generate the initial dynamic thresholds, and up to three cycles to learn longer term patterns (3 months for monthly cycles etc). New to this version is the ability to set static or ‘hard’ thresholds (using the advanced interface) although this is generally not recommended.
  • The information is better presented and more customizable compared to the first release. The Analysis view for example can be customized to show almost any metric available to vSphere and reflect it using both size and colour (see later screenshot). This flexibility also applies to the All Metrics view. Read p57 onwards in the Advanced Getting Started guide to learn more about the available customizations.
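
To make the difference between static and dynamic thresholds a little more concrete, here's a minimal Python sketch. It is not how vCOps calculates its thresholds (the product's analytics are far more sophisticated and aren't described here); it simply assumes a 'normal' band of the mean plus or minus two standard deviations, learned from made-up CPU samples. The function names, the figures and the 90% hard limit are all illustrative assumptions.

```python
from statistics import mean, stdev


def dynamic_band(history, k=2.0):
    """Return (low, high) bounds learned from past samples: mean +/- k std devs."""
    centre = mean(history)
    spread = stdev(history)
    return centre - k * spread, centre + k * spread


def is_anomaly(value, history, k=2.0):
    """True when the value falls outside the band learned from history."""
    low, high = dynamic_band(history, k)
    return value < low or value > high


# Hypothetical CPU usage (%) for one host at the same hour on previous days.
history = [40, 42, 38, 41, 39, 40, 43]
STATIC_LIMIT = 90  # an assumed fixed 'hard' threshold

sample = 65
static_alert = sample > STATIC_LIMIT         # the hard threshold stays quiet
dynamic_alert = is_anomaly(sample, history)  # the learned band flags the change

print(f"static alert: {static_alert}, dynamic alert: {dynamic_alert}")
```

A reading of 65% is nowhere near the hard limit but is well outside what this host normally does, so only the learned band flags it. The same check also flags unusually low values, which matches the point above that an anomaly can simply mean usage is lower than expected.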

Despite the clever analytics you’ll still need a good understanding of vSphere to make the most of vCOps. Because we use NFS direct to the guest OS for our Oracle servers that storage traffic isn’t visible to vCOps under storage metrics (it bypasses the virtualisation layer) so I needed to drill into network statistics instead. I was also being told that a large percentage of my VMs were oversized and that I could reclaim terabytes of memory. I know I have oversized VMs because politics plays its part and I can’t always resize workloads – I don’t have time to prove to the application owner that I’m right (even with metrics to back me up) and even if I did I might still get overruled by a manager who just wants to avoid any risk, real or perceived. vCOps is only as good as the metrics it analyses and the old adage of ‘garbage in, garbage out’ still applies. I’ve not set reservations on my VMs and so vCOps assumes the memory could be freed (based on active memory). However I know that a large percentage of my VMs are running Java and Oracle workloads where the best practice is to avoid memory contention (have a reservation match the allocated memory) so even though I haven’t set memory reservations I can’t reclaim it. Of course that’s not a failing of vCOps but it does highlight that you need to understand the underlying concepts to interpret the information correctly.
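
The memory example can be shown with some simple arithmetic. The Python sketch below is a deliberately simplified illustration, not vCOps' actual oversizing calculation: it assumes 'reclaimable' memory is just allocated minus active, and uses a hypothetical needs_full_memory flag to mark workloads (Java, Oracle) that should keep their full allocation. The VM names and sizes are invented.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    allocated_mb: int
    active_mb: int
    needs_full_memory: bool = False  # hypothetical flag, e.g. Java or Oracle


def naive_reclaimable(vms):
    """Allocated minus active memory, summed over every VM."""
    return sum(max(vm.allocated_mb - vm.active_mb, 0) for vm in vms)


def adjusted_reclaimable(vms):
    """Same sum, but skipping VMs that should keep their full allocation."""
    return naive_reclaimable([vm for vm in vms if not vm.needs_full_memory])


# Invented inventory for illustration only.
vms = [
    VM("web01", allocated_mb=8192, active_mb=2048),
    VM("java01", allocated_mb=16384, active_mb=4096, needs_full_memory=True),
    VM("ora01", allocated_mb=32768, active_mb=8192, needs_full_memory=True),
]

print("naive MB:", naive_reclaimable(vms))
print("adjusted MB:", adjusted_reclaimable(vms))
```

With these made-up numbers the naive figure is 43008 MB but only 6144 MB is realistically reclaimable, which is the gap between what the tool reports and what you know about the workloads.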

The evaluation guide walks you through a few use cases for vCOps but I had a few of my own. Using NFS and 1Gb networking means there’s a potential bottleneck at the NIC level as a single NFS datastore will only ever use a single connection. We virtualise Oracle RAC databases for test/dev and I needed to quantify if the load they generated was hitting physical network limits. vCOps provides the ability to see both summary and detailed information about the issue;

  1. There are various high level views (Dashboard, Operations) which let me quickly check the dynamic threshold for networking. In most views it’s an aggregate across all (six in our case) network interfaces so I didn’t think it was dependable enough (another example where I needed to understand my infrastructure and vSphere to fully understand what vCOps was telling me). The ‘Planning’ view shows things like average network read/write per host, but again it’s an aggregate of all interfaces.
  2. The ‘Analysis’ view can show ‘Host network contention sized by network usage grouped by datacenter/cluster’ by default, but this view is again an aggregate of all network interfaces. To resolve this I created a custom view showing Demand% and Usage (kbps) for the vmnic2 interface. Although this only shows current usage (I can’t set a time period) I now can check vmnic2 utilization across all my hosts in an easy to understand visual format (see screenshot). Of course this will only work if you have a consistent configuration across your datacenter – if every host is configured differently (different number of pNICs for example) you’ve probably got bigger problems!
    [Screenshot: A custom ‘Analysis’ view]
  3. I could also drill down to the exact amount of traffic sent and received per vmNIC over various time periods using the Operations -> All Metrics view. I was able to see these side by side for easy comparison and export if required.
    NOTE: like many metrics in vCOPS the ‘workload’ metric on a network interface isn’t a ‘simple’ measure (throughput, bandwidth etc) on a given link but an intelligent analysis of throughput, demand, CPU, memory and other factors and how they combine to impact network usage.
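
Why the aggregate views weren't dependable enough is easy to show with numbers. The Python sketch below uses invented per-interface Usage (kbps) figures for a six-NIC host and assumes every link is 1Gb; the function names and the 80% limit are illustrative assumptions and have nothing to do with how vCOps derives its own 'workload' metric.

```python
LINK_CAPACITY_KBPS = 1_000_000  # assumes 1Gb links

# Invented usage figures; vmnic2 stands in for the interface carrying NFS.
usage_kbps = {
    "vmnic0": 50_000,
    "vmnic1": 40_000,
    "vmnic2": 950_000,
    "vmnic3": 30_000,
    "vmnic4": 20_000,
    "vmnic5": 10_000,
}


def aggregate_utilisation(usage, capacity):
    """Percentage of total capacity in use across all interfaces."""
    return 100 * sum(usage.values()) / (capacity * len(usage))


def per_nic_utilisation(usage, capacity):
    """Percentage utilisation of each individual interface."""
    return {nic: 100 * kbps / capacity for nic, kbps in usage.items()}


def saturated(usage, capacity, limit_pct=80):
    """Interfaces at or above the given utilisation limit."""
    per_nic = per_nic_utilisation(usage, capacity)
    return [nic for nic, pct in per_nic.items() if pct >= limit_pct]


print(f"aggregate: {aggregate_utilisation(usage_kbps, LINK_CAPACITY_KBPS):.1f}%")
print("saturated:", saturated(usage_kbps, LINK_CAPACITY_KBPS))
```

In this made-up case the host as a whole looks barely used at around 18%, while vmnic2 is running at 95% of its link, which is exactly the kind of bottleneck a per-interface view exposes and an aggregate hides.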

No product is perfect and there are a few things which I’d like to see added;

  • There is no way to do any monitoring, capacity analysis or reporting based on the logical VM folders defined in the ‘VMs and Templates’ view in the VI client. Maybe you just want to report on capacity for a given department, or maybe you have mixed clusters with both prod and development VMs? Either way as of today this can’t be done.
  • I’d still like to see context-aware options in the VI client such as the ability to right click a host and get straight to the vCOps dashboards. I’m sure this isn’t simple as it’ll depend which view you want – the overall dashboard? Operations details? Capacity information? As it is you have to go to the vCOps page and navigate from there. This also applies the other way – when viewing a fault alert in vCOps for example it would be nice to have a shortcut back to the host/VM in question so you can verify the details. This was an issue in the original release and there’s no change here.
  • No error logs are available through the GUI; you’ll need to SSH onto the individual VMs and check the /var/log files directly.
  • You can’t export the new dashboard views. Granted you can often drill down into details and export those but it would be nice to be able to show the swish graphics without resorting to screen grabs!
  • As a feature request how about letting users share their efficiency ratings, in a similar fashion to vBenchmark? Every vCOps dashboard screenshot I’ve ever seen shows a largely inefficient infrastructure (a large slice of blue in the Efficiency piechart) and while the evaluation guide says this is to be expected for lab/test environments what about production? Enquiring minds must know!

You can read the final part of this series which will cover the capacity features along with pricing information and my conclusions.

Further Reading

vCOPS in 5 mins – an amusing overview from VMware! (marketing)

VMware’s vCOPS Evaluation Center

VMware Forums for vCOps

vCOps learning resources (Gregg Robertson)

5 Minutes with vCenter Operations Manager 5 (Bob Plankers)

vKernel’s vOps moves to counter vCenter Operations

VMworld 2011 sessions (login required);

  • CIM2452 – VMware vCenter Operations DeepDive (Kit Colbert). NOTE: This covers v1 not v5 but the fundamentals still apply.
  • CIM2285 – Automated Infrastructure and Operations Management with VMware vCenter Operations

The VMware Communities roundtable #178 about Capacity Management with vCOPS (56 mins)

The VMware Communities roundtable #172 about vCOPS (with Kit Colbert, Matt Cowger and Bob Plankers) (56 mins)

The VMware Communities roundtable #166 about vCOPS and Infrastructure Navigator (58 mins)

Some sample dashboard views courtesy of @3cVguy

Using vCOPS with vCloud Director

Twitter people;

vCenterOps
