Category Archives: Netapp

InTechWeTrust episode 32 – Netapp, containers, AWS and more

I recently attended the tech.unplugged event in London (my thoughts on it are here) and the London VMUG the following day, and was in the right place at the right time to take part in the InTechWeTrust podcast, episode 32. For those not familiar with this podcast, it’s run by a prominent team of bloggers with a background in enterprise infrastructure and has been going since last September. You can listen to the podcast directly via the player below or via your usual choice of subscription (iTunes etc) – just head on over to the InTechWeTrust website for all the links.

Make sure you listen to the last 15 mins with VMware’s EMEA CTO Joe Baguley – very interesting.

InTechWeTrust Episode 32 – Containers, Project Photon/Lightwave, AWS, Netapp, CoHo Data + more!

I’d like to use this blogpost to follow up on some of the topics discussed and my contributions.

...on ‘containers’. Sometimes I find myself speaking on a topic on which I’m by no means an expert – I try to avoid it as I’m all about facts and impartiality (as far as that’s possible), and I’m a believer that your reputation is sacrosanct (not just in the blogosphere), but you can’t learn without getting out of your comfort zone. I’m not a developer. I have limited knowledge and minimal hands-on experience of containers. I have an understanding of where they fit into an overall architecture, who’s getting value from them, and at least an inkling of their potential, but I’m clearly no expert. My comments about Docker building a platform (with an implied degree of vendor lock-in) vs Rocket’s ‘more open’ ambitions largely came from reading this blogpost from Rocket, this great Reddit thread discussing what it means, plus a good summary from GigaOm. Clearly this still needs to play out – the stakes are high and it’s going to be an interesting ride!
If anyone can point me to other resources with more information I’d be very grateful!

…on Photon/Lightwave. This was discussed with Joe Baguley after I’d left the podcast but the interesting soundbites for me were ‘a new direction for VMware’, the fact that containers are seen to be the boundary between VMware and Pivotal (hence why Photon/Lightwave are VMware yet Lattice is Pivotal), and the idea that containers may become embedded in vSphere itself. Interesting times!

…on Netapp. There’s been a recurring discussion about Netapp over the last few episodes and a good LinkedIn discussion. I was a Netapp user for over five years (and I’ve written quite a few Netapp blogposts) and while I’ve not kept an eye on their latest releases I’ve always felt they weren’t vocal enough in the social media space, especially since Vaughn Stewart jumped ship to Pure Storage. This has improved with Nick Howell’s useful DatacenterDude blog and podcast but I still don’t see enough innovation. Flash, tiering, and scale-out have all been addressed but never in a convincing way – the gravity of the core ONTAP OS seems all-consuming. This would seem to be borne out by their upcoming layoffs. Again, happy to be educated otherwise!

…on AWS finances. They’re now available – plenty of articles to digest. As predicted it made the mainstream BBC news, Simon Wardley waded in, and there’s a good Business Insider article with a great quote;

Amazon? The online bookstore that turned into a kind of Best Buy/Wal-Mart online? A giant of enterprise computing? No way.

 

Netapp ONTAP 8.2 and SnapManager compatibility

Summary: Running SnapDrive or SnapManager on Windows 2003? You might have some decisions to make…

Netapp recently announced that ONTAP 8.2 will bring with it a new licensing model which impacts the SnapDrive and SnapManager suites. Unfortunately this could have a significant impact on companies currently using those products, so you need to be familiar with the changes. KB7010074 (NOW access required) clearly states that current versions (when running on Windows) don’t work with ONTAP 8.2;

Because of changes in the licensing infrastructure in Data ONTAP 8.2, the license-list-info ZAPI call used by the current versions of SnapDrive for Windows and the SnapManager products is no longer supported in Data ONTAP 8.2. As a result, the current releases of these products will not work with Data ONTAP 8.2.

 The SnapManager products mentioned below do not support ONTAP 8.2.

  • SnapDrive for Windows 6.x and below
  • SnapManager® for Exchange 6.x and below
  • Single Mailbox Recovery® 6.x and below
  • SnapManager® for SQL 6.x and below
  • SnapManager® for SharePoint 7.x and below
  • SnapManager® for Hyper-V 1.x
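If you want to see what’s actually licensed before planning an upgrade, you can list it from the controllers themselves. A minimal check from a workstation with SSH access – filer1 and cluster1 are placeholder hostnames, license is the 7-Mode command and system license show its clustered ONTAP 8.2 equivalent;

# ssh root@filer1 license
# ssh admin@cluster1 system license show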

Unfortunately there is no workaround – we need to wait for future versions of SnapManager and SnapDrive, due sometime in 2013 (according to the KB article), before we get ONTAP 8.2 compatibility. I’ve no major issue with this situation as ONTAP 8.2 was only released a few days ago for Cluster-Mode and isn’t even released yet for 7-Mode customers.

If you’re using Windows 2003 with any of the above products, however, this could be a big deal. SnapDrive 6.5 (the latest as of June 2013) only supports Windows 2008 and newer so it’s a reasonable assumption that the newer releases will have similar requirements. Until now you could still use SnapDrive 6.4 if you needed backwards compatibility with older versions of Windows – I suspect Windows 2003 is still plentiful in many enterprises (mine included). Now, though, you have a hard choice – either upgrade the relevant Windows 2003 servers, stop using the Snap products, or accept that you can’t upgrade ONTAP to the 8.2 release.

Personally I have a bunch of physical clusters, all running Windows 2003 and hosting mission-critical SQL databases; if these dependencies don’t change I’ll have to accelerate a project to upgrade them all in the next year or so, something that currently has no budget. Software dependencies aren’t unique to Netapp, nor are Netapp really at fault – upgrading software is part of infrastructure sustainability, and Windows 2003 is ten years old.

Lesson for the day: Running old software brings with it a risk.

Netapp OnCommand System Manager 2.1 available

A quick post to say that Netapp have released v2.1 of their Windows MMC management tool, OnCommand System Manager (the download link is at the bottom right, NOW account required). This new update brings the usual incremental fixes along with support for Flash Pools, Infinite Volumes (a feature of ONTAP 8.1.1 in Cluster-Mode), and multidisk carrier shelves. It’s also moved to a 64-bit architecture – my ‘upgrade’ simply uninstalled the 32-bit version and installed the 64-bit one.

For compatibility the release notes state;

  • Data ONTAP 7.3.x (starting from 7.3.7)
  • Data ONTAP 8.0 or later in the 8.0 release family operating in 7-Mode
  • Data ONTAP 8.1 or later in the 8.1 release family operating in 7-Mode
  • Data ONTAP 8.1 or later in the 8.1 release family operating in Cluster-Mode

However, checking the Netapp compatibility matrix shows that this release is ‘officially’ supported on a smaller number of ONTAP releases, notably ONTAP 7.3.7 or newer (excluding 7.3.4 etc) and 8.0.3 or newer (excluding 8.0.1, 8.0.2 etc). I suspected this was simply timing and that once the new release had been around for longer it would be validated against more ONTAP releases. However I tried it against a few of my filers running a mixture of 8.0.1P2 and 8.0.2P6 and found one issue straightaway: the new network checker wouldn’t run against the 8.0.1P2 controllers as apparently they don’t support the necessary API calls.

If you’re running some of these older ONTAP releases proceed with caution!
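If you’re not sure exactly which release a controller is running before pointing System Manager at it, the version command is the quickest check (filer1 is a placeholder hostname);

# ssh root@filer1 version

The output should look something like ‘NetApp Release 8.0.2P6 7-Mode’, which you can compare against the matrix.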

I’ve also noticed that there is now a System Manager build which runs natively on Mac OS X, although it’s not officially supported – I wonder how many people will use it at their own risk?

Netapp and vSphere5 storage integration

Let your storage array do the heavy lifting with VAAI!

I’ve seen a few blogposts recently about storage features in vSphere5 and plenty of forum discussions about the level of support from various vendors but none that specifically address the Netapp world. As some of these features require your vendor to provide plugins and integration I’m going to cover the Netapp offerings and point out what works today and what’s promised for the future.

Many of the vSphere5 storage features work regardless of your underlying storage array, including StorageDRS, storage clusters, VMFS5 enhancements (provided you have block protocols) and the VMware vSphere Storage Appliance (VSA). The following vSphere features, however, are dependent on array integration;

  • VAAI (the VMware Storage API for Array Integration). If you need a refresher on VAAI and what’s new in vSphere v5 check out these great blogposts by Dave Henry: part one covers block protocols (FC and iSCSI), part two covers NFS. The inimitable Chad Sakac from EMC also has a great post on the new vSphere5 primitives. There’s also a quick way to check VAAI status from the ESXi shell – see below.
  • VASA (the VMware Storage API for Storage Awareness). Introduced in vSphere5, this allows your storage array to send underlying implementation details of the datastore back to the ESXi host, such as RAID level, replication, dedupe, compression, number of spindles etc. These details can be used by other features such as Storage Profiles and StorageDRS to make more informed decisions.
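You can see what your hosts make of all this from the ESXi shell. On ESXi 5.x the esxcli namespace reports per-device support for the VAAI primitives (ATS, Clone, Zero, Delete), which is a handy sanity check once everything is installed;

# esxcli storage core device vaai status get

Devices on arrays without the relevant support (or without the right plugin in place) will simply show each primitive as unsupported.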

The main point of administration (and integration) when using Netapp storage is the Virtual Storage Console (VSC), a vCenter plugin created by Netapp. If you haven’t already got it installed (the latest version is v4, released March 16th 2012) then go download it (NOW account required). As well as the vCenter plugin you must ensure your version of ONTAP also supports the vSphere functionality – as of April 19th 2012 the latest release is ONTAP 8.1. You can find out more about its featureset from Netapp’s Nick Howell. As well as the core vSphere storage features, the VSC enables some extras.

These features are all covered in Netapp’s popular TR3749 (best practices for vSphere, now updated for vSphere5) and the VSC release notes.

Poor old NFS – no VAAI for you…

It all sounds great! You’ve upgraded to vSphere5 (with Enterprise or Enterprise Plus licensing), installed the VSC vCenter plugin and upgraded ONTAP to the shiny new 8.1 release. Your Netapp arrays are in place and churning out 1s and 0s at a blinding rate, and you’re looking forward to giving vSphere some time off for good behaviour and letting your Netapp do the heavy lifting…
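As the heading hints, NFS is the catch: ESXi only performs VAAI offloads to NFS datastores once a vendor-supplied NAS plugin is installed on each host. Whatever Netapp’s plugin is (or ends up being) called, you can check whether one is present by listing the installed VIBs;

# esxcli software vib list | grep -i netapp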


Preventing Oracle RAC node evictions during a Netapp failover

While undertaking some scheduled maintenance on our Netapp shared storage (due to an NVRAM issue) we discovered that some of our Oracle applications didn’t handle the controller outage as gracefully as we expected. In particular several Oracle RAC nodes in our dev and test environments rebooted during the Netapp downtime. Strangely this only affected our virtual Oracle RAC nodes so our initial diagnosis focused on the virtual infrastructure.

Upon further investigation, however, we discovered that there are timeouts in the Oracle RAC clusterware settings which can result in node reboots (referred to as evictions) to preserve data integrity. This affects both Oracle 10g and 11g RAC database servers, although the fix for both is similar. NOTE: We’d been running Oracle 10g for a few years but hadn’t had similar problems previously as the 10g default timeout of 60 seconds is higher than the 30-second default for 11g.

Both Netapp and Oracle publish guidance on this issue;

The above guidance focuses on the DiskTimeOut parameter (known as the voting disk timeout) as this is impacted if the voting disk resides on a Netapp. What it doesn’t cover is when the underlying Linux OS also resides on the affected Netapp, as it can with a virtual Oracle server (assuming you want HA/DRS). In this case there is a second timeout value, misscount, which is shorter than the disk timeout (typically 30 seconds instead of 200). If a node can’t reach any of the other RAC nodes within the misscount timeframe it will start split-brain resolution and probably evict itself from the cluster by rebooting. When the Netapp failed over, our VMs were freezing for longer than 30 seconds, causing the reboots. After we increased the network timeout we were able to fail over our Netapps with no impact on the virtual RAC servers.

NOTE: A cluster failover (CFO) is not the only event which can trigger this behaviour. Anything which impacts the availability of the filesystem such as I/O failures (faulty cables, failed FC switches etc) or delays (multipathing changes) can have a similar impact. Changing the timeout parameters can impact the availability of your RAC cluster as increasing the value results in a longer period before the other RAC cluster nodes react to a node failure.

Configuring the clusterware network timeouts

The changes need to be applied within the Oracle application stack rather than at the Netapp or VMware layer. On the RAC database server, check the cssd.log logfile to understand the cause of the node eviction. If you think it’s due to a timeout you can change it using the command below;

# $GRID_HOME/bin/crsctl set css misscount 180 

To check the new settings has been applied;

# $GRID_HOME/bin/crsctl get css misscount

The clusterware needs a restart for the new values to take effect, so bounce the cluster;

# $GRID_HOME/bin/crs_stop -all
# $GRID_HOME/bin/crs_start -all
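The voting disk timeout mentioned earlier is managed the same way. A sketch using 200 seconds, the value the guidance above commonly recommends – check the Oracle notes below for the right figure for your version;

# $GRID_HOME/bin/crsctl get css disktimeout
# $GRID_HOME/bin/crsctl set css disktimeout 200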

Further Reading

Netapp Best Practice Guidelines for Oracle Database 11g (Netapp TR3633). Section 4.7 in particular is relevant.

Netapp for Oracle database (Netapp Verified Architecture)

Oracle 10gR2 RAC: Setting up Oracle Cluster Synchronization Services with NetApp Storage for High Availability (Netapp TR3555).

How long it takes for Standard active/active cluster to failover

Node evictions in RAC environment

Troubleshooting broken clusterware

Oracle support docs (login required);

  • Note 284752.1 – 10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout
  • Note 559365.1 – Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware node evictions
  • Note 265769.1 – Troubleshooting 10g and 11.1 Clusterware Reboots
  • Note 783456.1 – CRS Diagnostic Data Gathering: A Summary of Common tools and their Usage