Archive for the ‘Netapp’ Category

Preventing Oracle RAC node evictions during a Netapp failover

January 23rd, 2012

While undertaking some scheduled maintenance on our Netapp shared storage (due to an NVRAM issue) we discovered that some of our Oracle applications didn’t handle the controller outage as gracefully as we expected. In particular several Oracle RAC nodes in our dev and test environments rebooted during the Netapp downtime. Strangely this only affected our virtual Oracle RAC nodes so our initial diagnosis focused on the virtual infrastructure.

Upon further investigation, however, we discovered that there are timeouts in the Oracle RAC clusterware settings which can result in node reboots (referred to as evictions) to preserve data integrity. This affects both Oracle 10g and 11g RAC database servers, although the fix for both is similar. NOTE: We’ve been running Oracle 10g for a few years but hadn’t had similar problems previously, as its default timeout value of 60 seconds is higher than the 30 second default for 11g.

Both Netapp and Oracle publish guidance on this issue; see Further Reading below.

This guidance focuses on the DiskTimeOut parameter (known as the voting disk timeout), as this is impacted if the voting disk resides on a Netapp. What it doesn’t cover is when the underlying Linux OS also resides on the affected Netapp, as it can with a virtual Oracle server (assuming you want HA/DRS). In this case there is a second timeout value, misscount, which is shorter than the disk timeout (typically 30 seconds instead of 200). If a node can’t reach any of the other RAC nodes within misscount seconds it will start split-brain resolution and probably evict itself from the cluster by rebooting. When the Netapp failed over, our VMs were freezing for longer than 30 seconds, causing the reboots. After we increased the network timeout we were able to fail over our Netapps with no impact on the virtual RAC servers.
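To make the interaction between the two timeouts concrete, here is a minimal shell sketch of the decision described above. It is a deliberately simplified model, not how the clusterware is actually implemented: the function name eviction_risk is made up for illustration, and the 45 second outage is an assumed figure (all we know is that our freezes exceeded 30 seconds).

```shell
#!/bin/sh
# Simplified model of the two CSS timeouts (illustrative only).
eviction_risk() {
    outage=$1          # seconds the storage was unavailable
    misscount=$2       # CSS network heartbeat timeout, in seconds
    disktimeout=$3     # CSS voting disk timeout, in seconds
    os_on_storage=$4   # "yes" if the node's OS disk lives on the same storage

    if [ "$os_on_storage" = "yes" ] && [ "$outage" -ge "$misscount" ]; then
        # the whole node freezes, so its network heartbeats stop as well
        echo "evict: node frozen for ${outage}s, misscount is ${misscount}s"
    elif [ "$outage" -ge "$disktimeout" ]; then
        echo "evict: voting disk unreachable for ${outage}s, disktimeout is ${disktimeout}s"
    else
        echo "ok: ${outage}s outage is within the timeouts"
    fi
}

eviction_risk 45 30 200 no     # physical node, only the voting disk is on the Netapp
eviction_risk 45 30 200 yes    # virtual node with the 11g default misscount
eviction_risk 45 180 200 yes   # virtual node after raising misscount
```

Only the second call reports an eviction, which matches what we saw: the physical nodes rode out the failover while the virtual ones rebooted until misscount was raised.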

NOTE: A cluster failover (CFO) is not the only event which can trigger this behaviour. Anything which impacts the availability of the filesystem such as I/O failures (faulty cables, failed FC switches etc) or delays (multipathing changes) can have a similar impact. Changing the timeout parameters can impact the availability of your RAC cluster as increasing the value results in a longer period before the other RAC cluster nodes react to a node failure.

Configuring the clusterware network timeouts

The changes need to be applied within the Oracle application stack rather than at the Netapp or VMware layer. On the RAC database server check the cssd logfile (ocssd.log) to understand the cause of the node eviction. If you think it’s due to a timeout you can raise the misscount value using the command below;

# $GRID_HOME/bin/crsctl set css misscount 180 

To check the new setting has been applied;

# $GRID_HOME/bin/crsctl get css misscount

The clusterware needs a restart for the new value to take effect, so bounce the cluster;

# $GRID_HOME/bin/crs_stop -all
# $GRID_HOME/bin/crs_start -all
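
The steps in this section can be collected into a small dry-run helper that prints the commands for review rather than executing them. This is only a sketch: plan_misscount_change is a hypothetical name, the /u01/app/grid path in the usage line is an assumed example of $GRID_HOME, and the commands it prints are the ones from this post.

```shell
#!/bin/sh
# Print (do not run) the commands needed to change misscount and bounce the cluster.
plan_misscount_change() {
    grid_home=$1
    seconds=$2
    # reject empty, non-numeric or zero values
    case $seconds in
        ''|*[!0-9]*|0)
            echo "misscount must be a positive whole number of seconds" >&2
            return 1
            ;;
    esac
    printf '%s\n' \
        "$grid_home/bin/crsctl set css misscount $seconds" \
        "$grid_home/bin/crsctl get css misscount" \
        "$grid_home/bin/crs_stop -all" \
        "$grid_home/bin/crs_start -all"
}

# assumed example path; substitute your own $GRID_HOME
plan_misscount_change /u01/app/grid 180
```

Printing the plan first gives you something to paste into a change request before touching the cluster.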

Further Reading

Netapp Best Practice Guidelines for Oracle Database 11g (Netapp TR3633). Section 4.7 in particular is relevant.

Netapp for Oracle database (Netapp Verified Architecture)

Oracle 10gR2 RAC: Setting up Oracle Cluster Synchronization Services with NetApp Storage for High Availability (Netapp TR3555).

How long it takes for Standard active/active cluster to failover

Node evictions in RAC environment

Troubleshooting broken clusterware

Oracle support docs (login required);

  • NOTE:284752.1 – 10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout
  • NOTE:559365.1 – Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
  • NOTE:265769.1 – Troubleshooting 10g and 11.1 Clusterware Reboots
  • NOTE:783456.1 – CRS Diagnostic Data Gathering: A Summary of Common tools and their Usage

Netapp daily checks – available inodes/maxfiles

December 30th, 2011

Prior to buying Netapp Operations Manager we used to run lots of daily checks to ensure the uptime and health of our Netapp controllers. Many of these checks were written using the Data ONTAP Powershell Toolkit so I thought I’d post them up in case they’re of use to anyone else.

First up is a function to check inode usage against the ‘maxfiles’ value (the maximum number of inodes available in a volume). This is typically a large number (often in the millions) and is based on the volume size, but we had an Oracle process which dumped huge numbers of tiny files on a regular basis, consuming all the available inodes. This article only covers checking for these occurrences – if you need a fix I’d suggest checking out Netapp’s advice or this discussion for possible solutions.

Simply add the function (below) to your Powershell profile (or maybe build a module) and then a Powershell one-liner can be used to check;

connect-NaController yourcontroller | get-NaMaxfiles -Percent 30

This will give you output like this;

Controller : Netapp01
Name       : test_vol01
FilesUsed  : 268947
FilesTotal : 778230
%FilesUsed : 35

Controller : Netapp01
Name       : test_vol02
FilesUsed  : 678111
FilesTotal : 1369688
%FilesUsed : 50
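
For reference, the %FilesUsed figures are simply FilesUsed divided by FilesTotal, rounded to a whole number. Here is a quick shell check of that arithmetic against the sample output; pct_used is a hypothetical helper name, and awk stands in for the PowerShell maths.

```shell
#!/bin/sh
# Rounded percentage of inodes used; assumes FilesTotal is greater than zero.
pct_used() {
    # $1 = FilesUsed, $2 = FilesTotal
    awk -v used="$1" -v total="$2" 'BEGIN { printf "%.0f\n", used / total * 100 }'
}

pct_used 268947 778230    # test_vol01, prints 35
pct_used 678111 1369688   # test_vol02, prints 50
```

Note that test_vol02 works out at roughly 49.5% before rounding, so it is displayed as 50 but would not be returned with the default threshold of 50, because the function filters on the unrounded value.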

And here’s the function;

function Get-NaMaxfiles {
<#
.SYNOPSIS
 Find volumes where the percentage of maxfiles (inodes) used is greater than a specified threshold (default 50%).
.DESCRIPTION
 Queries every volume on a controller and returns those where FilesUsed, as a percentage of FilesTotal, is greater than the specified threshold (default 50%).
.PARAMETER Controller
 NetApp Controller to query (defaults to current controller if not specified).
.PARAMETER Percent
 Filters the results to volumes where the % of files used is greater than the number specified. Defaults to 50% if not specified.
.EXAMPLE
 connect-NaController zcgprsan1n1 | get-NaMaxfiles -Percent 30

 Get all volumes on filer zcgprsan1n1 where the number of files used is greater than 30% of the max available
#>
    [CmdletBinding()]
    Param(
        [Parameter(Mandatory=$false, ValueFromPipeline=$true)]
        [NetApp.Ontapi.Filer.NaController]
        $Controller=($CurrentNaController)
        ,
        [Parameter(Mandatory=$false)]
        [int]
        $Percent=50
    )
    Begin {
        # nothing to set up here; a piped controller is only bound once Process runs
    }
    Process {
        $exception = $null
        try {
            # start with an empty $vols in the local scope
            $vols = $null
            # skip volumes reporting a FilesTotal of 0 to avoid dividing by zero
            $vols = Get-NaVol -Controller $Controller -ErrorAction "Stop" | Where-Object {$_.FilesTotal -gt 0 -and ($_.FilesUsed/$_.FilesTotal)*100 -gt $Percent}
            # check that at least one volume on this controller matched
            if ($vols -ne $null) {
                foreach ($vol in $vols) {
                    # calculate the percentage of files used and add it (plus the controller name) to the volume object
                    $filesPercent = [int](($vol.FilesUsed/$vol.FilesTotal)*100)
                    Add-Member -InputObject $vol -MemberType NoteProperty -Name Controller -Value $Controller.Name
                    Add-Member -InputObject $vol -MemberType NoteProperty -Name '%FilesUsed' -Value $filesPercent
                }
            }
        }
        catch {
            $exception = $_
        }
        if ($exception -eq $null) {
            $returnValue = ($vols | Sort-Object -Property "Used" -Descending | Select-Object -Property "Controller","Name","FilesUsed","FilesTotal","%FilesUsed")
        }
        else {
            # on failure the error record itself is returned to the caller
            $returnValue = $exception
        }
        return $returnValue
    }
}
Categories: Netapp, Storage

NVRAM problems on Netapp 3200 series filers

December 28th, 2011

———————————————–

UPDATE FEB 2012 – Netapp have just released a firmware update for the battery and confirmed that all 32xx series controllers shipped before Feb 2012 are susceptible to this fault. You can read more (including instructions for applying the update – it’s NOT click, click, next) via the official Netapp KB article. I’ll be applying this to my production controllers soon so I’ll let you know if I encounter any problems.

———————————————–

Recently (Dec 2011) I’ve been experiencing a few issues with the newer Netapp filers at my work, specifically the 3240 controllers. There is currently a known issue with NVRAM battery charging which, if you’re not aware of it, can result in unplanned failovers of your Netapp controllers. This applies to the 3200 series (including the v3200 and SA320).

We have six of these controllers and my first warning (back at the beginning of November) was an autosupport email notification;

Symptom: BATLOW:HA Group Notification from <myfilername> (BATTERY LOW) WARNING

This message indicates that the NVRAM or NVMEM battery is below the minimum voltage required to safeguard data in the event of an unexpected disruption.

If the system has been halted and powered off for some time, this message is expected. This message repeats HOURLY as long as NVRAM or NVMEM battery is below the minimum voltage, if you are using ONTAP version 7.5, 8.1, or greater with an appliance that uses an NVMEM battery, the error will repeat WEEKLY.

When the storage controller is up and running, the battery will be charged to its normal operating capacity and this message should stop. However, if this message persists, there may be a problem with the NVRAM or NVMEM battery.

This was unexpected but a faulty backup battery wasn’t an immediate priority – after all it’s only required to protect against power failures or controller crashes which are pretty rare. A few days later it became a high priority after the controller failed over unexpectedly. This failure was actually triggered by the low battery level and is expected behaviour as documented in Netapp KB2011413 though it’s not made overly clear that a controller shutdown is the default action if the battery issue persists for 24 hours. I logged a call with Netapp but they were unaware of any systemic issues and despite pointing out that this was affecting all six of our controllers they simply sent replacement NVRAM batteries and suggested we swap them all out. I posted a question on the Netapp forums but at the time no-one else seemed to be having the same issue. The new batteries were duly fitted and the problem seemed to be resolved – I’ve since rechecked our battery charges and they’re stable at around 150 hours.

An update in an email we received from Netapp on the 22nd December now states that it’s a known firmware issue with a permanent fix currently expected in Feb 2012. Netapp advise that further downtime will be required to implement the fix when it’s made available.

Don’t ignore low battery alerts!


VMworld Copenhagen – Day one summary

October 19th, 2011

Today was officially the start of VMworld Copenhagen even though many people were here yesterday for partner day. The hands on labs are always popular at VMworld shows, and for all the reasons previously covered by others. I’ve done two labs so far (HOL01, Creating the Hybrid Cloud and HOL27, Netapp and VMware) which were both useful in different ways. There’s a good atmosphere and the technology behind the labs continues to evolve – this year vCenter Operations (and I think Netapp Insight Balance) are on display showing how the lab infrastructure is performing. There are more seats and the labs are open longer than last year (32 hours) which is good to see.

I spent a fair bit of time in the bloggers lounge, a small dedicated area with power, a separate wifi connection, and facilities for VMworld TV to broadcast live from. This is where you can often find John Troyer, the godfather of VMware’s social media scene, along with many of the Twitter names you’ve seen but never met in person. VMworld is a veritable ‘who’s who’ of the virtualisation world – I found myself sitting next to Scott Lowe for ten minutes before realising who he was and saying hi! Many of the people hanging around the bloggers lounge have been at VMworld many times so it’s a good place to get a feel for what’s hot and what’s not at this year’s conference. I got my first taste of VMworld TV via an invite to vSoupTV. Quite a few people mentioned that it felt quieter this year but as the attendance has been confirmed at over 7,000 it must be because there’s more space rather than fewer people.

The centre of the complex is used as a relaxation zone complete with plenty of seating, food, recliners (for those quick power naps), table tennis, table ice hockey, chess sets etc. It’s a good place to meet people as you pass through on your way from a general session to the labs. Free wifi is available throughout the Bella Centre but unfortunately it’s pretty temperamental – somewhat expected for a large conference with over 7,000 people. That wouldn’t be so bad but the VMworld iPhone app relies on internet access, so when the wifi isn’t working you can’t reference your schedule or register for sessions. When it does work the VMworld iPhone app is pretty good – you can check for upcoming sessions, get a filtered Twitter stream for a given session, and even check site maps.


Netapp certification – the NCDA

October 18th, 2011

I’ve been asked by a few people over the last few weeks about the Netapp Certified Data Administrator certification, better known as the NCDA. I was only exposed to Netapp technology a few years ago so definitely don’t claim any real expertise – I don’t know if these requests are due to an increased demand for engineers with Netapp knowledge or whether I’ve just surrounded myself with like minded people pursuing similar goals. Hopefully both!

When I took my exams a couple of years ago I considered putting together a study guide as there wasn’t much available and it suits the way I learn new material. Hanging out on the Netapp forums I picked up quite a few hints and tips along with some great links to example questions, web based learning and some documents produced by Netapp which summarise the knowledge you need for the exams. I never found the time (or motivation if truth be told) to put together my own study notes but maybe there’s still enough demand to make a collection of resources useful. As always real world, hands on experience is invaluable but the below are worth your time;

NOTE: I took two exams (ns0-153 and ns0-163) whereas you can now take a single exam (ns0-154) instead which covers ONTAP 8 – 7 mode.

In terms of difficulty the NCDA is an entry level exam – I’d put these exams nearer the VCP standard than the VCAP, more like an MCP than an MCSE. They’re multiple choice and while some questions require enterprise design knowledge (I got one on Metrocluster cabling) most are much more basic. Like the Cisco exams your certification expires after two years, so I should be retaking the exams if I want to stay current, but as not much has changed (ONTAP v8 is out but unless you’re running in cluster mode it’s almost the same as v7) it would be a paper exercise only and hence not worth it. Besides there’s SRM, vSphere5, vCD 1.5, Chef, Puppet and more to learn should I find any free time…

Categories: Netapp, Storage

Netapp Powershell Toolkit 1.5 released

July 29th, 2011

If you work with Netapp storage you’re probably familiar with the Netapp Powershell Toolkit. This fantastic free resource lets you easily create and run scripts against your filers using Powershell. We have a variety of filers, both 2000 and 3000 series, and while Netapp Operations Manager is pretty good at managing filers centrally there are times when you want specific functionality that’s not available out of the box. We’ve used the Toolkit to automate things such as;

  • Correctly set volume options, check for offline volumes, % max files used etc
  • Email a weekly report on snapshot usage, ASIS efficiency etc
  • Automate storage provisioning – create volumes, set options, set NFS exports and even populate the /etc/fstab file within the guest OS. This is a massive time saver when building twenty Oracle RAC servers!

Look out in the near future as I’m planning a blogpost about how we automate our provisioning – there’s some good stuff in there! Netapp have a white paper aimed at beginners to Powershell and the Netapp Toolkit – check out TR-3896.

Today (29th July 2011) v1.5 of the Toolkit has been released which adds the following features (amongst others);

  • Storage efficiency calculations. This will enable me to generate weekly reports on how effective our thin provisioning is for example.
  • ONTAP log parsing and monitoring.
  • Disk (LUN) signature manipulation. This lets you set a new signature on a LUN before presenting it to a host. We mainly use LUNs with VMware hosts which can be scripted (using PowerCLI) to resignature LUNs anyway, but I’m sure there are circumstances where this would be useful.

Check out the full list of new features here. You’ll need to login with a Netapp NOW account (Netapp On the Web) to download the toolkit. Since its release a year ago it’s been regularly updated with requested functionality – the developers are definitely listening to customers.

If you prefer a GUI based approach but still want all the customisable goodness scripting can offer, you can now use the Netapp Toolkit PowerGUI Powerpack by Glenn Sizemore. Simply download the powerpack from the PowerGUI website and import it into the freely available PowerGUI and you can point and click your way around. There’s even a video of Glenn showing how it works – not exactly a tutorial but it gives you an idea at least!
