Category Archives: Oracle

Preventing Oracle RAC node evictions during a Netapp failover

While undertaking some scheduled maintenance on our Netapp shared storage (due to an NVRAM issue), we discovered that some of our Oracle applications didn’t handle the controller outage as gracefully as we expected. In particular, several Oracle RAC nodes in our dev and test environments rebooted during the Netapp downtime. Strangely, this only affected our virtual Oracle RAC nodes, so our initial diagnosis focused on the virtual infrastructure.

Upon further investigation, however, we discovered that there are timeouts in the Oracle RAC clusterware settings which can result in node reboots (referred to as evictions) to preserve data integrity. This affects both Oracle 10g and 11g RAC database servers, and the fix is similar for both. NOTE: We’ve been running Oracle 10g for a few years but hadn’t hit this problem before, as its default timeout value of 60 seconds is higher than the 30 second default in 11g.

Both Netapp and Oracle publish guidance on this issue; see the Further Reading section below for the relevant documents.

The above guidance focuses on the DiskTimeOut parameter (the voting disk timeout), as this is impacted when the voting disk resides on a Netapp. What it doesn’t cover is the case where the underlying Linux OS also resides on the affected Netapp, as it can with a virtual Oracle server (assuming you want HA/DRS). In this case there is a second timeout value, misscount, which is shorter than the disk timeout (typically 30 seconds instead of 200). If a node can’t reach any of the other RAC nodes within the misscount timeframe it will start split-brain resolution and probably evict itself from the cluster by rebooting. When the Netapp failed over, our VMs were freezing for longer than 30 seconds, causing the reboots. After we increased the network timeout we were able to fail over our Netapps with no impact on the virtual RAC servers.
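
Before changing anything it’s worth checking what your cluster is currently using. Both values can be queried with crsctl on one of the RAC nodes (a quick sketch; on older 10g releases the disktimeout parameter may not exist, so treat the second command as version-dependent);

# $GRID_HOME/bin/crsctl get css misscount
# $GRID_HOME/bin/crsctl get css disktimeout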

NOTE: A cluster failover (CFO) is not the only event which can trigger this behaviour. Anything which impacts the availability of the filesystem, such as I/O failures (faulty cables, failed FC switches etc.) or delays (multipathing changes), can have a similar impact. Bear in mind that changing the timeout parameters affects the availability of your RAC cluster: increasing the value means a longer period before the other cluster nodes react to a genuine node failure.

Configuring the clusterware network timeouts

The changes need to be applied within the Oracle application stack rather than at the Netapp or VMware layer. On the RAC database server check the cssd.log logfile to understand the cause of the node eviction. If you think it’s due to a timeout you can change it using the below command;

# $GRID_HOME/bin/crsctl set css misscount 180 

To check the new setting has been applied;

# $GRID_HOME/bin/crsctl get css misscount
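
If your voting disk also lives on the Netapp you may want to raise the voting disk timeout at the same time (this is the DiskTimeOut parameter covered in the guidance above and in NOTE 284752.1 under Further Reading). A sketch using the 200 second figure mentioned earlier; check the Oracle note for the value recommended for your version;

# $GRID_HOME/bin/crsctl set css disktimeout 200
# $GRID_HOME/bin/crsctl get css disktimeout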

The clusterware needs a restart for these new values to take effect, so bounce the cluster;

# $GRID_HOME/bin/crs_stop -all
# $GRID_HOME/bin/crs_start -all
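
Once the stack is back up it’s worth confirming the clusterware is healthy before calling it done. A quick sanity check (crsctl check crs works on both 10g and 11g; the cluster-wide check is only available from 11gR2 onwards);

# $GRID_HOME/bin/crsctl check crs
# $GRID_HOME/bin/crsctl check cluster -all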

Further Reading

Netapp Best Practice Guidelines for Oracle Database 11g (Netapp TR3633). Section 4.7 in particular is relevant.

Netapp for Oracle database (Netapp Verified Architecture)

Oracle 10gR2 RAC: Setting up Oracle Cluster Synchronization Services with NetApp Storage for High Availability (Netapp TR3555).

How long it takes for Standard active/active cluster to failover

Node evictions in RAC environment

Troubleshooting broken clusterware

Oracle support docs (login required);

  • NOTE 284752.1 – 10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout
  • NOTE 559365.1 – Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
  • NOTE 265769.1 – Troubleshooting 10g and 11.1 Clusterware Reboots
  • NOTE 783456.1 – CRS Diagnostic Data Gathering: A Summary of Common tools and their Usage

Why I take Oracle’s virtualisation licencing policy personally

It was a typical Friday. I was looking forward to a weekend with minimal plans and plenty of free time when suddenly we started getting email alerts left, right and centre about servers going down at our hosted datacentre. First one server, then eight, then fans, power supplies and environmental alerts went ballistic. There goes the weekend, I thought…

It turned out that heavy rain had caused a leak in the roof at our datacentre (bad hosting company, go stand in the corner), resulting in water falling onto one of our production (isn’t it always?) HP bladecentres. Electronics and water obviously don’t mix well, but the HP hardware managed surprisingly well. The fans at the top of the rack failed, which led to the eight blades at the top overheating and shutting down automatically. That probably saved the data and the blade hardware.

So where does Oracle licencing fit into this? Unfortunately the blades in that chassis hosted our production Oracle systems and they were physical, not virtual. This was largely due to Oracle’s infamous support stance on VMware, as we run most other systems virtually. So because of Oracle’s desire for stack dominance I lost another night of my life to IT support.

Sigh.

Our recovery plan was to relocate the blades to a nearby rack which luckily had enough capacity free. Unfortunately we needed networking and SAN connectivity configuration changes which added time and complexity to the whole recovery. Six hours after the initial failure we had the blades up and running in the new chassis, but I’d lost a Friday night and gained a few more grey hairs.

How simple could this have been? In contrast, we already had a VMware ESX cluster spanning the affected chassis and the recovery chassis. Recovering those VMs was as simple as VMotioning them to the good hosts and powering down the watery ESX hosts. About ten mins would have done it. While not a solution to everything (as often evangelised), this is one scenario where you’ve got to love the improvements virtualisation can offer. Simples!