———————————————–
UPDATE FEB 2012 – Netapp have just released a firmware update for the battery and confirmed that all 32xx series controllers shipped before Feb 2012 are susceptible to this fault. You can read more (including instructions for applying the update – it’s NOT click, click, next) via the official Netapp KB article. I’ll be applying this to my production controllers soon so I’ll let you know if I encounter any problems.
———————————————–
Recently (Dec 2011) I’ve been experiencing a few issues with the newer Netapp filers at my work, specifically the 3240 controllers. There is currently a known issue with NVRAM battery charging which, if you’re not aware of it, can result in unplanned failovers of your Netapp controllers. This applies to the 3200 series (including the v3200 and SA320).
We have six of these controllers and my first warning (back at the beginning of November) was an autosupport email notification:
Symptom: BATLOW:HA Group Notification from <myfilername> (BATTERY LOW) WARNING
This message indicates that the NVRAM or NVMEM battery is below the minimum voltage required to safeguard data in the event of an unexpected disruption.
If the system has been halted and powered off for some time, this message is expected. This message repeats HOURLY as long as NVRAM or NVMEM battery is below the minimum voltage. If you are using ONTAP version 7.5, 8.1, or greater with an appliance that uses an NVMEM battery, the error will repeat WEEKLY.
When the storage controller is up and running, the battery will be charged to its normal operating capacity and this message should stop. However, if this message persists, there may be a problem with the NVRAM or NVMEM battery.
This was unexpected but a faulty backup battery wasn’t an immediate priority – after all it’s only required to protect against power failures or controller crashes, which are pretty rare. A few days later it became a high priority after the controller failed over unexpectedly. This failure was actually triggered by the low battery level and is expected behaviour as documented in Netapp KB2011413, though it’s not made overly clear that a controller shutdown is the default action if the battery issue persists for 24 hours.
I logged a call with Netapp but they were unaware of any systemic issues. Despite my pointing out that this was affecting all six of our controllers, they simply sent replacement NVRAM batteries and suggested we swap them all out. I posted a question on the Netapp forums but at the time no-one else seemed to be having the same issue. The new batteries were duly fitted and the problem seemed to be resolved – I’ve since rechecked our battery charges and they’re stable at around 150 hours.
An update in an email we received from Netapp on the 22nd December now states that it’s a known firmware issue with a permanent fix currently expected in Feb 2012. Netapp advise that further downtime will be required to implement the fix when it’s made available.
Don’t ignore low battery alerts!
Are your Netapps affected?
Even if you haven’t had any autosupport warnings you should still check your controllers to ensure the NVRAM battery is fully charged. You can do this from the command line:
environment chassis
There are three settings to look for:
- the ‘Bat Run Time’ entry, which should show either ‘OK’ (for older 3000 series) or a number of hours (for 3100/3200 series). This should show around 140 hours if the batteries are fully charged.
- the ‘Charger Volt’ setting should be 8.2V. If the setting is ‘0’ then your battery isn’t charging.
- the ‘Charger Current’ setting should be 2000mA. If the setting is ‘0’ then your battery isn’t charging.
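If you have more than a couple of controllers, the three checks above are easy to script. What follows is only a rough Python sketch, not anything Netapp supply. It assumes you have already captured the text of 'environment chassis' (over SSH, say) and that each sensor sits on its own line as a label followed by its reading. The real output layout may well differ, so the sample text, the function names and the 72 hour floor (borrowed from the figure the firmware deployment instructions reportedly require) are all assumptions to adjust for your own systems.

```python
import re

def field_text(output, label):
    """Return the text following `label` on the first line that contains it."""
    for line in output.splitlines():
        pos = line.lower().find(label.lower())
        if pos != -1:
            return line[pos + len(label):].strip(" :\t")
    return None

def first_number(text):
    """Return the first number in `text` as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

def battery_warnings(output, min_hours=72.0):
    """Apply the three checks from the article; return a list of warnings.

    min_hours is an assumed floor, not a Netapp default.
    """
    warnings = []

    run_time = field_text(output, "Bat Run Time")
    if run_time is None:
        warnings.append("Bat Run Time: entry not found")
    elif not run_time.upper().startswith("OK"):  # older 3000 series show 'OK'
        hours = first_number(run_time)
        if hours is None:
            warnings.append("Bat Run Time: no reading found")
        elif hours < min_hours:
            warnings.append(f"Bat Run Time: only {hours:g} hours of charge")

    # A reading of 0 on either charger sensor means the battery isn't charging.
    for label in ("Charger Volt", "Charger Current"):
        text = field_text(output, label)
        value = first_number(text) if text is not None else None
        if value is None:
            warnings.append(f"{label}: no reading found")
        elif value == 0:
            warnings.append(f"{label}: reads 0, battery isn't charging")

    return warnings

if __name__ == "__main__":
    # Made-up sample in an assumed 'label: reading' layout, for illustration only.
    sample = "Bat Run Time: 150 h\nCharger Volt: 0 V\nCharger Current: 0 mA\n"
    for warning in battery_warnings(sample):
        print(warning)
```

An empty list means all three settings look healthy. Anything else is worth chasing before the battery drains, because once the 24 hour shutdown timer is running you have far fewer options.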
What’s the fix?
Unfortunately the fix requires shutting down the affected controllers. For most people using clustered controllers this won’t be a major issue but for those running NFS (which has a longer potential recovery time from handovers and failbacks) you may need to schedule downtime. If you run Oracle RAC (10g or 11g) you’ll need to amend the RAC configuration (CSS timeout) to prevent node failures.
See Netapp bug 536445 for a full description and suggested workaround.
Have you applied the firmware update yet? We applied it to 4 of our filers but it does not seem to fix the issue, as we are still seeing that the battery current is zero.
We’ve not yet, so I can’t help unfortunately. I’m still hoping I’ll get downtime approved sometime soon so I can apply the firmware. We’ve just racked another 3240 which I can test the firmware process with but unfortunately (?) the new controller is not yet exhibiting the FCMTO issue so I can’t vouch for the fix there either. Maybe someone else can confirm via comments that it’s worked for them?
We were bitten in a big way yesterday by this one. Two filers started sending alerts that their batteries were low. We then got the warning that one controller would halt in 23hrs. We downloaded 8.0.2P3, which we had been advised would fix the fault. We upgraded the controllers on one filer but noted that the firmware to fix the charging issue was NOT included and had to get that separately.
The other filer, our ESX backend, was a different story altogether. We upgraded one controller but the 2nd was a disaster: having done the takeover and applied the upgrade, the giveback failed as the battery didn’t have sufficient charge in it. Much brown stuff hit the whirly thing. To cut a long story short, we lost all the data stores on that filer and dropped 50% of our server fleet! It took many hours once the filer came back up to restart all the servers and check services etc.
NOT an experience I would recommend. I am pretty miffed that IBM/NetApp did NOT proactively tell us about an issue of this magnitude. If we had applied the fixes before the batteries were drained we would probably have been ok.
Good to know. One of our filers now has a flat battery so we’ll need to be very careful that we don’t have similar issues. The deployment instructions for the new firmware mention specifically that you need >72hrs charge in the NVRAM battery for ONTAP to start. Did you check this before the giveback?
I agree completely that Netapp have dropped the ball on notification around this issue. We encountered it early on as we bought the 3240 series as soon as they were available, yet when we started to encounter issues our contacts at Netapp were not proactive at keeping us informed. Word of mouth and forums (including the Netapp ones, so by providing those they are obviously facilitating information exchange) were our best sources of info.
In our case, we had no choice but to attempt the ‘giveback’. The controllers were already telling us that they were going to halt and we didn’t have 72 hrs! We did manage the giveback about 30 minutes after the updated controller rebooted, so the batteries do seem to charge relatively quickly.
Reminds me a bit of the VMware time bomb of a couple of years back.
I had this recently and only noticed it when looking in the DFM logs. I thought we would have to upgrade ONTAP but we didn’t. It just required an update to the service processor, so we didn’t have to do a cf takeover.