On Friday morning we woke to a nasty bit of news. That is, all 4 of our Exadata machines were down! That's right, development/test, production support (QA), production and production prime (standby) all crashed! Talk about Murphy's Law.
We are still doing dry-run migrations and testing, so fortunately the machines are not live yet, but it does not give the customer a warm and fuzzy feeling. What if we had been live!? So of course questions started being asked, flags went up as to whether this platform migration was a mistake, and we had to do damage control. It did not help matters that we had just signed a managed services contract and it took so long for notification (almost 24 hours) and resolution (another 6+ hours). In their defense, the contract was only signed a few days to a week before this incident, and Grid Control was not yet available due to timing and some build issues (but we had signed, so where is our SLA?).
The potential impact was significant, as this was to be our final test migration before the development cut-over. If the machines had stayed down, our already tight timelines would have been severely impacted. Fortunately, service was restored and we got through the migration successfully. So all's well that ends well.
So what happened?
It seems the InfiniBand (IB) network was fluctuating, which meant that all communication over this network was effectively down. That of course means no RAC interconnect, no ASM, no Storage Servers, nothing. The only thing the 4 machines had in common was the IB network, through which they are connected via each machine's IB spine switch, and no changes had been made recently. So our theory was that it had to be either a firmware issue on the IB switches or a configuration issue (still within the IB network) that had finally caught up to us.
I won't get into the details here (again); I just thought I'd share what happened, the workaround and the solution. As it turns out, we were pretty accurate in our theory. When you have connected machines, the spine switch should be configured as the subnet manager (and not the leaf switch), but there is also a firmware bug that arises when you have connected machines, and it causes a panic. See MOS bug ID 10165319 (unfortunately this is an internal bug). Since we were already at the latest firmware version (1.1.3-2), we will have to wait until 1.3.3-2 is available, which may be another 2 weeks away (reference MOS article "Database Machine and Exadata Storage Server 11g Release 2 (11.2) Supported Versions (Doc ID 888828.1)"). Version 1.3.3-1 would also solve our issue, but that was pulled back by Oracle to fix a few more items. We've not made a decision yet, but I suspect we will wait until 1.3.3-2 is GA instead of applying 1.3.3-1. In the interim, we will need to monitor the logs and reboot the spine switch before it crashes all the machines again. Oh, I think this will be fun times.
As I've mentioned, we did get everything up and running (we rebooted each switch, then the Storage Servers, then the DB nodes), so we got to do another dry-run migration (in record time) and without further incident.
Just a note
Things are different depending on whether you have a single machine, two to three connected machines, or 4 or more connected machines (we fell into the latter). For a single machine there are no issues; this will not happen. For two to three connected machines, ensure the spine switches are set as the subnet manager with priority 8 and the leaf switches are set to priority 5, so that the leaf switches are not the subnet managers. For 4 or more connected machines, ensure the spine switches are set as the subnet manager (priority 8) and the leaf switches are set to priority 5 with the subnet manager process disabled.
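To make the three cases easier to check at a glance, here is a minimal Python sketch that encodes the rules above as a lookup. It is only an illustration, not an Oracle tool: the names SwitchSettings and recommended_settings are made up for this post, and for the two-to-three machine case I am assuming the subnet manager process stays running on the leaf switches (just at the lower priority), since the note only says to disable it for 4 or more machines.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SwitchSettings:
    """Hypothetical container for the two settings discussed in this post."""
    priority: int              # subnet manager priority on the switch
    run_subnet_manager: bool   # whether the subnet manager process runs


def recommended_settings(connected_machines: int) -> Optional[Dict[str, SwitchSettings]]:
    """Return the spine/leaf settings described above for a given rack count.

    Returns None for a single machine, where the issue does not arise and
    this post recommends no change.
    """
    if connected_machines < 1:
        raise ValueError("connected_machines must be at least 1")

    if connected_machines == 1:
        # Single machine: no issue, nothing to change.
        return None

    # Spine switches act as the subnet manager with priority 8 in both cases.
    spine = SwitchSettings(priority=8, run_subnet_manager=True)

    if connected_machines <= 3:
        # Two to three machines: leaf switches at priority 5.
        # Assumption: the process keeps running, it just never wins over
        # the spine because of the lower priority.
        leaf = SwitchSettings(priority=5, run_subnet_manager=True)
    else:
        # Four or more machines: leaf switches at priority 5 with the
        # subnet manager process disabled.
        leaf = SwitchSettings(priority=5, run_subnet_manager=False)

    return {"spine": spine, "leaf": leaf}


if __name__ == "__main__":
    for machines in (1, 2, 4):
        print(machines, recommended_settings(machines))
```

In our case (4 connected machines) the lookup returns priority 8 for the spines and priority 5 with the process disabled for the leaf switches, which matches the configuration half of the fix. The firmware bug still needs the patch.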
I've updated this post because, upon re-reading, I found too much was said concerning the incident and not enough about the actual issue and root cause. Also, the new firmware (1.3.3-2) has been made available, so you can get it on MOS (patch 12373676, which also requires 11891229). Note 888828.1 has been updated to reflect this new version as well.