Over the past year and a half, and especially in the past three months, I have been in complete RAC installation and upgrade mode. Inevitably, when such projects come along, problems requiring detailed and exhaustively tedious troubleshooting will come up. It's been my experience that even the largest set of books, or the internet itself, cannot contain the exact sequence of problems you may encounter. It's not that the problems aren't documented, just that they are rarely described under the conditions and circumstances that match your situation.
Just recently, a colleague asked for assistance with the simplest step in the upgrade process from RAC 18.104.22.168 to 22.214.171.124: the software installation. The following message was received before the guts of the installation could begin:
"[INS-40418] The installer has detected that Oracle Clusterware is not running on local node. Oracle recommends that you upgrade Oracle Clusterware in a rolling manner."
Now, needless to say, the above was untrue. We verified this using:
crsctl stat res -t -init
ps -ef | grep ora (which shows oracle-owned processes at the Linux level)
ps -ef | grep grid (which shows grid-owned processes at the Linux level)
So the clusterware and associated processes WERE running on the local node at the time the installation was attempted. We understand this is necessary so the upgrade process can collect information for the configuration of the new version. We had just done several installs, and had a 100+ page detailed procedure which we developed to ensure things were done consistently, and yet this condition had never occurred!
Luckily, this is a testing platform, where a delay at least does not impact customers or revenue. Still, at times when even experienced professionals are exasperated and clueless, Oracle support must be contacted. This ensures that the latest information on the software is available, and verifies that we're using appropriate methods of installation and configuration. To get the issue resolved quickly, we entered an SR (Service Request) on the Oracle Support website (support.oracle.com) at level 2 (business impact, but workarounds are available). The support analyst was prompt, but asked for an awful lot of information about the environment in a short period of time. Luckily, Oracle now has diagnostic tools such as rdalite and diagcollection.pl to collect the data quickly. An additional piece of luck - this was only a two-node RAC. We have installations at this site with up to 26 nodes! But still, deadlines are deadlines, and critical testing by the QA team depended on this environment.
After two days of inquisition by an Oracle analyst, including a live webcast and a deep dive into the entire RAC infrastructure, the problem was still unsolved. Then my colleague Jayanth, skimming through the original installation and configuration to collect more data for Oracle, stumbled across the actual problem. The hosts file will not be published here, to protect the client's configuration security.
You had to really look for it, and neither the DBA in charge, this writer, the Oracle analyst, nor the various diagnostic scripts that Oracle recommends caught it, but there were TWO ENTRIES for IP xxx.xx.xx.xx in the hosts file. Even though the message from Oracle's installation program says nothing about it, that is the TRUE REASON why the installation could not proceed.
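Since a duplicate hosts-file entry slipped past several pairs of eyes, a quick automated check may be worth adding to a pre-install checklist. What follows is only a minimal sketch in Python, not an Oracle tool and not what we ran at the time. The function name find_duplicate_hosts_entries and the sample addresses are made up for illustration, and it assumes the usual hosts layout of an address followed by one or more names, with # starting a comment.

```python
from collections import defaultdict


def find_duplicate_hosts_entries(text):
    """Scan hosts-file text (hypothetical helper, for illustration only).

    Returns a pair (dup_ips, dup_names):
      dup_ips   maps an address to the line numbers it appears on, when > 1
      dup_names maps a hostname to a sorted list of its addresses, when > 1
    """
    ip_lines = defaultdict(list)
    name_ips = defaultdict(set)
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        ip, *names = line.split()
        ip_lines[ip].append(lineno)
        for name in names:
            name_ips[name.lower()].add(ip)  # hostnames are case-insensitive
    dup_ips = {ip: nums for ip, nums in ip_lines.items() if len(nums) > 1}
    dup_names = {n: sorted(ips) for n, ips in name_ips.items() if len(ips) > 1}
    return dup_ips, dup_names


# Made-up sample using documentation-range addresses, not a real configuration.
sample = """\
192.0.2.10  racnode1
192.0.2.11  racnode2
192.0.2.10  racnode1-old   # the kind of stray line that is easy to miss
"""

dup_ips, dup_names = find_duplicate_hosts_entries(sample)
for ip, nums in dup_ips.items():
    print(f"{ip} appears on lines {nums}")
for name, ips in dup_names.items():
    print(f"{name} maps to more than one address: {ips}")
```

To check a real server, pass in the contents of /etc/hosts instead of the sample. Not every hit is an error (localhost commonly maps to both 127.0.0.1 and ::1), so treat the output as a list of lines to eyeball, not a verdict.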
All of this makes me wonder if diagnostic messages are ever of any use at all! They are often wrong, or simply misdirection, causing you to take even longer to do a relatively simple task. The good news is that, after this issue was finally discovered, the upgrade and patching were done in only a few hours, which is a very short period of time in RACworld!
I've gone on a lot longer about this particular issue than I meant to. I hope it someday helps someone else. My next post will go into more detail on how to engage support, what to do while you're working with them, and what your backup plan should be. Good day!