Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internal & E-business suite.

Clusterware Startup

Posted by Riyaj Shamsudeen on June 5, 2013

The restart of a UNIX server calls initialization scripts to start processes and daemons. Every platform has a unique directory structure and follows its own method to implement the server startup sequence. On the Linux platform (prior to Linux 6), initialization scripts are started by calling scripts in the /etc/rcX.d directories, where X denotes the run level of the UNIX server. Typically, Clusterware is started at run level 3. For example, the ohasd daemon is started by the /etc/rc3.d/S96ohasd file, with start supplied as an argument. The file S96ohasd is a link to /etc/init.d/ohasd.

S96ohasd -> /etc/init.d/ohasd

/etc/rc3.d/S96ohasd start  # init daemon starting ohasd.

Similarly, a server shutdown calls scripts in the rcX.d directories; for example, ohasd is shut down by calling the K15ohasd script:

K15ohasd -> /etc/init.d/ohasd
/etc/rc3.d/K15ohasd stop  #UNIX daemons stopping ohasd

In summary, server startup calls files matching the pattern S* in the /etc/rcX.d directories. The scripts are called in the lexical order of their names. For example, S10cscape is called before S96ohasd, as S10cscape occurs earlier in the lexical sequence.
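This ordering can be demonstrated with a plain sort; a small sketch using the two script names from the example above:

```shell
# Demonstrate the lexical calling order of rc scripts: sort orders the names,
# so S10cscape comes out before S96ohasd, just as init would call them.
printf '%s\n' S96ohasd S10cscape | sort
```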

Google if you want to learn more about the rc startup sequence. Of course, Linux 6 introduces the Upstart feature, and the mechanism is a little different: http://en.wikipedia.org/wiki/Upstart

That’s not the whole story!

Have you ever wondered why 'crsctl start crs' returns immediately? You can guess that Clusterware is started in the background, as the command returns to the UNIX prompt almost immediately. Executing the crsctl command merely modifies the ohasdrun file content to 'restart'; it doesn't actually perform the task of starting the Clusterware. The init.ohasd daemon reads the ohasdrun file every few seconds and starts the Clusterware if the file content has changed to 'restart'.
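A minimal sketch of this polling pattern (the function name and file path are illustrative, not the actual init.ohasd implementation):

```shell
#!/bin/sh
# Sketch: one iteration of an init.ohasd-style check. The real daemon repeats
# this every few seconds; the helper name and file path here are made up.
check_ohasdrun() {
  if [ "$(cat "$1" 2>/dev/null)" = "restart" ]; then
    echo "start clusterware"   # the real daemon would launch the stack here
  fi
}

# A 'crsctl start crs' equivalent just writes the word restart into the file:
OHASDRUN=$(mktemp)
echo restart > "$OHASDRUN"
check_ohasdrun "$OHASDRUN"     # prints: start clusterware
rm -f "$OHASDRUN"
```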

# cat /etc/oracle/scls_scr/oel6rac1/root/ohasdrun
restart

If you stop HAS using 'crsctl stop has', the ohasdrun file content is modified to 'stop', and so the init.ohasd daemon will not restart the Clusterware. However, the stop command is synchronous and performs the shutdown of the Clusterware as well.

# crsctl stop has
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'oel6rac1'
CRS-2673: Attempting to stop 'ora.crsd' on 'oel6rac1'
..

The content of ohasdrun is modified to stop:

# cat  /etc/oracle/scls_scr/oel6rac1/root/ohasdrun
stop # 

In a nutshell, the init.ohasd daemon monitors the ohasdrun file and starts the Clusterware stack if the value in the file is modified to 'restart'.

Inittab

The init.ohasd daemon is essential for Clusterware startup. Even if the Clusterware is not running on a node, you can start that node's Clusterware from a different node. How does that work? init.ohasd is the reason.

The init.ohasd daemon is started from /etc/inittab. Entries in the inittab are monitored by the init daemon (pid=1), and the init daemon reacts if the inittab file is modified. The init daemon monitors all processes listed in the inittab file and reacts according to the configuration there. For example, if init.ohasd fails for some reason, it is immediately restarted by the init daemon.

Following is an example entry in the inittab file. Fields are separated by a colon: the second field indicates that init.ohasd will be started in run level 3, and the third field is the action field. respawn in the action field means that if the target process exists, init just continues scanning the inittab file; if the target process does not exist, init restarts the process.

#cat /etc/inittab
…
h1:3:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
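The fields of that entry can be pulled apart with cut, splitting on the colon separator:

```shell
# Split the inittab entry on ':' to inspect its fields.
entry='h1:3:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null'
echo "$entry" | cut -d: -f1   # id field: h1
echo "$entry" | cut -d: -f2   # run levels: 3
echo "$entry" | cut -d: -f3   # action: respawn
```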

If you issue a Clusterware startup command from a remote node, a message is sent to the init.ohasd daemon on the target node, and that daemon initiates the Clusterware startup. So, init.ohasd will always be running, irrespective of whether the Clusterware is running or not.

You can use strace on init.ohasd to verify this behavior. Following are a few relevant lines from the strace output of the init.ohasd process:

…
5641  1369083862.828494 open("/etc/oracle/scls_scr/oel6rac1/root/ohasdrun", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
5641  1369083862.828581 dup2(3, 1)      = 1
5641  1369083862.828606 close(3)        = 0
5641  1369083862.828631 execve("/bin/echo", ["/bin/echo", "restart"], [/* 12 vars */]) = 0
…

Just for fun!

So, what happens if I manually modify ohasdrun to 'restart'? I copied ohasdrun to a temporary file (/tmp/a1.lst) and stopped the Clusterware.

cp /etc/oracle/scls_scr/oel6rac1/root/ohasdrun /tmp/a1.lst 
# crsctl stop has

I verified that the Clusterware was completely stopped. Now, I will copy the file back, overwriting ohasdrun:

# cat /tmp/a1.lst
restart
# cp /tmp/a1.lst  /etc/oracle/scls_scr/oel6rac1/root/ohasdrun

After a minute or so, I see that the Clusterware processes are started. Not that you would use this type of hack in a production cluster, but this test proves my point.

It’s also important not to remove the files in the scls_scr directories. Any removal of the files underneath the scls_scr directory structure can lead to an invalid configuration.

There are also two more files in the scls_scr directory structure. The ohasdstr file decides whether the HAS daemon should be started automatically or not. For example, if you execute 'crsctl disable has', that command modifies the ohasdstr file content to 'disable'. Similarly, the crsstart file controls CRS daemon startup. Again, you should use the recommended commands to control the startup behavior, rather than modifying any of these files directly.
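A hedged sketch of how such a flag file can gate autostart (the helper name is made up, and the 'disable' value follows the description above; the exact strings Oracle writes may differ):

```shell
#!/bin/sh
# Sketch: gate autostart on a flag file's content, as ohasdstr does for HAS.
# Helper name is hypothetical; the real file lives under
# /etc/oracle/scls_scr/<node>/root.
should_autostart() {
  case "$(cat "$1" 2>/dev/null)" in
    disable) return 1 ;;   # what 'crsctl disable has' is described as writing
    *)       return 0 ;;
  esac
}

f=$(mktemp)
echo disable > "$f"
should_autostart "$f" || echo "autostart disabled"   # prints: autostart disabled
rm -f "$f"
```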

11.2.0.1 and HugePages

If you tried to configure HugePages in 11.2.0.1 Clusterware by increasing the memlock kernel parameter for the GRID and database owners, you would have realized that the database doesn't use HugePages if it is started by the Clusterware. A database startup using sqlplus will use HugePages, but a database startup using srvctl may not.

As new processes are cloned from the init.ohasd daemon, user-level memlock limit changes are not correctly reflected in an already-running process until init.ohasd is restarted. The only recommended way to resolve the problem is to restart the node completely (not just the Clusterware), as the init.ohasd daemon must be restarted to pick up the user-level limits.

Version 11.2.0.2 fixes this issue by explicitly calling the ulimit command from the /etc/init.d/ohasd files.
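A sketch of that kind of fix: an init script raising the locked-memory (memlock) limit for itself before spawning its daemon, so children inherit the higher limit (the values and the fallback are illustrative, not the actual 11.2.0.2 code):

```shell
#!/bin/sh
# Sketch of an init-script fragment: raise the locked-memory limit for this
# shell and everything it spawns. 'unlimited' may be refused for non-root
# users; the KB fallback value here is illustrative.
ulimit -l unlimited 2>/dev/null || ulimit -l 4194304 2>/dev/null
ulimit -l   # show the effective limit that spawned daemons would inherit
```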


Summary

In summary, the init.ohasd process is an important process, and the files underneath the scls_scr directory control the startup behavior. This also means that if a server is restarted, you don't need to explicitly stop the Clusterware; you can let the server shutdown and startup sequences handle the Clusterware restart.

PS: Some of you have asked me why my blogging frequency has decreased. I have been extremely busy co-authoring a book on RAC titled “Expert RAC Practices 12c”. We are covering lots of interesting stuff in the book. As soon as the 12c release is in production, we can release the book as well.

16 Responses to “Clusterware Startup”

  1. Amos said

    very good, thanks!

  2. Mike said

    Awesome info. Much appreciated!

  3. NT said

    Good Stuff, very useful info.

  4. satya said

    Very Good Stuff .. Keep posting … Thanks a lot …

  5. Mohammed said

    Thanks for sharing the details…

    One small question out of topic related to VIP (virtual IP in RAC)

    If node 1 failed then VIP resources/services will move/shifted to Node 2. Very true…

    As per theory all old connection/sessions which are connected to VIP1 will be discarded from node 2 by RE-ARP Protocol broadcast message.

    Then how oracle handle basic + session failed over property..

    For Example:

    If I do select * from dba_source ..running on node 1…… & node 1 failed for some unexpected reason then same session will move to node 2 and continue select operation without any interruption. (How Oracle handle this request.. ?)

    Please request to shed some light on this.

    Thanks

    Yousuf,

    • Mohammed
      TAF is handled mostly by the client side JDBC/OCI driver. State of the connection, including cursor position, is remembered by the driver. Upon session failures, the driver requests a new connection to the listener process, replays the SQL statement, and returns rows from the cursor position.
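For context, this kind of TAF is typically configured with a FAILOVER_MODE clause in the client's tnsnames.ora; a sketch with illustrative service and host names:

```
MYSERVICE =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = myservice)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 3))
    )
  )
```

With TYPE = SELECT, the driver replays the statement after failover and scrolls to the previous cursor position, matching the behavior described above.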

  6. ash said

    It was good to know. Do you happen to know the process of starting 11gr2 crs/asm as active while 12c cluster passive or standby until you decide to start 12c cluster and stop 11gr2. On same machine?

    • Hello
      Thanks for reading. So, you are trying to install 12c clusterware, but not start using that 12c version until a later time. Do I understand your question correctly?

      If so, you can install the software only, and not run the rootupgrade.sh script. That should keep the cluster at 11gR2 until you are ready to switch over. However, considering that it doesn’t take too long to install the 12c GI software, I wouldn’t recommend taking that unnecessary risk, however minimal it is.

      Cheers
      Riyaj

      • ash said

        Hi
        Thank you. Yes, but configuring too.
        Basically, I have 11gR2 GI and would like to install and configure 12cR1 GI on the same cluster (2 nodes). Then I can switch and switch back between versions, 11.2 to 12.1 and vice versa. Therefore, what are the common cluster files that can be moved before the 12.1 GI install and configure, so as not to overwrite the current configuration, say 11.2? I have identified those files, at least 10, but I am not sure what conflicts can occur when stopping the 12c cluster and starting the 11gR2 cluster on the same cluster using the same owner and network. Better practice is to have a new setup on the same GI, excluding host name and IP. Anyway, I was trying to find a clue in your latest book, “Expert Oracle 12c RAC.” The files to be moved after stopping and before starting CRS, based on the current version, to either the higher or lower clusterware:
        inittab, oratab, ohasd, init.ohasd, ocr.loc, olr.loc, /etc/rc3.d/S9ohasd, K16ohasd, plus some files from /var/tmp/.oracle. Let me know if I am missing any. So far my cluster verification is successful, and I will install 12cR1 GI tomorrow. Additionally, I have created an ASM LUN for the OCR and vote disk based on the higher version, at least avoiding an OCR/vote disk conflict when starting up CRS.
        Thank you

      • I can’t recollect any other files that you need to back up. As long as the voting disk and OCR are on different devices, you may be able to do this. However, I have not done this exercise personally, so I can’t confirm it will work.

        I will be interested to know how this works out. Thanks..

  7. ash said

    Hi
    Thank you. Yes!! It is possible to do it, and I will provide the steps for how to do it. After backing up the listed files, I was able to install and configure the 12.1 GI. So, I am able to swap 11.2 GI to 12.1 GI with no issue reported. For the OCR and vote disk, I created a separate disk group on 12.1; therefore, there is no conflict between the 11.2 and 12.1 OCR/vote disks. One thing I noticed: the disk group that was created on 11.2 was dismounted, as expected, on the 12.1 cluster/ASM, but its ASM compatibility is 0.0.0.0.0 on version 12.1.0.1. I am not sure whether the directory /var/tmp/.oracle will be a conflict when I swap 12.1 back to 11.2, since during installation of the clusterware I received a warning and had to move it to another location. /usr/local/bin also has to be backed up before it is overwritten by root.sh on both occasions (GI and RDBMS), since they are in a common location. I may not worry too much, since this is just setting up the env.
    So it seems that whatever files I moved turned out to be the correct conflict files, which brought up the 12.1 GI without any issue, including the RDBMS. I will follow up tomorrow on the 12.1 to 11.2 swap.

  8. ash said

    Hi
    So I was able to conduct the swap between releases successfully. No issue while bringing up the CRS, one node at a time. The listed files are the correct ones to make it happen. The steps are simple: shut down the cluster/CRS. Move those files to a secure location. Allocate a new disk and use ASMLib to label the LUN. Create a new mount point for the new version. I used the same user id for the 12c install and configuration. Start the install and configure the 12c clusterware. Then install and configure the database, including a PDB. Do some sanity checks.
    Shut down the new release and back up those files.
    Overwrite those files with the 11gR2 backup files on top of the 12cR1 files and set up the 11gR2 env. Start up the CRS of the 11gR2 version. And you can do the same steps from 12cR1 to 11gR2.
    Thank you.
    -ashish

  9. Ash said

    Thank you. Yes, this was another best effort of mine on 12c setup. The first deployment was about plugging in a PDB (from a non-CDB and from a CDB) using the RMAN active feature, with a complex environment setup including Data Guard and Golden Gate. As a reader, I always look deeply, page by page and word by word, into any related articles and books for new concepts to accomplish given tasks. I wish I could share those 50 steps, but I can’t upload or attach the doc since I am working in a finance env. Again, thank you, and I will continue reading your book.
    -Ashish
