Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internals & E-Business Suite.

Posts Tagged ‘11g’

Oracle Open World 2011 – My presentation on a RAC topic

Posted by Riyaj Shamsudeen on September 28, 2011

I will be talking about "Administering Parallel Execution in RAC", with demos, on Sunday morning from 9AM to 10AM (session ID 28060). This is part of the IOUG RAC SIG presentation series. I hope you will enjoy the content and demos I have prepared.

I know it is early, but I hope to see you there!
BTW, if you have attended my RAC Advanced Troubleshooting class series, please don't hesitate to introduce yourself when we meet.

Session details:
Session ID: 28060
Session Title: IOUG: Administering Parallel Execution in Oracle RAC
Venue / Room: Moscone West- 2005
Date and Time: 10/2/11, 9:00 – 10:00

Update: I just completed this session at IOUG. Thank you for coming, if you were in the room.
You can download the PDF file from:
PX execution in RAC

Posted in 11g, Performance tuning, Presentations, RAC | 4 Comments »

What’s in a voting disk?

Posted by Riyaj Shamsudeen on October 29, 2010

Introduction

In RAC, CSSD processes (Cluster Synchronization Services Daemon) monitor the health of RAC nodes using two distinct heartbeats: a network heartbeat and a disk heartbeat. Healthy nodes continuously exchange both network and disk heartbeats. A break in a heartbeat indicates a possible error scenario. There are a few different scenarios possible with missing heartbeats:

  1. Network heartbeat is successful, but disk heartbeat is missed.
  2. Disk heartbeat is successful, but network heartbeat is missed.
  3. Both heartbeats failed.

In addition, with numerous nodes, other scenarios are possible too. A few of them:

  1. Nodes have split into N sets of nodes, communicating within each set, but not with members of other sets.
  2. Just one node is unhealthy.

Nodes with quorum will maintain active membership of the cluster, and the other node(s) will be fenced/rebooted. I can't discuss all possible scenarios in one blog entry, so here we will discuss only the simplistic case of a 2-node cluster with a single voting disk.
Voting disks are used to track the disk heartbeats. It is preferable to have an odd number of voting disks, at least 3 of them.

CSSD is a multi-threaded process

Voting disks are shared between the nodes and must be visible from all nodes, to state the obvious. The CSSD process is multi-threaded: one thread monitors the disk heartbeat, and another thread monitors the network heartbeat. The disk HB (heartbeat) thread is scheduled approximately every second, and it verifies the disk heartbeat from all active nodes in the cluster. A pstack (Solaris) of the CSSD process shows these threads.
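For example, here is a minimal sketch of how one might observe this on Solaris. The process name ocssd.bin and the exact truss flags are assumptions based on my test VM; adjust for your platform:

# list the threads (LWPs) of the CSSD process
pstack $(pgrep ocssd.bin) | grep 'lwp#'

# trace only the voting-disk I/O calls, dumping the read/write buffers
# (-r all / -w all produce listings like the ones shown below)
truss -t pread,pwrite -r all -w all -p $(pgrep ocssd.bin)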

Details: write calls

The CSSD process in each RAC node maintains its heartbeat in a block of one OS-block size in the voting disk. In the Solaris VM that I was testing with, the OS block size is 512 bytes (we will discuss Solaris alone in this post). In addition to maintaining its own disk block, each CSSD process also monitors the disk blocks maintained by the CSSD processes running in the other cluster nodes.

The CSSD process writes a 512-byte block to the voting disk at a specific offset. The written block has a header area with the node name. For example, in the pwrite call below, the node name solrac1 is in the first few bytes of the block. The third line printed below carries the heartbeat counter. This heartbeat counter looks similar to the SCN mechanism; "0F 9D02" is the sequence number for the first write.

Also, notice the offset of the pwrite call: 0x04002400. Node solrac1 writes a 512-byte block starting at offset 0x04002400.

"/14:	pwrite(256, 0x019F4A00, 512, 0x04002400)	= 512"
"/14:	   e t o V02\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ n 0F9D02 \ 003\ 0\ 0\ 0"  <-- Counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  F0 xBE L e01EC\ 0 e ;\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 { =BE L1C87A8 L\ 0\ 001\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

The next write call from the local node's CSSD process shows an increase in the counter. The value of that counter in line 3 went up from "0F 9D02" to "10 9D02". BTW, I am removing a few of the printed lines to improve readability. So, the counter is incremented for every heartbeat.

"/14:	pwrite(256, 0x019F4A00, 512, 0x04002400)	= 512"
"/14:	   e t o V02\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ n109D02\ 003\ 0\ 0\ 0" <-- counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
...
"/14:	  F1 xBE L N05EC\ 0 f ;\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 { =BE L1C87A8 L\ 0\ 001\ 0"
...
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

Details: read calls

After the successful write, the CSSD process also reads the blocks maintained by the CSSD processes of the other nodes. For example, in the pread output below, the CSSD process in the solrac1 node is reading the block maintained by CSSD@solrac2 using a pread call. The third line in the listing below has a sequence value of "FB 9702" for the node solrac2. A different sequence number is used by each node.

Also, notice that the offset for the pread call is 0x04002200, and that the offset differs between the pread and pwrite calls. Essentially, node solrac2 is writing its heartbeat starting at offset 0x04002200, and solrac1 is writing its heartbeat at offset 0x04002400. The difference between these two offsets is exactly 0x200, which is 512 bytes.

In a nutshell, node solrac2 maintains its heartbeat disk block at an offset of 0x04002200, and node solrac1 maintains its heartbeat in the next 512-byte disk block.

"/14:	pread(256, 0x019F5000, 512, 0x04002200)		= 512"
"/14:	   e t o V01\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 2\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ nFB9702\ 003\ 0\ 0\ 0" <-- counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  F0 xBE L KD6E9\ 0 /\ t\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 m oBE L1C87A8 L\ 0\ 001\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

The next read from offset 0x04002200 shows that solrac2 also increased its counter, from "FB 9702" to "FC 9702".

"/14:	pread(256, 0x019F5000, 512, 0x04002200)		= 512"
"/14:	   e t o V01\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 2\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ nFC9702\ 003\ 0\ 0\ 0" <-- counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
...
"/14:	  F1 xBE L 5DAE9\ 0 0\ t\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 m oBE L1C87A8 L\ 0\ 001\ 0"
...
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

Summary

In essence, the disk heartbeat is maintained in the voting disk by the CSSD processes. If a node's disk block is not updated within a short timeout period, that node is considered unhealthy and may be fenced/rebooted, depending upon whether the node's set holds quorum (essentially shot in the head), to avoid a split-brain situation.

As evidenced in this blog, there isn't really any user data kept in the voting disk. So, if you lose the voting disks, you can simply add them back without losing any data. But, of course, losing voting disks can lead to node reboots. If you lose all voting disks, then you will have to keep the CRS daemons down; only then can you add the voting disks back.

This blog entry also begs a question about performance: how many I/O calls are performed against these voting disks? As the number of nodes increases, the I/O increases too. For a 2-node RAC, there are 2 reads (CSSD also reads another block, though I am not sure why) and 2 writes per second. With 6 nodes in the cluster, it will be 35 reads and 6 writes per second. From 11g onwards, you can also keep voting disks in ASM.
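If you want to observe this yourself, one simple sketch on Solaris is to watch per-second reads (r/s) and writes (w/s) on the voting disk device; the device name below is hypothetical, so map it to your voting disk first:

# one-second samples; keep the header row plus the voting-disk device row
iostat -xn 1 | egrep 'device|c2t1d0'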

If you wish to read this blog entry in a more conventional format, read orainternals_voting_disks_internals.pdf

Posted in 11g, Oracle database internals, Performance tuning, RAC | 10 Comments »

COLLABORATE 2009 presentation: 11g performance new features

Posted by Riyaj Shamsudeen on May 6, 2009

I just presented on Oracle 11g new features specific to performance at COLLABORATE 2009 in Orlando, Florida. You can download the presentation and paper from here:

11g performance specific new features – presentation
11g performance specific new features – paper

I met a couple of new friends and many familiar faces.

Catherine Devlin introduced the sqlpython tool to me. This tool is a user-friendly replacement for sqlplus, with many UNIX-like goodies, and it will be a useful addition to your command-line tools. You can see installation instructions for sqlpython here.

I also attended an interesting application performance panel discussion with fellow Oakie Mark W. Farnham, Mike Brown, and Sandra Vucinic. We discussed a few customer issues.

One interesting problem discussion is worth mentioning. A client clones the production database to a development database using the full Rapid Clone methodology, but the access plan for a SQL statement differs between production and development, even though everything was supposed to be exactly the same. I am hoping the client will send 10053 trace files from both databases, and I will blog about it if I receive them. I think bind peeking is causing the plan difference, but I need to analyze the 10053 trace files to confirm that. Another possibility is that the Oracle software is not being cloned and some software patches are missing in the cloned environment. I need the trace files to proceed further here.

Of course, I met a few other friends as well, Jeremy Schneider and Dan Norris to mention a couple.

Posted in 11g, CBO, EBS11i, Oracle database internals, Performance tuning, Presentations | Leave a Comment »

Lock table followed by DDL?

Posted by Riyaj Shamsudeen on January 8, 2009

My client's DBA wrote a small script to drop a column from a busy table. The application is mostly OLTP and the table is modified heavily. It is hard to get a lock on the table, even though the DDL to drop the column would be quick, completing within sub-seconds. The following script was tried in the middle of the night so as to minimize production impact.

set serveroutput on size 100000
declare
begin
      -- lock table in exclusive mode 
      execute immediate 'lock table t1 in exclusive mode';
      dbms_output.put_line ('**** Table locked. Dropping column ****' ) ;
      -- If we lock it, drop the column
      execute immediate 'alter table t1 drop column n5 ';
exception
  when others then
     dbms_output.put_line(' No luck ');
     dbms_output.put_line(' SQL code'||sqlcode||','|| 'error '||sqlerrm );
end;
/

The logic of the above script: (a) wait to lock the table in exclusive mode (the intent was to time out after 5 minutes); (b) if we acquire the exclusive lock on the table, drop the column.
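As an aside, 11g provides another way to handle the waiting: the DDL_LOCK_TIMEOUT parameter makes the DDL itself wait (in seconds) for the lock, with no separate LOCK TABLE step. A minimal sketch, reusing the t1/n5 names from the script above (the connect string is hypothetical):

sqlplus -s scott/tiger <<'EOF'
-- wait up to 5 minutes for the exclusive lock needed by the DDL
alter session set ddl_lock_timeout = 300;
alter table t1 drop column n5;
EOF

But back to the script as written.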

Do you see any issues with this script?

So, did we drop that column?
Keep Reading

Posted in 11g, Performance tuning | 15 Comments »

 