Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internal & E-business suite.

Archive for the ‘Oracle database internals’ Category

Video: deep review of LMS

Posted by Riyaj Shamsudeen on January 20, 2012

This video was created circa July 2011. Click the Read More link to review the video. Version Oracle Database 11.2.0.2

Synopsis: Essentially, we probe the importance of LMS processes using DTrace. Explain why LMS should run in elevated priority. How to review deep statistics about LMS processes and much more.

Posted in Oracle database internals, Performance tuning, Presentations, RAC, video | Tagged: , , , , | 7 Comments »

SCN – What, why, and how?

Posted by Riyaj Shamsudeen on January 19, 2012

In this blog entry, we will explore the wonderful world of SCNs and how Oracle database uses SCN internally. We will also explore few new bugs and clarify few misconceptions about SCN itself.

What is SCN?

SCN (System Change Number) is a primary mechanism to maintain data consistency in Oracle database. SCN is used primarily in the following areas, of course, this is not a complete list:

  1. Every redo record has an SCN version of the redo record in the redo header (and redo records can have non-unique SCN). Given redo records from two threads (as in the case of RAC), Recovery will order them in SCN order, essentially maintaining a strict sequential order. As explained in my paper, every redo record has multiple change vectors too.
  2. Every data block also has block SCN (aka block version). In addition to that, a change vector in a redo record also has expected block SCN. This means that a change vector can be applied to one and only version of the block. Code checks if the target SCN in a change vector is matching with the block SCN before applying the redo record. If there is a mismatch, corruption errors are thrown.
  3. Read consistency also uses SCN. Every query has query environment which includes an SCN at the start of the query. A session can see the transactional changes only if that transaction commit SCN is lower then the query environment SCN.
  4. Commit. Every commit will generate SCN, aka commit SCN, that marks a transaction boundary. Group commits are possible too.

SCN format

SCN is a huge number with two components to it: Base and wrap. Wrap is a 16 bit number and base is a 32 bit number. It is of the format wrap.base. When the base exceeds 4 billion, then the wrap is incremented by 1. Essentially, wrap counts the number of  times base wrapped around 4 billion. Few simple SQL script will enumerate this better:
But wait, there’s more!

Posted in 11g, corruption, Oracle database internals, Performance tuning, RAC, recovery | Tagged: , , , , , , , | 28 Comments »

gc cr disk read

Posted by Riyaj Shamsudeen on January 13, 2012

You might encounter RAC wait event ‘gc cr disk read’ in 11.2 while tuning your applications in RAC environment. Let’s probe this wait event to understand why a session would wait for this wait event.

Understanding the wait event

Let’s say that a foreground process running in node 1, is trying to access a block using a SELECT statement and that block is not in the local cache. To maintain the read consistency, foreground process will require the block consistent with the query SCN. Then the sequence of operation is(simplified):

  1. Foreground session calculates the master node of the block; Requests a LMS process running in the master node to access the block.
  2. Let’s assume that block is resident in the master node’s buffer cache. If the block is in a consistent state (meaning block version SCN is lower (or equal?) to query SCN), then LMS process can send the block to the foreground process immediately. Life is not that simple, so, let’s assume that requested block has an uncommitted transaction.
  3. Since the block has uncommitted changes, LMS process can not send the block immediately. LMS process must create a CR (Consistent Read) version of the block: clones the buffer, applies undo records to the cloned buffer rolling back the block to the SCN consistent with the requested query SCN.
  4. Then the CR block is sent to the foreground process.

LMS is a light weight process

Global cache operations must complete quickly, in the order of milli-seconds, to maintain the overall performance of RAC database. LMS is a critical process and does not do heavy lifting tasks such as disk I/O etc. If LMS process has to initiate I/O, instead of initiating I/O, LMS will downgrade the block mode and send the block to the requesting foreground process (this is known as Light Works rule). Foreground process will apply undo records to the block to construct CR version of the block.

Read more

Posted in 11g, Oracle database internals, Performance tuning, Presentations, RAC | Tagged: , | 6 Comments »

RMOUG 2012 – Hello Denver!

Posted by Riyaj Shamsudeen on January 10, 2012

On February 14-16, I’ll be at the Colorado Convention Center in Denver, Colorado for RMOUG’s Training Days Conference. This is the largest regional Oracle User Conference in North America and attracts presenters from all around the country and the globe. I’ll be presenting:

Presentation Name: Troubleshooting RAC Background Process

Abstract: RAC background process performance is critical to keep the application performance. This session will demo techniques to review the performance of RAC background processes such as LMS, LMD, LMON, etc. using various statistics and UNIX tools. The presentation will also discuss why certain background processes must run in higher priority to maintain the application performance in RAC.

Presentation Name: A Kind and Gentle Introduction to RAC

Abstract: This session will introduce basic concepts such as cache fusion, conversion to RAC, protocols for interconnect, general architectural overview, GES layer locks, clusterware, etc. The session will also discuss the srvctl command and demo a few of these commands to improve the understanding.

Presentation Name: Parallel Execution in RAC

Abstract: This presentation will start to discuss and demo parallel server allocation, intra, and inter node parallelism aspects. The session will discuss the new parallelism features such as parallel statement queuing, parallel auto dop, and discuss the interaction of those features with RAC. The session will probe a few critical parameters to improve PQ performance in RAC.

Click here for more information or to register for RMOUG’s Training Days.

Posted in Oracle database internals, Performance tuning, Presentations, RAC | Tagged: , , | Leave a Comment »

Troubleshooting ‘DFS lock handle’ waits

Posted by Riyaj Shamsudeen on November 8, 2011

Waits for ‘DFS lock handle’ can cause massive performance issues in a busy RAC cluster. In this blog entry, we will explore the DFS lock handle wait event, and understand how to troubleshoot the root cause of these waits. I am also going to use locks and resources interchangeably in this blog, but internally, they are two different types of structures.

A little background

DFS (stands for Distributed File System) is an ancient name, associated with cluster file system operations, in a Lock manager supplied by vendors in Oracle Parallel Server Environment (prior name for RAC). But, this wait event has morphed and is now associated with waits irrelevant to database files also. Hence, it is imperative to understand the underlying details to debug the ‘DFS lock handle’ waits.

How does it work?

I have no access to the code, so read this paragraph with caution, as I may have misunderstood my test results: A process trying to acquire a lock on a global GES resource sends a AST(Asynchronous Trap) or BAST (Blocking Asynchronous Trap) message to LCK process, constructing the message with (lock pointer, resource pointer, and resource name) information. If the resource is not available, then the LCK process sends a message to the lock holder for a lock downgrade.

Read more

Posted in 11g, Oracle database internals, Performance tuning, RAC | Tagged: , , , , , , , , , , , | 10 Comments »

Advanced RAC Training

Posted by Riyaj Shamsudeen on July 5, 2011

I will be delivering an intense, advanced 2-week RAC training seminar in Aug 22-26 and Sep 19-23 with numerous demos, dumps and scripts. We will meet 4 hours per day, in a virtual world, 8AM-12 Noon Pacific time for those two weeks. Tanel has written has a great blog post about this seminar series Advanced RAC training.

If you like my presentations about RAC and performance tuning, you will love my seminar series. You will gain better understanding about RAC internals, and you will be able to advance your debugging and performance tuning skills attending my seminar series.

Join me and Let’s explore RAC further.

Posted in Oracle database internals, Performance tuning, Presentations, RAC | Tagged: , , , , | 5 Comments »

Does an UPDATE statement modify the row if the update modifies the column to same value?

Posted by Riyaj Shamsudeen on November 4, 2010

Introduction

If a table column is updated with the same value in a row, does Oracle RDBMS engine modify the data? (or) Does RDBMS engine have an optimization skipping the update, as value of that column is not changing? This was the essence of a question asked in Oracle-l list and I think, it is a good topic for further discussion. Jared Still came up with a fine method to understand this issue measuring redo/undo size. We will explore the same questions with redo log dump method in this blog entry.

Following few lines shows a test case creating a table, an index, and then populating a row in the table.

create table updtest (v1 varchar2(30));

create index updtest_i1 on updtest(v1);

insert into updtest values ('Riyaj');

commit;

REDO records and change vectors

If a row is modified by a SQL statement, Oracle Database generates redo records describing that change. Generated redo records are applied to the database blocks taking the blocks to next version. Direct mode inserts aka nologging inserts are not discussed in this blog entry.

Changes made by the SQL statement might be rolled back too. So, undo records are created in the undo block and these undo records describe how to rollback the SQL changes. Redo records are generated describing the changes to those undo blocks too. Further, these redo records are applied to the data blocks and undo blocks taking the blocks to next version. So, reviewing the redo records generated is sufficient to understand exactly what happens underneath.

To keep the redo log file dumps as clean as possible, we need a database with no activity. I have one such database and I have disabled all automatic jobs to make the redo dumps as clean as possible. It is also important to dump the log file from another session to maintain clarity.

So, in the following script, we will perform a checkpoint, switch log file from session #1. Then, We update the row with the same value from another session and commit. From Session #1, we switch log again from session #1.

Test case #1: Updating column to the same value
============
Session #1:
 alter system checkpoint;
 alter system switch logfile;
Session #2:
 update updtest set v1='Riyaj';
 commit;
Session #1: alter system switch logfile;

Essentially, Last redo log file has the redo records generated by executing the update statement. We need to dump the redo log file to see contents of the log file.

Notice that I am using ‘Riyaj’ as the value to update, as I can search for ASCII representation of my first name in hex easily in the dump file. I mean, Come on, who would not know the ASCII representation of their first name in hex? Here is a way to get Hex values:

select dump('Riyaj',16) Hex from dual

HEX
----------------------------
Typ=96 Len=5: 52,69,79,61,6a

We will use following Script to dump the last redo log file using ‘alter system dump logfile’ syntax:

-------------------------------------------------------------------------------------------  Script : dump_last_log.sql
------------------------------------------------------------------------------------------- This script will dump the last log file.
--   If the log file is big with enormous activity, this might take much resource.
--
--
--  Author : Riyaj Shamsudeen
--  No implied or explicit warranty !
-----------------------------------------------------------------------------------------set serveroutput on size 1000000
declare
  v_sqltext varchar2(255);
begin
 select 'alter system dump logfile '||chr(39)||member||chr(39)  into v_sqltext
  from v$log  lg, v$logfile lgfile
  where lg.group# = lgfile.group# and
  lg.sequence# = (select sequence#-1 from v$log where status='CURRENT' )
  and rownum <2
 ;
  dbms_output.put_line ('Executing :'||v_sqltext);
  execute immediate v_sqltext;
end;
/

Analysis of redo records:

Dumping the redo log file creates a trace file in the user_dump_dest destination. Let’s review the redo records associated with this statement.
Please refer to the output below and I am showing only relevant details. A REDO RECORD contains multiple change vectors and each change vector describe an atomic change to a block. First change vector is for OBJ: 77534 and that object_id is associated with the table UPDTEST. CHANGE #1 specifies how to Update Row Piece (URP): Update the column 0 to value of ’52 69 79 61 6a’ which is ‘Riyaj’ in the data block with Data Block Address (DBA) 0x0100113d associated with the object 77534.

REDO RECORD - Thread:1 RBA: 0x000071.00000002.0010 LEN: 0x021c VLD: 0x0d
SCN: 0x0000.0032b4c9 SUBSCN:  1 11/03/2010 10:51:02

CHANGE #1 TYP:2 CLS:1 AFN:4 DBA:0x0100113d OBJ:77534 SCN:0x0000.0032b40b SEQ:2 OP:11.5 ENC:0 RBL:0
KTB Redo
op: 0x11  ver: 0x01
..
KDO Op code: URP row dependencies Disabled
...
tabn: 0 slot: 0(0x0) flag: 0x2c lock: 1 ckix: 0
ncol: 1 nnew: 1 size: 0
Vector content:
col  0: [ 5]  52 69 79 61 6a <--- 'Riyaj'

Change Vectors 2 and 3 are irrelevant for our discussion, we will ignore it. CHANGE vector #4 is modifying the block with DBA 0x00c00694 which is for the object with object_id 4294967295. Objects with object_id close to 4GB are for undo segments. This change vector holds an undo record. That undo record describes how to rollback the change: Update the row at slot 0 column 0 in the block with DBA 0x0100113d to ’52 69 79 61 6a’.

CHANGE #4 TYP:0 CLS:18 AFN:3 DBA:0x00c00694 OBJ:4294967295 SCN:0x0000.0032b4a6 SEQ:1 OP:5.1 ENC:0 RBL:0
...
Undo type:  Regular undo        Begin trans    Last buffer split:  No
...
KDO undo record:
KTB Redo
op: 0x04  ver: 0x01
compat bit: 4 (post-11) padding: 0
op: L  itl: xid:  0x000a.00d.000005ae uba: 0x00c00996.01a7.2c
                      flg: C---    lkc:  0     scn: 0x0000.0032b3f1
KDO Op code: URP row dependencies Disabled
  xtype: XAxtype KDO_KDOM2 flags: 0x00000080  bdba: 0x0100113d  hdba: 0x0100113a
itli: 1  ispac: 0  maxfr: 4858
tabn: 0 slot: 0(0x0) flag: 0x2c lock: 0 ckix: 0
ncol: 1 nnew: 1 size: 0
Vector content:
col  0: [ 5]  52 69 79 61 6a

In a nutshell, update to the table row with the same value, updated the row value from ‘Riyaj’ to ‘Riyaj’. Even though, supplied value and current row value is the same value, update to the row piece must happen. It is important as both external and internalized triggers need to fire correctly.

But, what happened to the index update?

We have an index on that column v1 and we updated that indexed column too. Did Oracle update the indexed column? NO. If the values are matching at the indexed column level, then the RDBMS code is not updating the index, a redo optimization feature. Only the table row piece is updated and the index is not updated.

Updating to a different value

To explain what happens at the index level, we need to consider a test case that updates the column value to a different value. This test case is similar to the Test case #1 except that we are updating the column value from ‘Riyaj’ to ‘RiyajS’.

Test case #2: Updating to a different value.
============
Session #1:
 alter system checkpoint;
 alter system switch logfile;
Session #2:
 update updtest set v1='RiyajS';
 commit;
Session #1: alter system switch logfile;

First Change vector is updating the table row piece to ‘RiyajS’.

CHANGE #1 TYP:2 CLS:1 AFN:4 DBA:0x0100113d OBJ:77534 SCN:0x0000.0032b4c9 SEQ:2 OP:11.5 ENC:0 RBL:0
...
tabn: 0 slot: 0(0x0) flag: 0x2c lock: 2 ckix: 0
ncol: 1 nnew: 1 size: 1
col  0: [ 6]  52 69 79 61 6a 53 <--RiyajS

CHANGE #3 below is updating the index leaf block. Update to an indexed column value results in delete(s) and insert(s) at the index level. Change #3 is deleting the leaf row and Change Vector #4 is inserting a leaf row.

CHANGE #3 TYP:0 CLS:1 AFN:4 DBA:0x01001153 OBJ:77535 SCN:0x0000.0032b567 SEQ:1 OP:10.4 ENC:0 RBL:0
index redo (kdxlde):  delete leaf row
KTB Redo
...
REDO: SINGLE / -- / --
itl: 2, sno: 0, row size 17

CHANGE vector #4 below is inserting a leaf row in the index leaf block with the key values:“06 52 69 79 61 6a 53 06 01 00 11 3d 00 00″.

06: is the length of key value
52 69 79 51 6a 53: ‘RiyajS’
01 00 11 3d 00 00: ROWID. (notice the dba of the block 0x0100113d). 
CHANGE #4 TYP:0 CLS:1 AFN:4 DBA:0x01001153 OBJ:77535 SCN:0x0000.0032b569 SEQ:1 OP:10.2 ENC:0 RBL:0
index redo (kdxlin):  insert leaf row
...
REDO: SINGLE / -- / --
itl: 2, sno: 1, row size 18
insert key: (14):  06 52 69 79 61 6a 53 06 01 00 11 3d 00 00

Undo change vectors

There are two more change vectors describing the undo block changes. CHANGE vector #6 specifies the undo record to rollback then change at the table block level. Basically, pre-image of the row piece is captured in here.

CHANGE #6 TYP:0 CLS:30 AFN:3 DBA:0x00c00832 OBJ:4294967295 SCN:0x0000.0032b4b1 SEQ:1 OP:5.1 ENC:0 RBL:0
...
KDO undo record:
KTB Redo
...
...
tabn: 0 slot: 0(0x0) flag: 0x2c lock: 0 ckix: 0
ncol: 1 nnew: 1 size: -1
col  0: [ 5]  52 69 79 61 6a

Change #7 and Change #8 specifies how to undo the changes at the index level. To undo the change at the index level, do a delete of current index entry and insert the older image of the index entry. Change #8 specifies to purge the leaf row with the key value ’06 52 69 79 61 6a 53 06 01 00 11 3d 00 00′ [ RiyajS + rowid combo].

CHANGE #7 TYP:0 CLS:30 AFN:3 DBA:0x00c00832 OBJ:4294967295 SCN:0x0000.0032b569 SEQ:1 OP:5.1 ENC:0 RBL:0
...
Undo type:  Regular undo       Undo type:  Last buffer split:  No
...
index undo for leaf key operations
...
(kdxlre): restore leaf row (clear leaf delete flags)
key ( 13):  05 52 69 79 61 6a 06 01 00 11 3d 00 00

Change #7 specifies to restore the leaf row with the key value ’05 52 69 79 61 6a 06 01 00 11 3d 00 00′ [Riyaj + rowid combo].

CHANGE #8 TYP:0 CLS:30 AFN:3 DBA:0x00c00832 OBJ:4294967295 SCN:0x0000.0032b569 SEQ:2 OP:5.1 ENC:0 RBL:0
...
Undo type:  Regular undo       Undo type:  Last buffer split:  No
...
index undo for leaf key operations
...
(kdxlpu): purge leaf row
key ( 14):  06 52 69 79 61 6a 53 06 01 00 11 3d 00 00

In a nutshell, updating an indexed column with a different column value, deletes current entry from the index leaf block and inserts an entry in the index leaf block with an updated column value. In addition, two change vectors are also added describing how to undo the change at the leaf block.

Wait, did RDBMS engine really delete from the index leaf block?

No, entries are not physically deleted from the index leaf block. Rather, entry with the old value is marked with a D flag and the new entry is added to the leaf block with updated value. This is visible dumping the index leaf block of the index UPDTEST_I1. Please review the lines from the leaf block dump shown below: Row #0 is for the value ‘Riyaj’ and Row #1 is for the value ‘RiyajS’. Row #0 is marked with a D flag indicating that entry as a deleted entry. Row #1 is the active current entry.

-- (Change history #2) Converting DBA 0x01001153 to decimal yields 16781651. 
-- You can use calculator for this conversion.
-- you can also use the following method:
-- select to_number('01001153','xxxxxxxx') from dual;

TO_NUMBER('01001153','XXXXXXXX
------------------------------
                      16781651

--converting from dba to file#, block#.
undef dba
select
dbms_utility.DATA_BLOCK_ADDRESS_FILE(&&dba)||','||
dbms_utility.DATA_BLOCK_ADDRESS_BLOCK(&&dba)
from dual
/

Enter value for dba: 16781651
4,4435

Alter system dump datafile 4 block min 4435 block max 4435;

Trace file contents:
===================
row#0[7959] flag: ---D--, lock: 2, len=29
col 0; len 5; (5):  52 69 79 61 6a
col 1; len 6; (6):  00 00 00 00 00 00
col 2; len 6; (6):  01 00 11 3d 00 07
col 3; len 6; (6):  c0 91 86 99 e8 05

row#1[7905] flag: ------, lock: 2, len=30
col 0; len 6; (6):  52 69 79 61 6a 53
col 1; len 6; (6):  00 00 00 00 00 00
col 2; len 6; (6):  01 00 11 3d 00 07
col 3; len 6; (6):  c0 91 86 99 e8 05

Summary

If you have skipped all these hex dumps, I don’t blame you . But, if you have followed along, right on! To summarize:

  1. Updating the column value to the same value modifies the table block. Row piece in the table block is physically modified. But, the corresponding index entry is not modified.
  2. Updating the column value to a different value modifies both table and index blocks. Updating an indexed column value results in a delete and insert of index entries in the index leaf block.
  3. Delete of an index entry does not delete the entry physically, rather marks the entry with a D flag. Of course, future inserts in to this block will reuse the entry at some point in time. So, you don’t necessarily lose that space permanently.

Of course, there are other scenarios such as index leaf block split, branch block split etc, not covered in this blog entry.
Thanks to Mark Bobak and Jared Still for reviewing this blog entry and providing valuable contributions.

You can read this blog in a more conventional format as update to column value_v2 .

Version: Oracle Database release 11.2.0.1
Change history #2: Fixed a typo with block number. Thanks to Sreejith Sreekantan for pointing out the error.

Posted in Oracle database internals, Performance tuning, recovery | Tagged: , , , , , , , , | 7 Comments »

What’s in a voting disk?

Posted by Riyaj Shamsudeen on October 29, 2010

Introduction

In RAC, CSSD processes (Cluster Services Synchronization Daemon) monitor the health of RAC nodes employing two distinct heart beats: Network heart beat and Disk heart beat. Healthy nodes will have continuous network and disk heartbeats exchanged between the nodes. Break in heart beat indicates a possible error scenario. There are few different scenarios possible with missing heart beats:

  1. Network heart beat is successful, but disk heart beat is missed.
  2. Disk heart beat is successful, but network heart beat is missed.
  3. Both heart beats failed.

In addition, with numerous nodes, there are other possible scenarios too. Few possible scenarios:

  1. Nodes have split in to N sets of nodes, communicating within the set, but not with members in other set.
  2. Just one node is unhealthy.

Nodes with quorum will maintain active membership of the cluster and other node(s) will be fenced/rebooted. I can’t discuss all possible scenarios in a blog entry, so we will discuss a simplistic 2-node single voting disk alone here.
Voting disks are used to monitor the disk heart beats. It is preferable to have at least 3 voting disks or odd number of voting disks greater than or equal to 3.

CSSD is a multi-threaded process

Voting disks are shared between the nodes and should be visible from all nodes, stating the obvious. CSSD process is a multi-threaded process and a thread of the CSSD process monitors the disk heart beat. The disk HB (Heart Beat) thread is scheduled approximately every second and that thread verifies the disk heart beat from all active nodes in the cluster. Also, another thread of CSSD monitors the network heart beat. Pstack (Solaris) of CSSD process would show the threads of CSSD process.

Details: write calls

CSSD process in each RAC node maintains it heart beat in a block of size 1 OS block, in the voting disk. In Solaris VM that I was testing, OS block size is 512 bytes (We will discuss just Solaris alone in this post). In addition to maintaining its own disk block, CSSD processes also monitors the disk blocks maintained by the CSSD processes running in other cluster nodes.

CSSD process writes a 512 block to the voting disk in a specific offset. The written block has a header area with the node name. For example, in the pwrite call below, node name of solrac1 is in the first few bytes of the block. Third line printed below is keeping track of heart beat counter. This heart beat counter looks similar to the SCN mechanism, “0F 9D02″ is the sequence number for the first write.

Also, Notice that the offset for the pwrite call 0×04002400. Node solrac1 writes a 512 byte block starting at the offset 0×04002400.

"/14:	pwrite(256, 0x019F4A00, 512, 0x04002400)	= 512"
"/14:	   e t o V02\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ n 0F9D02 \ 003\ 0\ 0\ 0"  <-- Counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  F0 xBE L e01EC\ 0 e ;\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 { =BE L1C87A8 L\ 0\ 001\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

Next write call from the local node CSSD process shows an increase in the counter. Value of that counter in line 3 went up from “0F 9D02″ -> “10 9D02″. BTW, I am removing few lines printed with to improve readability. So, the counter is incremented for every heart beat.

"/14:	pwrite(256, 0x019F4A00, 512, 0x04002400)	= 512"
"/14:	   e t o V02\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ n109D02\ 003\ 0\ 0\ 0" <-- counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
...
"/14:	  F1 xBE L N05EC\ 0 f ;\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 { =BE L1C87A8 L\ 0\ 001\ 0"
...
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

Details: read calls

After the successful write, CSSD process also reads the blocks maintained by CSSD processes from other nodes. For example, in the pread output below, CSSD process in the solrac1 node is reading the block maintained by CSSD@solrac2 node using a pread call. Third line in the listing below has a sequence value of “FB 9702″ for the node solrac2. A different sequence number is used by each node.

Also, Notice the offset for the pread call is 0×04002200 and that offset is different between pread and pwrite calls. Essentially, Node solrac2 is writing its heart beat starting at offset 0×04002200 and solrac1 is writing its heart beat at an offset of 0×04002400. The difference between these two offset values are exactly 0×200, which is 512 bytes.

In a nutshell, node Solrac2 maintains the heart beat disk block at an offset of 0×04002200 and the node solrac1 maintains the heart beat in the next 512 byte disk block.

"/14:	pread(256, 0x019F5000, 512, 0x04002200)		= 512"
"/14:	   e t o V01\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 2\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ nFB9702\ 003\ 0\ 0\ 0" <-- counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  F0 xBE L KD6E9\ 0 /\ t\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 m oBE L1C87A8 L\ 0\ 001\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

Next read from the offset 0×04002200 shows that solrac2 also increased the counter from “FB 9702″ to “FC 9702″.

"/14:	pread(256, 0x019F5000, 512, 0x04002200)		= 512"
"/14:	   e t o V01\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 2\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ nFC9702\ 003\ 0\ 0\ 0" <-- counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
...
"/14:	  F1 xBE L 5DAE9\ 0 0\ t\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 m oBE L1C87A8 L\ 0\ 001\ 0"
...
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

Summary

In essence, disk heart beat is maintained in the voting disk by the CSSD processes. If the disk block is not updated in a short timeout period, that node is considered unhealthy and may be rebooted depending upon quorum of that node(or Shot in the head) to avoid split brain situation.

As evidenced in this blog, there isn’t really any useful data kept in the voting disk. So, if you lose voting disks, you can simply add them back without losing any data. But, of course, losing voting disks can lead to node reboots. If you lose all voting disks, then you will have to keep the CRS daemons down, then only you can add the voting disks.

This blog entry also begs the question about performance. How many I/O calls are performed against these voting disks? As the number of nodes increases, I/O also increases. For 2 node RAC, there are 2 reads (CSSD also reads another block, not sure why though) and 2 writes per second. With 6 nodes in the cluster, it will be 35 reads and 6 writes per second. From 11g onwards, you could keep voting disks in ASM.

If you wish to read this blog entry in more conventional format read orainternals_voting_disks_internals.pdf

Posted in 11g, Oracle database internals, Performance tuning, RAC | Tagged: , , , , , , , | 10 Comments »

Group by Hash aggregation

Posted by Riyaj Shamsudeen on September 30, 2010

So, Here I was merrily enjoying OpenWorld 2010 presentations in SFO, I got a call from a client about a performance issue. Client recently upgraded from Version 9i to Version 10g in an E-Business environment. I had the privilege of consulting before the upgrade, so we setup the environment optimally, and upgrade itself was seamless. Client did not see much regression except One query: That query was running for hours in 10g compared to 15 minutes in 9i.

Review and Analysis

Reviewed the execution plan in the development database and I did not see any issues with the plan. Execution plan in development and production looked decent enough. I wasn’t able to reproduce the issue in the development database either. So, the client allowed me to trace the SQL statement in the production database. Since the size of data in few tables is different between production and development databases, we had to analyze the problem in production environment.

I had to collect as much data possible as the tracing was a one-time thing. I setup a small script to get process stack and process memory area of that Unix dedicated server process to collect more details, in addition to tracing the process with waits => true.

Execution plan from the production database printed below. [ Review the execution plan carefully, it is giving away the problem immediately.] One execution of this statement took 13,445 seconds and almost all of it spent in the CPU time. Why would the process consume 13,719 seconds of CPU time?. Same process completed in just 15 minutes in 9i, as confirmed by Statspack reports. [ As a side note, We collected enormous amount of performance data in 9i in the Production environment before upgrading to 10g, just so that we can quickly resolve any performance issues, and you should probably follow that guideline too]. That collection came handy and It is clear that SQL statement was completing in 15 minutes in 9i and took nearly 3.75 hours after upgrading the database to version 10g.

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch       10  13719.71   13445.94         27    5086407          0       99938
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       12  13719.71   13445.94         27    5086407          0       99938

     24   HASH GROUP BY (cr=4904031 pr=27 pw=0 time=13240600266 us)
     24    NESTED LOOPS OUTER (cr=4904031 pr=27 pw=0 time=136204709 us)
     24     NESTED LOOPS  (cr=4903935 pr=27 pw=0 time=133347961 us)
 489983      NESTED LOOPS  (cr=3432044 pr=27 pw=0 time=104239982 us)
 489983       NESTED LOOPS  (cr=2452078 pr=27 pw=0 time=91156653 us)
 489983        TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=1472112 pr=27 pw=0 time=70907109 us)
 489983         INDEX RANGE SCAN HR_LOCATIONS_UK2 (cr=981232 pr=0 pw=0 time=54338789 us)(object id 43397)
 489983        INDEX UNIQUE SCAN MTL_PARAMETERS_U1 (cr=979966 pr=0 pw=0 time=17972426 us)(object id 37657)
 489983       INDEX UNIQUE SCAN HR_ORGANIZATION_UNITS_PK (cr=979966 pr=0 pw=0 time=10876601 us)(object id 43498)
     24      INDEX RANGE SCAN UXPP_FA_LOCATIONS_N3 (cr=1471891 pr=0 pw=0 time=27325172 us)(object id 316461)
     24     TABLE ACCESS BY INDEX ROWID PER_ALL_PEOPLE_F (cr=96 pr=0 pw=0 time=2191 us)
     24      INDEX RANGE SCAN PER_PEOPLE_F_PK (cr=72 pr=0 pw=0 time=1543 us)(object id 44403)

pstack, pmap, and truss

Reviewing pstack output generated from the script shows many function calls kghfrempty, kghfrempty_ex, qerghFreeHashTable etc, implying hash table operations. Something to do with hash table consuming time?

 ( Only partial entries shown ) 
 0000000103f41528 kghfrempty 
 0000000103f466ec kghfrempty_ex 
 0000000103191f1c qerghFreeHashTable 
 000000010318e080 qerghFetch 
 00000001030b1b3c qerstFetch 
...
 0000000103f41558 kghfrempty 
 0000000103f466ec kghfrempty_ex 
 0000000103191f1c qerghFreeHashTable 
 000000010318e080 qerghFetch
 00000001030b1b3c qerstFetch

Truss of the process also showed quite a bit of mmap calls. So, the process is allocating more memory to an hash table?

...
mmap(0xFFFFFFFF231C0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xFFFFFFFF231C0000
...
pollsys(0xFFFFFFFF7FFF7EC8, 1, 0xFFFFFFFF7FFF7E00, 0x00000000) = 0
mmap(0xFFFFFFFF231D0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xFFFFFFFF231D0000
...

Execution plan again ..

Reviewing the execution plan again showed an interesting issue. I am going to post only two relevant lines from the execution plan below. As you can see that elapsed time at NESTED LOOPS OUTER step is 136 seconds. But the elapsed time at the next HASH GROUP BY step is 13240 seconds, meaning nearly 13,100 seconds spent in the HASH GROUP BY Step alone! Why would the process spend 13,100 seconds in a group by operation? Actual SQL execution took only 136 seconds, but the group by operation took 13,100 seconds. That doesn’t make sense, Does it?

     24   HASH GROUP BY (cr=4904031 pr=27 pw=0 time=13240600266 us)
     24    NESTED LOOPS OUTER (cr=4904031 pr=27 pw=0 time=136204709 us)
...

OFE = 9i

Knowing that time is spent in the Group by operation and that the 10g new feature Hash Grouping method is in use, I decided to test this SQL statement execution with 9i optimizer. The SQL completed in 908 seconds with OFE(optimizer_features_enabled) set to 9.2.0.8 (data is little bit different since production is an active environment). You can also see that SORT technique is used to group the data.

alter session set optimizer_features_enabled=9.2.0.8;

Explain plan :
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch   106985    887.41     908.25     282379    3344916        158     1604754
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   106987    887.41     908.25     282379    3344916        158     1604754

      4   SORT GROUP BY (cr=2863428 pr=0 pw=0 time=37934456 us)
      4    NESTED LOOPS OUTER (cr=2863428 pr=0 pw=0 time=34902519 us)
      4     NESTED LOOPS  (cr=2863412 pr=0 pw=0 time=34198726 us)
 286067      NESTED LOOPS  (cr=2003916 pr=0 pw=0 time=24285794 us)
 286067       NESTED LOOPS  (cr=1431782 pr=0 pw=0 time=19288024 us)
 286067        TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=859648 pr=0 pw=0 time=13568456 us)
 286067         INDEX RANGE SCAN HR_LOCATIONS_UK2 (cr=572969 pr=0 pw=0 time=9271380 us)(object id 43397)
 286067        INDEX UNIQUE SCAN MTL_PARAMETERS_U1 (cr=572134 pr=0 pw=0 time=4663154 us)(object id 37657)
...

Knowing the problem is in the GROUP BY step, we setup a profile with _gby_hash_aggregation_enabled set to FALSE, disabling the new 10g feature for that SQL statement. With the SQL profile, performance of the SQL statement is comparable to pre-upgrade timing.

This almost sounds like a bug! Bug 8223928 is matching with this stack, but it is the opposite. Well, client will work with the support to get a bug fix for this issue.

Summary

In summary, you can use scientific methods to debug performance issues. Revealing the details underneath, will enable you to come up with a workaround quickly, leading to a faster resolution.
Note that, I am not saying hash group by feature is bad. Rather, we seem to have encountered an unfortunate bug which caused performance issues at this client. I think, Hash Grouping is a good feature as the efficiency of grouping operations can be improved if you have ample amount of memory. That’s the reason why we disabled this feature at the statement level, NOT at the instance level.
This blog in a traditional format hash_group_by_orainternals

Update 1:

I am adding a script to capture pmap and pstack output in a loop for 1000 times, with 10 seconds interval. Tested in Oracle Solaris.

#! /bin/ksh
 pid=$1
 (( cnt=1000 ))
 while  [[ $cnt -gt 0 ]];
  do
        date
        pmap -x $pid
        pstack $pid
        echo $cnt
        (( cnt=cnt-1 ))
        sleep 10
  done

To call the script: assuming 7887 is the UNIX pid of the process.
nohup ./pmap_loop.ksh 7887 >> /tmp/a1.lst 2>>/tmp/a1.lst &

Syntax for the truss command is given below. Please remember, you can’t use pmap, pstack and truss concurrently. These commands stops the process (however short that may be!) and inspects them, so use these commands sparingly. [ I had a client who used to run truss on LGWR process on a continuous(!) basis and database used to crash randomly!]. I realize that pmap/pstack/truss can be scripted to work together, but that would involve submitting a background process for the truss command and killing that process after a small timeout window. That would be a risky approach in a Production environment and So, I prefer to use truss command manually and CTRL+C it after few seconds.

truss -d -E -o /tmp/truss.lst -p 7887

I can not stress enough, not to overuse these commands in a Production environment. Command strace( Linux), tusc (HP) are comparable commands of truss(Solaris).

Posted in CBO, EBS11i, Oracle database internals, Performance tuning | Tagged: , , , , , | 8 Comments »

gc buffer busy waits

Posted by Riyaj Shamsudeen on September 27, 2010

If you have the opportunity to work in a RAC environment, you probably encountered (or you will encounter soon :-) ) this wait event: ‘GC Buffer busy’. We will explore issues leading to excessive waits for this wait events and how to resolve the issue effectively.

What is a GC buffer busy wait?

In a simple sense, GC buffer busy means that the buffer in the buffer cache, that the session is trying to access is already involved in another ongoing global cache operation. Until that global cache operation completes, session must wait. I will explain this with an example: Let’s say that session #1 is trying to access a block of file #7 block ID 420. That block is in the remote cache and so, session #1 opened a BL lock on that block, requested the remote LMS process to send the block, and waiting for the block shipping to complete. Session #2 comes along shortly thereafter and tries to access the same buffer. But, the block is already involved in a global cache operation and so, session #2 must wait for the session #1 to complete GC (Global Cache) activity before proceeding. In this case, Session #2 will wait for ‘gc buffer busy’ wait event with a time-out and repeatedly tries to access that buffer in a loop.

Consider the scenario if the block is a hot block such as segment header, index branch block or transaction table header block etc. In this case, you can see that many such sessions waiting for the ‘Gc buffer busy’ wait event. This can lead to complex wait scenario quickly as few background processes also can wait for ‘gc buffer busy’ event leading to an eventual database hang situation. If you kill the processes, then pmon might need to access that block to do a rollback, which means that pmon can get stuck waiting for ‘gc buffer busy’ waits too.

Few Scenarios

This wait event can occur for many different reasons, including bugs. For example, I encountered a bug in which the index branch block split can cause excessive ‘gc buffer busy’ waits. So, It is not possible to document all scenarios that can lead to gc buffer busy waits. But, it is worth exploring few most common scenarios, and then discuss a mitigation plans for those scenarios. The methods discussed here will be helpful to understand which types of blocks are involved in this issue too.

1. Right hand growth indexes

Typically, applications generate surrogate keys using a sequence based key generation, an example would be employee_id column in the employee table. These types of unique or primary key columns are usually populated using a sequence generated value with a unique constraint on the column. A unique index may be created to support the unique constraint. Although, it is possible to create a non-unique index to support the unique constraint, and that non-unique index also will suffer from the issues similar to its unique counterparts. This problem is related to more about uniqueness of data and locality of new rows, rather than the type of index.

Indexes store sorted (key[s], ROWID) pair, meaning values in the index are arranged in an ascending or descending key column order. ROWIDs in the (key[s], ROWID) pair points to a specific row in the table segment with that index column value. Also, Indexes are implemented as B-Tree indexes. In the case of unique indexes, on columns populated by sequence based key values, recent entries will be in the right most leaf block of the B-Tree. All new rows will be stored in the right most leaf block of the index. As more and more sessions insert rows in to the table, that right most leaf block will be full. Oracle will split that right most leaf block in to two leaf blocks: One block with all rows except one row and a new block with just one row. (This split, aka Index 90-10 split, needs to modify branch block also ). Now that new leaf block becomes the right most leaf block and the concurrency moves to the new leaf block. Simply put, you can see concurrency issues moving from one block to another block in an orderly fashion.

As you can imagine, all sessions inserting in to the table will insert rows in to the current right most leaf block of the index. This type of index growth termed as “Right Hand Growth” Indexes. As sessions inserts in to the right most leaf block of the index, that index becomes hot block, and concurrency on that leaf block leads to performance issues.

In RAC, this problem is magnified. Sequence cache is instance specific and if the cache is small (defaults to 20), then the right most leaf block becomes hot block, not just in one instance, but across all instances. That hot – right most – leaf block will be transferred back and forth between the instances. If the block is considered busy, then LMS process might induce more delays in transferring the blocks between the instances. While the block is in transit, then the sessions accessing that block must wait on ‘GC buffer busy’ waits and this quickly leads to excessive GC buffer busy waits. Also, immediate branch block of those right most leaf block will play a role in the waits during leaf block splits.

So, how bad can it get? It can be very bad. A complete database hang situation is a possibility. Notice below that over 1000 sessions were waiting for ‘gc buffer busy’ events across the cluster. Application is completely down.


   INST_ID SQL_ID        EVENT              STATE    COUNT(*)
---------- ------------- ------------------ -------- --------
         4 4jtbgawt37mcd gc cr request     WAITING    9
         3 4jtbgawt37mcd gc cr request     WAITING    9
         3 a1bp5ytvpfj48 gc buffer busy    WAITING   11
         4 a1bp5ytvpfj48 gc buffer busy    WAITING   17
         4 14t0wadn1t0us gc buffer busy    WAITING   33
         4 gt1rdqk2ub851 gc buffer busy    WAITING   34
         4 a1bp5ytvpfj48 buffer busy waits WAITING   35
         2 a1bp5ytvpfj48 gc buffer busy    WAITING   65
         1 a1bp5ytvpfj48 gc buffer busy    WAITING  102
         2 7xzqcrdrnyw1j gc buffer busy    WAITING  106
         2 7xzqcrdrnyw1j enq: TX - index c WAITING  173
         1 7xzqcrdrnyw1j gc buffer busy    WAITING  198
         3 7xzqcrdrnyw1j gc buffer busy    WAITING  247
         4 7xzqcrdrnyw1j gc buffer busy    WAITING  247

How do you analyze the problem with right hand key indexes?

First, we need to verify that problem is due to right hand indexes. If you have access to ASH data, it is easy. For all sessions waiting for ‘gc buffer busy’ event query the current_obj#. Following query on ASH can provide you with the object_id involved in these waits. Also, make sure the statement is UPDATE or INSERT statements, not SELECT statement[ SELECT statements are discussed below].

select sample_time,  sql_id, event, current_obj#,sum (cnt)  from  gv$active_session_history
   where sample_time between  to_date ('24-SEP-2010 14:28:00','DD-MON-YYYY HH24:MI:SS') and
     to_date ('24-SEP-2010 14:29:59','DD-MON-YYYY HH24:MI:SS')
    group by  sample_time,  sql_id, event, current_obj#
   order by sample_time
/

SAMPLE_TIME                             |SQL_ID              |EVENT                                   |CURRENT_OBJ#|  COUNT(*)
----------------------------------------|--------------------|----------------------------------------|------------|----------
…
26-AUG-10 02.28.18.052 PM               |14t0wadn1t0us       |gc buffer busy                          |       8366|        33
..
select owner, object_name, object_type from dba_objects where object_id=8366 or data_object_id=8366;

In this example, current_obj# is 8366, which we can query the dba_objects to find the correct object_id. If this object is an unique index or almost unique index, then you might be running in to a right hand growth indexes.

If you don’t have access to ASH then, you need to sample gv$session_wait (or gv$session from 10g), group by p1, p2 to identify the blocks inducing ‘gc buffer busy’ waits. Then, map those blocks to objects suffering from the issue.

select event,    p1, p2, count(*) from  gv$session s
where
event ='gc buffer busy' and state='WAITING'
group by event, p1, p2
order by 4
/

You can also use my script segment_stats_delta.sql to see the objects suffering from ‘gc buffer busy’ waits. See below for an example use:

@segment_stats_delta.sql
segment_stats_delta.sql v1.01 by Riyaj Shamsudeen @orainternals.com

...Prints Change in segment statistics in the past N seconds.
...Default collection period is 60 seconds.... Please wait for at least 60 seconds...

!!! NOTICE !!! This scripts drops and recreates two types: segment_stats_tbl_type and segment_stats_type.

Following are the available statistics:
Pay close attention to sampled column below. Statistics which are sampled are not exact and so, it might not reflect immediately.

NAME                                                             SAM
---------------------------------------------------------------- ---
logical reads                                                    YES
buffer busy waits                                                NO
gc buffer busy                                                   NO
db block changes                                                 YES
...
gc cr blocks received                                            NO
...
Enter value for statistic_name: gc buffer busy
Enter value for sleep_duration: 60
WSH                            | WSH_DELIVERY_DETAILS_U1        | 34
APPLSYS                        | WF_ITEM_ATTRIBUTE_VALUES_PK    | 2
WSH                            | WSH_DELIVERY_DETAILS_U1        | 2
INV                            | MTL_MATERIAL_TRANSACTIONS_U1   | 1
ONT                            | OE_ORDER_LINES_ALL             | 1

PL/SQL procedure successfully completed.

How do you resolve ‘GC buffer busy’ waits due to right hand growth index?

In a simplistic sense, You need to reduce the concurrency on the right most leaf block. There are few options to reduce the concurrency, and hash partitioning that unique index (or almost unique index) is a better solution of all. For example, if we convert the unique index as hash partitioned index with 32 partitions, then you are reducing the concurrency on that right most leaf block by 32 fold. Why? In hash partitioning scheme with 32 partitions, there are 32 index trees and inserts will be spread across 32 right most leaf blocks. In contrast, there is just 1 index tree in the case of non-partitioned index. Essentially, Each partition gets its own index tree. Given an ID column value, that row is always inserted in to a specific partition and that partition is identified by applying hash function over the partitioning column.

In the case of non-partitioned index, for example, values from 1000 to 1010 will be stored in the right most leaf block of the index. In the case of partitioned index with 32 partitions, value 1000 will be stored in partition 24, value 1001 will be stored in partition 19, meaning, values are hashed and spread around 32 partitions leading to improved concurrency. This will completely eliminate Right hand Growth index concurrency issue.

Non-partitioned index

Hash partitioned index with 2 partitions

So, What is the drawback of hash partitioning indexes?

If your query needs to do range scan with a predicate similar to ‘id between 1000 and 1005′, then the index range scan will need to scan 32 index trees, instead of one index tree as in the case of non-partitioned tables. Unless, your application executes these sort of queries millions of times, you probably wouldn’t notice the performance difference. For equality predicate, such as ‘id=:B1′, this is not an issue as the database will need to scan just one index tree.

Let’s discuss about reverse key indexes too. As Michael Hallas and Greg Rahn of real-world performance group said (and I happily concur), reverse key indexes are evil. If you are in Oracle Database version 10g, you can create a partitioned index on a non-partitioned table. So, If your application is suffering from a right hand growth index contention issue, you can convert the non-partitioned index to a hash-partitioned index with minimal risk. So, there aren’t many reasons to use reverse key indexes in 10g [ Remember that range scan is not allowed in the reverse key indexes either]. But, if you are unfortunate enough to support Oracle 9i database, you can NOT create a partitioned index on a non-partitioned table. If your application suffers from right hand growth index concurrency issues in 9i, then your options may be limited to reverse key indexes (or playing with sequences and code change or better yet – upgrade to 10g).

2. What if the statement is a SELECT statement?

It is possible for the SELECT statement to suffer from gc buffer busy waits too. If you encounter a scenario in which the object is an index and the statement is a SELECT statement, then this paragraph applies to you. This issue typically happens if there is higher concurrency on few blocks. For example, excessive index full scan on an index concurrently from many instances can cause ‘gc buffer busy’ waits. In this case, right approach would be tune the SQL statement to avoid excessive access to those index blocks.

In my experience, gc buffer busy waits on SELECT statement generally happens if you have problems with statistics and execution plans. So, verify that execution plan or concurrency didn’t change recently.

3. Freelists, Freelist groups

What if the object ID we queried in Active Session History belongs to a table block and the statement is an INSERT statement? We need to check to see if that block is a segment header block. If there are many concurrent inserts, and if you don’t use ASSM tablespace, then the inserts need to find free blocks. Segment header of an object stores the free list [ if you don't use freelist groups]. So, concurrent inserts in to a table will induce excessive activity on the Segment header block of that table leading to ‘gc buffer busy’ waits.

Concurrent inserts in to a table with 1 freelist/1 freelist groups will also have contention in a non-segment header block too. When the session(s) searches for a free block in a freelist, all those sessions can get one or two free blocks to insert. This can lead to contention on that block.

Right approach in this case is to increase freelists, free list groups and initrans on those objects (and might need reorg for these parameters to take effect ). Better yet, use ASSM tablespaces to avoid these issues.

4. Other Scenarios

We discussed just few common issues. But, ‘gc buffer busy’ waits can happen for many reasons. Few of them are: CPU starvation issues, Swapping issues, interconnect issues etc. For example, if the process that opened the request for a block did not get enough CPU, then it might not drain the network buffers to copy the buffer to buffer cache. Other sessions accessing that buffer will wait on ‘gc buffer busy’ waits. Or If there is a network issue and the Global cache messages are slower, then it might induce higher gc buffer busy waits too. Statistics ‘gc lost packets’ is a good indicator for network issues, but not necessarily a complete indicator.

As a special case, if the sequences have been kept with lower cache value, then blocks belonging to seq$ table can be an issue too. Increasing cache for highly used sequence should be considered as a mandatory task in RAC.

Summary

In summary, next time you encounter this issue, drill down to see which object types and statements are involved. Debug to understand the root cause as ‘gc buffer busy’ waits are usually symptoms. You can read this blog in the traditional format as gc buffer busy waits_orainternals.pdf

Posted in Oracle database internals, Performance tuning, RAC | Tagged: , , , , | 13 Comments »

 
Follow

Get every new post delivered to your Inbox.

Join 210 other followers