Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internal & E-business suite.

Posts Tagged ‘oracle performance’

gc buffer busy waits

Posted by Riyaj Shamsudeen on September 27, 2010

If you have the opportunity to work in a RAC environment, you probably encountered (or you will encounter soon :-) ) this wait event: ‘GC Buffer busy’. We will explore issues leading to excessive waits for this wait events and how to resolve the issue effectively.

What is a GC buffer busy wait?

In a simple sense, GC buffer busy means that the buffer in the buffer cache, that the session is trying to access is already involved in another ongoing global cache operation. Until that global cache operation completes, session must wait. I will explain this with an example: Let’s say that session #1 is trying to access a block of file #7 block ID 420. That block is in the remote cache and so, session #1 opened a BL lock on that block, requested the remote LMS process to send the block, and waiting for the block shipping to complete. Session #2 comes along shortly thereafter and tries to access the same buffer. But, the block is already involved in a global cache operation and so, session #2 must wait for the session #1 to complete GC (Global Cache) activity before proceeding. In this case, Session #2 will wait for ‘gc buffer busy’ wait event with a time-out and repeatedly tries to access that buffer in a loop.

Consider the scenario if the block is a hot block such as segment header, index branch block or transaction table header block etc. In this case, you can see that many such sessions waiting for the ‘Gc buffer busy’ wait event. This can lead to complex wait scenario quickly as few background processes also can wait for ‘gc buffer busy’ event leading to an eventual database hang situation. If you kill the processes, then pmon might need to access that block to do a rollback, which means that pmon can get stuck waiting for ‘gc buffer busy’ waits too.

Few Scenarios

This wait event can occur for many different reasons, including bugs. For example, I encountered a bug in which the index branch block split can cause excessive ‘gc buffer busy’ waits. So, It is not possible to document all scenarios that can lead to gc buffer busy waits. But, it is worth exploring few most common scenarios, and then discuss a mitigation plans for those scenarios. The methods discussed here will be helpful to understand which types of blocks are involved in this issue too.

1. Right hand growth indexes

Typically, applications generate surrogate keys using a sequence based key generation, an example would be employee_id column in the employee table. These types of unique or primary key columns are usually populated using a sequence generated value with a unique constraint on the column. A unique index may be created to support the unique constraint. Although, it is possible to create a non-unique index to support the unique constraint, and that non-unique index also will suffer from the issues similar to its unique counterparts. This problem is related to more about uniqueness of data and locality of new rows, rather than the type of index.

Indexes store sorted (key[s], ROWID) pair, meaning values in the index are arranged in an ascending or descending key column order. ROWIDs in the (key[s], ROWID) pair points to a specific row in the table segment with that index column value. Also, Indexes are implemented as B-Tree indexes. In the case of unique indexes, on columns populated by sequence based key values, recent entries will be in the right most leaf block of the B-Tree. All new rows will be stored in the right most leaf block of the index. As more and more sessions insert rows in to the table, that right most leaf block will be full. Oracle will split that right most leaf block in to two leaf blocks: One block with all rows except one row and a new block with just one row. (This split, aka Index 90-10 split, needs to modify branch block also ). Now that new leaf block becomes the right most leaf block and the concurrency moves to the new leaf block. Simply put, you can see concurrency issues moving from one block to another block in an orderly fashion.

As you can imagine, all sessions inserting in to the table will insert rows in to the current right most leaf block of the index. This type of index growth termed as “Right Hand Growth” Indexes. As sessions inserts in to the right most leaf block of the index, that index becomes hot block, and concurrency on that leaf block leads to performance issues.

In RAC, this problem is magnified. Sequence cache is instance specific and if the cache is small (defaults to 20), then the right most leaf block becomes hot block, not just in one instance, but across all instances. That hot – right most – leaf block will be transferred back and forth between the instances. If the block is considered busy, then LMS process might induce more delays in transferring the blocks between the instances. While the block is in transit, then the sessions accessing that block must wait on ‘GC buffer busy’ waits and this quickly leads to excessive GC buffer busy waits. Also, immediate branch block of those right most leaf block will play a role in the waits during leaf block splits.

So, how bad can it get? It can be very bad. A complete database hang situation is a possibility. Notice below that over 1000 sessions were waiting for ‘gc buffer busy’ events across the cluster. Application is completely down.

   INST_ID SQL_ID        EVENT              STATE    COUNT(*)
---------- ------------- ------------------ -------- --------
         4 4jtbgawt37mcd gc cr request     WAITING    9
         3 4jtbgawt37mcd gc cr request     WAITING    9
         3 a1bp5ytvpfj48 gc buffer busy    WAITING   11
         4 a1bp5ytvpfj48 gc buffer busy    WAITING   17
         4 14t0wadn1t0us gc buffer busy    WAITING   33
         4 gt1rdqk2ub851 gc buffer busy    WAITING   34
         4 a1bp5ytvpfj48 buffer busy waits WAITING   35
         2 a1bp5ytvpfj48 gc buffer busy    WAITING   65
         1 a1bp5ytvpfj48 gc buffer busy    WAITING  102
         2 7xzqcrdrnyw1j gc buffer busy    WAITING  106
         2 7xzqcrdrnyw1j enq: TX - index c WAITING  173
         1 7xzqcrdrnyw1j gc buffer busy    WAITING  198
         3 7xzqcrdrnyw1j gc buffer busy    WAITING  247
         4 7xzqcrdrnyw1j gc buffer busy    WAITING  247

How do you analyze the problem with right hand key indexes?

First, we need to verify that problem is due to right hand indexes. If you have access to ASH data, it is easy. For all sessions waiting for ‘gc buffer busy’ event query the current_obj#. Following query on ASH can provide you with the object_id involved in these waits. Also, make sure the statement is UPDATE or INSERT statements, not SELECT statement[ SELECT statements are discussed below].

select sample_time,  sql_id, event, current_obj#,sum (cnt)  from  gv$active_session_history
   where sample_time between  to_date ('24-SEP-2010 14:28:00','DD-MON-YYYY HH24:MI:SS') and
     to_date ('24-SEP-2010 14:29:59','DD-MON-YYYY HH24:MI:SS')
    group by  sample_time,  sql_id, event, current_obj#
   order by sample_time

SAMPLE_TIME                             |SQL_ID              |EVENT                                   |CURRENT_OBJ#|  COUNT(*)
26-AUG-10 PM               |14t0wadn1t0us       |gc buffer busy                          |       8366|        33
select owner, object_name, object_type from dba_objects where object_id=8366 or data_object_id=8366;

In this example, current_obj# is 8366, which we can query the dba_objects to find the correct object_id. If this object is an unique index or almost unique index, then you might be running in to a right hand growth indexes.

If you don’t have access to ASH then, you need to sample gv$session_wait (or gv$session from 10g), group by p1, p2 to identify the blocks inducing ‘gc buffer busy’ waits. Then, map those blocks to objects suffering from the issue.

select event,    p1, p2, count(*) from  gv$session s
event ='gc buffer busy' and state='WAITING'
group by event, p1, p2
order by 4

You can also use my script segment_stats_delta.sql to see the objects suffering from ‘gc buffer busy’ waits. See below for an example use:

segment_stats_delta.sql v1.01 by Riyaj Shamsudeen

...Prints Change in segment statistics in the past N seconds.
...Default collection period is 60 seconds.... Please wait for at least 60 seconds...

!!! NOTICE !!! This scripts drops and recreates two types: segment_stats_tbl_type and segment_stats_type.

Following are the available statistics:
Pay close attention to sampled column below. Statistics which are sampled are not exact and so, it might not reflect immediately.

NAME                                                             SAM
---------------------------------------------------------------- ---
logical reads                                                    YES
buffer busy waits                                                NO
gc buffer busy                                                   NO
db block changes                                                 YES
gc cr blocks received                                            NO
Enter value for statistic_name: gc buffer busy
Enter value for sleep_duration: 60
WSH                            | WSH_DELIVERY_DETAILS_U1        | 34
APPLSYS                        | WF_ITEM_ATTRIBUTE_VALUES_PK    | 2
WSH                            | WSH_DELIVERY_DETAILS_U1        | 2
INV                            | MTL_MATERIAL_TRANSACTIONS_U1   | 1
ONT                            | OE_ORDER_LINES_ALL             | 1

PL/SQL procedure successfully completed.

How do you resolve ‘GC buffer busy’ waits due to right hand growth index?

In a simplistic sense, You need to reduce the concurrency on the right most leaf block. There are few options to reduce the concurrency, and hash partitioning that unique index (or almost unique index) is a better solution of all. For example, if we convert the unique index as hash partitioned index with 32 partitions, then you are reducing the concurrency on that right most leaf block by 32 fold. Why? In hash partitioning scheme with 32 partitions, there are 32 index trees and inserts will be spread across 32 right most leaf blocks. In contrast, there is just 1 index tree in the case of non-partitioned index. Essentially, Each partition gets its own index tree. Given an ID column value, that row is always inserted in to a specific partition and that partition is identified by applying hash function over the partitioning column.

In the case of non-partitioned index, for example, values from 1000 to 1010 will be stored in the right most leaf block of the index. In the case of partitioned index with 32 partitions, value 1000 will be stored in partition 24, value 1001 will be stored in partition 19, meaning, values are hashed and spread around 32 partitions leading to improved concurrency. This will completely eliminate Right hand Growth index concurrency issue.

Non-partitioned index

Hash partitioned index with 2 partitions

So, What is the drawback of hash partitioning indexes?

If your query needs to do range scan with a predicate similar to ‘id between 1000 and 1005’, then the index range scan will need to scan 32 index trees, instead of one index tree as in the case of non-partitioned tables. Unless, your application executes these sort of queries millions of times, you probably wouldn’t notice the performance difference. For equality predicate, such as ‘id=:B1’, this is not an issue as the database will need to scan just one index tree.

Let’s discuss about reverse key indexes too. As Michael Hallas and Greg Rahn of real-world performance group said (and I happily concur), reverse key indexes are evil. If you are in Oracle Database version 10g, you can create a partitioned index on a non-partitioned table. So, If your application is suffering from a right hand growth index contention issue, you can convert the non-partitioned index to a hash-partitioned index with minimal risk. So, there aren’t many reasons to use reverse key indexes in 10g [ Remember that range scan is not allowed in the reverse key indexes either]. But, if you are unfortunate enough to support Oracle 9i database, you can NOT create a partitioned index on a non-partitioned table. If your application suffers from right hand growth index concurrency issues in 9i, then your options may be limited to reverse key indexes (or playing with sequences and code change or better yet – upgrade to 10g).

2. What if the statement is a SELECT statement?

It is possible for the SELECT statement to suffer from gc buffer busy waits too. If you encounter a scenario in which the object is an index and the statement is a SELECT statement, then this paragraph applies to you. This issue typically happens if there is higher concurrency on few blocks. For example, excessive index full scan on an index concurrently from many instances can cause ‘gc buffer busy’ waits. In this case, right approach would be tune the SQL statement to avoid excessive access to those index blocks.

In my experience, gc buffer busy waits on SELECT statement generally happens if you have problems with statistics and execution plans. So, verify that execution plan or concurrency didn’t change recently.

3. Freelists, Freelist groups

What if the object ID we queried in Active Session History belongs to a table block and the statement is an INSERT statement? We need to check to see if that block is a segment header block. If there are many concurrent inserts, and if you don’t use ASSM tablespace, then the inserts need to find free blocks. Segment header of an object stores the free list [ if you don’t use freelist groups]. So, concurrent inserts in to a table will induce excessive activity on the Segment header block of that table leading to ‘gc buffer busy’ waits.

Concurrent inserts in to a table with 1 freelist/1 freelist groups will also have contention in a non-segment header block too. When the session(s) searches for a free block in a freelist, all those sessions can get one or two free blocks to insert. This can lead to contention on that block.

Right approach in this case is to increase freelists, free list groups and initrans on those objects (and might need reorg for these parameters to take effect ). Better yet, use ASSM tablespaces to avoid these issues.

4. Other Scenarios

We discussed just few common issues. But, ‘gc buffer busy’ waits can happen for many reasons. Few of them are: CPU starvation issues, Swapping issues, interconnect issues etc. For example, if the process that opened the request for a block did not get enough CPU, then it might not drain the network buffers to copy the buffer to buffer cache. Other sessions accessing that buffer will wait on ‘gc buffer busy’ waits. Or If there is a network issue and the Global cache messages are slower, then it might induce higher gc buffer busy waits too. Statistics ‘gc lost packets’ is a good indicator for network issues, but not necessarily a complete indicator.

As a special case, if the sequences have been kept with lower cache value, then blocks belonging to seq$ table can be an issue too. Increasing cache for highly used sequence should be considered as a mandatory task in RAC.


In summary, next time you encounter this issue, drill down to see which object types and statements are involved. Debug to understand the root cause as ‘gc buffer busy’ waits are usually symptoms. You can read this blog in the traditional format as gc buffer busy waits_orainternals.pdf


Posted in Oracle database internals, Performance tuning, RAC | Tagged: , , , , | 16 Comments »

Query transformation – Part 1

Posted by Riyaj Shamsudeen on May 1, 2010

Query transformation is a set of techniques used by the optimizer to rewrite a query and optimizer it better. Few optimization paths open up to the optimizer after query transformation. Some query transformations must be costed to be chosen and some do not need to be costed. For example, if a table can be eliminated completely from the join, then that transformation is applied and need to cost that transformation is minimal.

Test case

We will use the following test case to illustrate the concepts behind Query transformation. Some of the optimizations that we see here works from version 11gR1 onwards and so, these test cases might not work in the versions 10g and below.

create table backup.t1 (n1 number not null primary key, n2 number not null, n3 varchar2(256) );
insert into backup.t1 select n1, n1, lpad ( n1, 250,'x') from (select level n1 from dual connect by level <=100);
create table backup.t2 (n1 number not null primary key , n2 number not null, n3 varchar2(256) );
alter table backup.t1 add constrainT t1_fk foreign key  (n2) references backup.t2(n1)
insert into backup.t2 select n1, n1, lpad ( n1, 250,'x') from (select level n1 from dual connect by level <=100);
insert into backup.t1 select n1, n1, lpad ( n1, 250,'x') from (select level n1 from dual connect by level <=100);

Join elimination (JE)

JE is a technique in which one or more tables can be eliminated from the execution plan without altering functional behavior. In the listing 1-1, query selects columns from the table t1 only, but there exists a join predicate between t1 and t2 in that query. Further, no columns are selected from table t2 in this query and join to t2 simply serves as to verify the existence of foreign key values. Enabled Foreign key constraint between these two tables establishes the existence check already and so, there is no need for explicit existence check in the query also. Join to table t2 can be eliminated by the optimizer safely.

select /*+ gather_plan_statistics */ t1.* from t1, t2 where t1.n2 = t2.n1;
select *  from table(dbms_xplan.display_cursor('','','allstats last'))
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
|   1 |  TABLE ACCESS FULL| T1   |      1 |     82 |    100 |00:00:00.01 |      13 |

Listing 1-1 :JE example case 1

As you see from the listing 1-1, Table T2 is removed from the execution plan. Since there is a valid foreign key constraint, optimizer eliminated the join condition to that table t2.

Let’s also discuss another Join Elimination test case. In the Listing 1-2, predicate is “t1.n2 not in (select t2.n1 from t2)”. As the enabled foreign key constraint dictates that this predicate will always be false and no rows will be returned. Optimizer promptly identified this condition and added a filter predicate in the step 1 with “NULL is NOT NULL” as a predicate. Step 1 is executed before step 2; Step 2 is never executed as the value of Starts column is zero in the execution plan.

SQL_ID  d09kmzum9wgta, child number 0
select /*+ gather_plan_statistics */ t1.* from t1 where t1.n2 not in
(select t2.n1 from t2 )

Plan hash value: 3332582666
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   |
|   0 | SELECT STATEMENT   |      |      1 |        |      0 |00:00:00.01 |
|*  1 |  FILTER            |      |      1 |        |      0 |00:00:00.01 |
|   2 |   TABLE ACCESS FULL| T1   |      0 |    100 |      0 |00:00:00.01 |
Predicate Information (identified by operation id):
   1 - filter(NULL IS NOT NULL)

Listing 1-2 :JE example case 2

Listing 1-3 provides another variation of JE.

select /*+ gather_plan_statistics */ t1.* from t1 where t1.n2 in
(select t2.n1 from t2 )

Plan hash value: 3617692013
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
|   0 | SELECT STATEMENT  |      |      1 |        |    100 |00:00:00.01 |      14 |
|   1 |  TABLE ACCESS FULL| T1   |      1 |    100 |    100 |00:00:00.01 |      14 |
Listing 1-3 :JE example case 2

Following output shows the trace lines from the 10053 trace file.
JE: Considering Join Elimination on query block SEL$5DA710D3 (#1)
Join Elimination (JE)
JE: cfro: T1 objn:74684 col#:2 dfro:T2 dcol#:2
JE: cfro: T1 objn:74684 col#:2 dfro:T2 dcol#:2
Query block (0E0D43D0) before join elimination:
SQL:******* UNPARSED QUERY IS *******
SELECT “T1”.”N1″ “N1″,”T1”.”N2″ “N2″,”T1”.”N3″ “N3” FROM “CBO3”.”T2″ “T2″,”CBO3”.”T1″ “T1” WHERE “T1″.”N2″=”T2”.”N1″
JE: eliminate table: T2
Registered qb: SEL$14EF7918 0xe0d43d0 (JOIN REMOVED FROM QUERY BLOCK SEL$5DA710D3; SEL$5DA710D3; “T2″@”SEL$2”)

Filter Predicate(s) Generation from constraints

Various filter predicates are generated and added to the execution plan using enabled and validated constraints (check, not null constraints).

In the Listing 1-4, columns n1 and n2 has enabled valid NOT NULL constraints that precludes null values in the columns n1 and n2. Query in the listing 1-4 has predicate “n1 is null or n2 is null” which can never be true. This fact is used by the optimizer to improve the execution plan. Filter predicate (NULL IS NOT NULL) is added in step 1 which will be FALSE. So, Step 2 is never executed as the value of Starts column is 0 in the execution plan. This means that step (2) in the execution plan was never executed and table T1 was never accessed.

select /*+ gather_plan_statistics */ * from t1 where n1 is null or n2 is null;
select *  from table(dbms_xplan.display_cursor('','','allstats last'));
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   |
|*  1 |  FILTER            |      |      1 |        |      0 |00:00:00.01 |
|   2 |   TABLE ACCESS FULL| T1   |      0 |    100 |      0 |00:00:00.01 |
Predicate Information (identified by operation id):
   1 - filter(NULL IS NOT NULL)

Listing 1-4: Filter predicate generation from NOT NULL constraint.

Let’s add a check constraint to this column to explain this further. In the listing 1-5 a check constraint is added which specifies that “n1 200 and generated predicate n1<200 will nullify each other leading to an always FALSE condition. Optimizer identified this condition and added a filter predicate in step 1: NULL IS NOT NULL.

alter table t1 add constraint t1_n1_lt_150 check (n1 200;
select *  from table(dbms_xplan.display_cursor('','','allstats last'));
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   |
|*  1 |  FILTER            |      |      1 |        |      0 |00:00:00.01 |
|*  2 |   TABLE ACCESS FULL| T1   |      0 |     20 |      0 |00:00:00.01 |

Predicate Information (identified by operation id):
   1 - filter(NULL IS NOT NULL)
   2 - filter("N1">200)

Listing 1-5:Filter predicate generation from a check constraint

Following lines from the trace file generated from event 10053 shows that a predicate n1<200 is added; This auto-generated predicate and existing predicate canceled each other leading to an eternally FALSE condition.

kkogcp: try to generate transitive predicate from check constraints for SEL$5DA710D3 (#0)
constraint: “T1”.”N1″<200
predicates with check contraints: "T1"."N2"="T2"."N1" AND "T1"."N1"<200
after transitive predicate generation: "T1"."N2"="T2"."N1" AND "T1"."N1"<200
finally: "T1"."N2"="T2"."N1"
apadrv-start: call(in-use=1056, alloc=16344), compile(in-use=44792, alloc=46272)

SJC: Set to Join Conversion

In some cases, the optimizer can convert a set operator to a join operator. Interestingly, this feature is not enabled by default (up to 11gR1). In the listing 1-6, we enable this parameter. A MINUS set operation has been converted to a join operation.

alter session set "_convert_set_to_join"=true;
select /*+ gather_plan_statistics  */ n2 from t1 minus select n1 from t2
Plan hash value: 3050591313
| Id  | Operation           | Name         | Starts | E-Rows | A-Rows |   A-Time   |...
|   0 | SELECT STATEMENT    |              |      1 |        |      0 |00:00:00.01 |...
|   1 |  HASH UNIQUE        |              |      1 |     99 |      0 |00:00:00.01 |
|   2 |   NESTED LOOPS ANTI |              |      1 |     99 |      0 |00:00:00.01 |
|   3 |    TABLE ACCESS FULL| T1           |      1 |    100 |    100 |00:00:00.01 |
|*  4 |    INDEX UNIQUE SCAN| SYS_C0010995 |    100 |      1 |    100 |00:00:00.01 |
Predicate Information (identified by operation id):
   4 - access("N2"="N1")

Listing 1-6: SJC

There is also a new hint set_to_join with this new feature.

      OPT_PARAM('_convert_set_to_join' 'true')

SU: Subquery Unnesting

Subqueries can be unnested in to a join. Listing 1-7 shows that a subquery is unnested in to a view and then joined to other row sources. In this listing, a correlated subquery is moved in to a view VW_SQ_1, unnested and then joined using Nested Loops Join technique. There are many different variations of Subquery Unnesting possible, but the crux of the matter is that subqueries can be unnested, joined and then costed.

 select /*+ gather_plan_statistics  */ n1 from t1 where n1 >  
	(select max(n2) from t2 where t2.n1 = t1.n1)
Plan hash value: 2311753844
| Id  | Operation                       | Name         | Starts | E-Rows | A-Rows |
|   0 | SELECT STATEMENT                |              |      1 |        |      0 |
|   1 |  NESTED LOOPS                   |              |      1 |      1 |      0 |
|   2 |   VIEW                          | VW_SQ_1      |      1 |      5 |    100 |
|*  3 |    FILTER                       |              |      1 |        |    100 |
|   4 |     HASH GROUP BY               |              |      1 |      5 |    100 |
|   5 |      TABLE ACCESS BY INDEX ROWID| T2           |      1 |    100 |    100 |
|*  6 |       INDEX RANGE SCAN          | SYS_C0010995 |      1 |    100 |    100 |
|*  7 |   INDEX UNIQUE SCAN             | SYS_C0010992 |    100 |      1 |      0 |
Predicate Information (identified by operation id):
   3 - filter(MAX("N2")<200)
   6 - access("T2"."N1""MAX(N2)")

SU is one reason why there are many performance issues after a database upgrade to 10g and above. Cost of unnested subquery will go up or down leading to an unfortunate choice of not-so-optimal execution plan.

Use of ORDERED hint can really play havoc with SU feature too. For example, in the listing 1-8, join between t1 and t2 is preferred followed by other joins. You would expect to see the leading table in the join to be T1, but the leading row source is VW_SQ_1.

select /*+ gather_plan_statistics ORDERED  */ t1.n1, t2.n1 from t1 , t2
where t1.n1 = t2.n1 and t1.n1 > (select max(n2) from t2 where t2.n1 =t1.n1)
Plan hash value: 3904485247
| Id  | Operation                        | Name         | Starts | E-Rows | A-Rows |
|   0 | SELECT STATEMENT                 |              |      1 |        |      0 |...
|   1 |  NESTED LOOPS                    |              |      1 |      1 |      0 |
|   2 |   NESTED LOOPS                   |              |      1 |      1 |      0 |
|   3 |    VIEW                          | VW_SQ_1      |      1 |      5 |    100 |
|*  4 |     FILTER                       |              |      1 |        |    100 |
|   5 |      HASH GROUP BY               |              |      1 |      5 |    100 |
|   6 |       TABLE ACCESS BY INDEX ROWID| T2           |      1 |    100 |    100 |
|*  7 |        INDEX RANGE SCAN          | SYS_C0010995 |      1 |    100 |    100 |
|*  8 |    INDEX UNIQUE SCAN             | SYS_C0010992 |    100 |      1 |      0 |
|*  9 |   INDEX UNIQUE SCAN              | SYS_C0010995 |      0 |      1 |      0 |

Predicate Information (identified by operation id):
   4 - filter(MAX("N2")<200)
   7 - access("T2"."N1""MAX(N2)")
   9 - access("T1"."N1"="T2"."N1")

Is CBO not honoring our hint? It is honoring our hint. Except that ORDERED hint was applied after the SU transformation and so, unnested view is in the leading row source. A variation of the transfromed query from 10053 trace file is printed below. With ORDERED hint, of course, CBO must choose the unnested view as the leading row source. Use LEADING hint instead of ORDERED hint if necessary.

SELECT /*+ ORDERED */ “T1”.”N1″ “N1″,”T2”.”N1″ “N1” FROM
(SELECT MAX(“T2”.”N2″) “MAX(N2)”,”T2″.”N1″ “ITEM_1” FROM “CBO3”.”T2″ “T2” GROUP BY “T2”.”N1″) “VW_SQ_2”,
“CBO3”.”T1″ “T1″,”CBO3”.”T2″ “T2” WHERE “T1″.”N1″=”T2”.”N1″ AND “T1″.”N1″>”VW_SQ_2″.”MAX(N2)” AND “VW_SQ_2″.”ITEM_1″=”T1”.”N1″


There are many techniques to cover in one blog entry. We will discuss these features further in upcoming blogs. This can be read in more conventional format: query_transformation_part_1_orainternals

Posted in CBO, Oracle database internals, Performance tuning | Tagged: , , , , , | 8 Comments »

RAC object remastering ( Dynamic remastering )

Posted by Riyaj Shamsudeen on March 25, 2010

In RAC, every data block is mastered by an instance. Mastering a block simply means that master instance keeps track of the state of the block until the next reconfiguration event (due to instance restart or otherwise).

Hash to the master

These data blocks are mastered in block ranges. For example, range of blocks starting from file 10, block 1 through block 128 may be mastered by instance 1, blocks from file 10, block 129 through 256 are mastered by instance 2 etc. Of course, there are differences between various versions 10g, 11g etc, but Idea here is that block ranges are uniformly mastered between various instances so that Global cache grants are evenly distributed among the instances. Interestingly, length of the block range is 128 from 10g onwards (Julian Dyke mentioned that is 1089 in 9i, but I have not personally tested it). Of course, Support recommends you to unset db_file_multiblock_read_count which will be auto adjusted to 128 which means that Full block range can be read with fewer GC messages, I suppose. I digress.

Further, Michael Möller pointed out that this hash-algorithm is further optimized: The hash-algorithm used when initially computing the master node from the DBA, results in a “virtual master”, which is then translated to a real (online&open) master by a lookup table (the length of which is the maximum number of possible nodes (128 ?). This means that when one node goes off/on-line, RAC does NOT have to recalculate the hash for all blocks, but only distribute the new Hash-to-node table. (One can later visualize dynamic remastering as an additional lookup table between the hash value and node. This table also needs redistributing on node changes.)

Following SQL is helpful in showing masters and owners of the block. This SQL joins, x$kjbl with x$le to retrieve resource name. If you are familiar with Oracle locking strategy, you would probably recognize the format of these cache fusion (aka old PCM) locks. Lock type in this case is BL, id1 is block# and id2 is file_id in this case. Column kjblname2 provides a decimal format lock resource.

Please observe the output below:

  1. Block range: File 1, block 6360-6376 is mastered by node 3 and also owned by node 3.
  2. Block range: File 1, blocks upto 10709 is mastered by instance 0 and owned by instance 3.
  3. Next block range from 10770 is mastered by instance 2 and owned by 3.

Also, note that this output is generated from a database with no remastering done yet.

REM In kjblname2 first entry before ',' is block and seond entry file_id*65536 for BL locks

select kj.*, le.le_Addr from (
select kjblname, kjblname2, kjblowner, kjblmaster, kjbllockp,
substr ( kjblname2,  instr(kjblname2,',')+1,   instr(kjblname2,',',1,2)-instr(kjblname2,',',1,1)-1)/65536 fl,
substr ( kjblname2, 1, instr(kjblname2,',')-1) blk
 from x$kjbl
) kj, x$le le
where le.le_kjbl = kj.kjbllockp
 order by le.le_addr

---------------------- --------------- ---------- ----------- --- ----- ----------------
[0x18d8][0x10000],[BL] 6360,65536,BL           3          3     1  6360 000000038DF9AB08
[0x18e7][0x10000],[BL] 6375,65536,BL           3          3     1  6375 000000038DFBF818 => case #2
[0x18e8][0x10000],[BL] 6376,65536,BL           3          3     1  6376 000000038DFD3BA0
[0x29d1][0x10000],[BL] 10705,65536,BL          3          0     1 10705 00000005D6FE9230
[0x29d5][0x10000],[BL] 10709,65536,BL          3          0     1 10709 000000038EFBB990
[0x2a12][0x10000],[BL] 10770,65536,BL          3          2     1 10770 000000075FFE3C18
[0x2a18][0x10000],[BL] 10776,65536,BL          2          2     1 10776 000000038EFA8488 => case #1 
[0x2a19][0x10000],[BL] 10777,65536,BL          3          2     1 10777 000000038EFB7F90 
[0x2a1a][0x10000],[BL] 10778,65536,BL          1          2     1 10778 000000038EFCC318 

Let’s consider three simple cases of SELECT statement running instance 3:

  1. A session is trying to access the block file 1, block 10776, but that block is mastered by instance 2 and also that block is owned by instance 2 (meaning, it is in instance 2 cache). So, instance 3 will sent a PR (Protected Read) mode BL lock request on that block to instance 2. Ignoring additional complexities, instance 2 will grant PR mode lock to instance 3 and transfer the block to instance 3. Obviously, this involves multiple GC messages, grants and block transfer. Statistics ‘gc remote grants’ gets incremented too.
  2. Let’s consider that session is trying to access another block: file 1, block 6375. That block is mastered by instance 3 and also owned by instance 3. At this point, there is no additional GCS/GES processing is needed and the session pin that buffer and continue the work.
  3. Let’s consider a third case. Session is trying to access file 1 block 6374. That block is not in any buffer cache, but instance 3 is master of the block, so local affinity locks are acquired with minimal GC messages and waits. That block is read from the disk in to the buffer cache

In the case #2 and #3 above, requesting instance also is the master node of a block or block range. In these cases, statistics ‘gc local grants’ is incremented and cheaper local affinity locks on those block ranges are acquired avoiding many Global cache messages.

So far so good, but what if, say instance 1, is reading one object (table, index etc) aggressively, but other instances are not reading that object at all? [ through some sort of application node partitioning or just plain workload]. Does it make sense for the instance accessing that object aggressively request a grant to the remote instance(s) for each OPEN on that object’s blocks? Especially, if the blocks are read in to the buffer cache, but disappears soon from the buffer cache? Wouldn’t that be better if the instance reading that object aggressively is also the master of that object, as in the cases #2 and #3 above?

In addition to that, if the block is supposed to be thrown away from buffer cache (close of BL lock) or if the block needs to be written, then that will involve additional overhead/messaging between the master instance and owner instance since the ownership needs to be communicated back to the master of the block.

Enter Object remastering.

Object Remastering

There are many new features in 10g/11g RAC stack. One of them is Object remastering feature. This feature was implemented in 10gR1 and improved in 10gR2 and further enhanced in 11g. I realize there are parameters in 9i also, but I don’t think it worked as intended.

With object remastering feature, if an object is accessed by an instance aggressively, then that instance will become the master of the object reducing gc remote grants improving performance of the application. In the prior sentence, I used the word “accessed”, but it is a loose term, and the correct term is if the instance is requesting much BL locks on an object, then that object can be remastered. In an ideal world, even if the application is not partitioned, remastering of the objects that were accessed aggressively from one instance will acquire cheaper local instance affinity locks and effective RAC Tax will be minimal.

Well, I said, in an ideal world :-) There are few issues here:

  1. Instance do not remember prior mastership across restarts. This means that instance needs to re-learn the object mastership map after every restart. I can see the complexities of remembering the mastership, but it is possible to implement that.
  2. Remastering is not exactly cheap. Instance GRD is frozen during reconfiguration and in a very busy instances, this can take many seconds leading to instance freeze for several seconds. While 10gR2 introduce parallel reconfiguration (_rcfg_parallel_Replay parameter controls this behavior) using all LMS processes to complete the reconfiguration, still, several seconds of freeze is not exactly acceptable in many environments.
  3. I advice my clients to keep LMS processes to a lower value (3 to 5), at the most, but instance reconfiguration effective parallelism is reduced if we reduce number of LMS processes.
  4. Last, but not the least important point is that, default values of few parameters that trigger remastering events are quite low for busy environments causing frequent remastering of objects. In an E-Business World, minor mismanagement in the manager configuration can lead to a massive reconfiguration issues.

Parameters, views and internals

Few parameters are controlling this behavior, not well documented, my test case results are not very accurate either. But, these parameters are giving us a picture of what is going on internally. These parameters are applicable to 10gR2 and below. For 11g, whole set of different parameters comes in to play and I will blog about the differences in another blog entry.

X$object_affinity_statistics maintains the statistics about objects and OPENs on those objects. It is important to understand the difference between OPEN and Buffer access. If the block is in the cache already in a suitable mode, there is no need for BL opens on that block. So, if the sessions are accessing the same block repeatedly without requesting any additional BL locks, then the count is not incremented. So, OPEN is simply a number of BL request initiated in an ephimeral time frame.

LCK0 process maintains these object affinity statistics. If an instance opens 50 more opens on an object then the other instance (controlled by _gc_affinity_limit parameter), then that object is a candidate for remastering. That object is queued and LMD0 reads the queue and initiates GRD freeze. LMON performs reconfiguration of buffer cache locks working with LMS processes. All these are visible in LMD0/LMON trace files. Parameter _gc_affinity_time controls how often the queue is checked to see if the remastering must be triggered or not with a default value of 10 minutes.

Now, you don’t want just any object as a candidate for remastering, meaning, if instance 1 opened 101 BL locks on that object and instance 2 opened 50 BL locks on that object, you don’t want to trigger object remastering. Only objects with higher amount of BL lock requests must be queued for remastering. Well, that threshold seems to be controlled by another parameter _gc_affinity_minimum: This parameter is defined as “minimum amount of dynamic affinity activity per minute” to be a candidate for remastering. Defaults to 2500 and I think, it is lower in a busy environment.

Few lines from LMD0 trace files showing that LMD0 is reading a request queue:

* kjdrchkdrm: found an RM request in the request queue 
  Transfer pkey 6589904 to node 3
*** 2009-10-12 11:41:20.257

How bad can it get?

Performance can suffer if there are remastering issues. Following AWR report shows that few instances froze due to DRM reconfiguration issue. Same type of freeze is visible in all other nodes too. gc buffer busy is a side effect of DRM freeze (not always, but in this case).

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
gc buffer busy                    1,826,073     152,415     83   62.0    Cluster
CPU time                                         30,192          12.3
enq: TX - index contention           34,332      15,535    453    6.3 Concurrenc
gcs drm freeze in enter server       22,789      11,279    495    4.6      Other
enq: TX - row lock contention        46,926       4,493     96    1.8 Applicatio

At the same time, a storm of DRM were started. This lead to repetitive DRM freeze and instance configuration leading to massive performance issues.

* received DRM start msg from 2 (cnt 5, last 1, rmno 404) 
Rcvd DRM(404) Transfer pkey 1598477 from 3 to 2 oscan 1.1
Rcvd DRM(404) Dissolve pkey 6100030 from 2 oscan 0.1
Rcvd DRM(404) Dissolve pkey 5679943 from 2 oscan 0.1
Rcvd DRM(404) Transfer pkey 6561036 from 0 to 2 oscan 1.1
Rcvd DRM(404) Transfer pkey 5095243 to 2 oscan 0.1

A small test case

Let’s walk through a test case that shows DRM in play. Query used index access path to read nearly all blocks from a big index.

Session #1: 
select data_object_id from dba_objects where object_name='WMS_N1';

REM No affinity statistics yet.
select * from x$object_affinity_statistics where object=6099472;
no rows selected.
REM executing a costly select statement
select /*+ index(a WMS_N1 */ count(*) from big_table1 a;

Session #2: I was monitoring the DRM tables:
REM DRM operations completed so far is 409. We will keep track of this count to see the remastering count intereasing. There are few
REM other interesting fields in this view.
select drms from X$KJDRMAFNSTATS;
REM  I see that 23442 opens on that index already since the session #1 started running. 
select * from x$object_affinity_statistics where object=6099472;
ADDR                   INDX    INST_ID     OBJECT       NODE      OPENS
---------------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF7C05BFA8         14          1    6099472          1      23442
REM No mastering has kicked in for that object.
select * from v$gcspfmaster_info where object_id=6099472;
no rows selected
REM few seconds later, open count increased from 23442 -> 33344.
select * from x$object_affinity_statistics where object=6099472;
ADDR                   INDX    INST_ID     OBJECT       NODE      OPENS
---------------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF7C05BFA8         16          1    6099472          1      33344

REM No remastering has kicked in for that object. 
select * from v$gcspfmaster_info where object_id=6099472;
no rows selected

REM Surprisingly, while session #1 is still executing, the counter for OPENS was zeroed out even when though DRM has not been triggered yet.
REM  OPENS Increased to 1229 from 0 since the session #1 is still executing.
ADDR                   INDX    INST_ID     OBJECT       NODE      OPENS
---------------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF7C05BFA8          0          1    6099472          1       1229
REM Approximately 10 minutes or so later, Remastering kicked in..
REM # of DRMS increased from 409 to 411
select drms from X$KJDRMAFNSTATS;
REM Remastering of that index happened. Now the master is 0 which is instance 1.
select * from v$gcspfmaster_info where object_id=6099472;
---------- ---------- -------------- --------------- ------------
         0    6099472              0           32767            1
REM Opens are still increasing but remastering has already occurred.
select * from x$object_affinity_statistics where object=6099472;
ADDR                   INDX    INST_ID     OBJECT       NODE      OPENS
---------------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF7C05AF78          2          1    6099472          1      42335
FFFFFFFF7C05AF78          3          1    6099472          2          1
REM LMON trace files are also indicating transfer of that pkey. Notice that pkey is same as object_id
*** 2010-03-23 10:41:57.355
Begin DRM(411) - transfer pkey 6099472 to 0 oscan 0.0
 ftd received from node 1 (4/0.30.0)
 all ftds received
REM few seconds later, opens have been reset again.
select * from x$object_affinity_statistics where object=6099472;
ADDR                   INDX    INST_ID     OBJECT       NODE      OPENS
---------------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF7C05BFA8          0          1    6099472          1       7437
REM Still the master is instance 1. 
select * from v$gcspfmaster_info where object_id=6099472;
---------- ---------- -------------- --------------- ------------
         0    6099472              0           32767            1

Essentially, an object was remastered after excessive BL locking requests ( in a loose term accesses) on that index.

undo and affinity

Mastering of Undo segments differ from non-undo segment mastering. With non-undo segments, all the blocks are mastered by a hash technique spreading mastership among instances for a segment. Only after an instance opens BL locks aggressively on a segment that segment is mastered. But, for undo segments, Instance that activates an undo segment masters the segment immediately. This makes sense, since that undo segment will be used by the instance opening the segment in most cases. Parameter _gc_undo_affinity controls whether this dynamic undo remastering is enabled or not.

Since undo segments do not have real object_ids, a dummy object_ids over a value of 4294950912 is used. For example, undo segment 1 (with usn=1) will have an object_id of 4294950913, usn=2 will have object_id of 4294950914 etc.
[ 4294950912 = power(2,32) – power (2,14) = xFFFFC000 ]

select objecT_id, object_id-4294950912 usn, current_master, previous_master, 
remaster_cnt  from v$gcspfmaster_info where object_id>4294950912

---------- ---------- -------------- --------------- ------------
4294950913          1              0           32767            1
4294950914          2              0           32767            1
4294950915          3              0           32767            1
4294950916          4              0           32767            1
4294950917          5              0           32767            1
4294950918          6              0           32767            1
4294950919          7              0           32767            1
4294950920          8              0           32767            1
4294950921          9              0           32767            1
4294950922         10              0           32767            1
4294950923         11              1           32767            1
4294950924         12              1           32767            1
4294950925         13              1           32767            1

REM Notice that usno 0 is in both instances. That is system undo segment.
REM As you can see below first 10 undo segments are mastered by node 1 is next 3 are mastered by instance 2.

select inst_id, usn, gets from gv$rollstat where usn <=13
order by inst_id, usn
   INST_ID        USN       GETS
---------- ---------- ----------
         1          0       3130
         1          1     108407
         1          2      42640
         1          3      43985
         1          4      41743
         1          5     165166
         1          6      43485
         1          7     109044
         1          8      23982
         1          9      39279
         1         10      48552
         2          0       4433
         2         11     231147
         2         12      99785
         2         13      1883

I was not successful in triggering another undo segment remastering event. I created one active transaction generating 200K undo blocks in one node, another node was reading that table and I can see enormous waits for those undo blocks. But, I didn’t see any remastering events related to that undo segment. Not sure why it did not work, may be the conditions for the undo segment remastering is different.

[ PS: I am able to manually remaster the undo segment using lkdebug command discussed below: So, code must be remastering the undo segments automatically too, but may be some other conditions must be met.

  1* select * from v$gcspfmaster_info where object_id=431+4294950912
---------- ---------- -------------- --------------- ------------
         0 4294951343              0           32767            1
Oradebug lkdebug –m pkey 4294951343

* kjdrchkdrm: found an RM request in the request queue
  Transfer pkey 4294951343 to node 1
*** 2010-03-24 12:47:29.011
Begin DRM(520) - transfer pkey 4294951343 to 1 oscan 0.1
 ftd received from node 0 (4/0.31.0)
 all ftds received

select * from v$gcspfmaster_info where object_id=431+4294950912
SQL> /
---------- ---------- -------------- --------------- ------------
         0 4294951343              1               0            2

I am not preaching that you should modify these undocumented parameters. Far from it. Understand the parameters, if you run in to wait events such as ‘gc remaster’, ‘gcs freeze for instance reconfiguration’, understand whether the default values are quite low. Work with support and see if this can be tuned.

Manual remastering

You can manually remaster an object with oradebug command :
oradebug lkdebug -m pkey

This enqueues an object remaster request. LMD0 and LMON completes this request
*** 2010-01-08 23:25:54.948
* received DRM start msg from 1 (cnt 1, last 1, rmno 191)
Rcvd DRM(191) Transfer pkey 6984154 from 0 to 1 oscan 0.0
 ftd received from node 1 (8/0.30.0)
 ftd received from node 0 (8/0.30.0)
 ftd received from node 3 (8/0.30.0)
Current_master starts from 0. 
 1* select * from v$gcspfmaster_info where object_id=6984154
SQL> /
---------- ---------- -------------- --------------- ------------
         0    6984154              1               0            2
SQL> oradebug lkdebug -m pkey 6984154
Statement processed.
SQL>  select * from v$gcspfmaster_info where object_id=6984154
---------- ---------- -------------- --------------- ------------
         0    6984154              2               1            3


In summary, remastering is a great feature. It is a pity that some times, we fall victims of the side effects. So, if you run in to issues with remastering, don’t disable it, but see if you can tune those parameter upwards so as to control the remastering events. If you stil want to disable DRM completely, I would recommend setting _gc_affinity_limit and _gc_affinity_minimum to much higher value, say 10Million. Setting the parameter _gc_affinity_time to 0 will completely disable DRM, but that also means that you can not manually remaster objects. Further, Arup mentioned that x$object_affinity_statistics is not maintained if DRM is disabled.

Again, these are undocumented parameters. Before you change these parameters make sure that Oracle Support agrees with you.

Update 1:

From 11g onwards, affinity management renamed to policy management. For example, x$object_affinity_statistics is renamed to x$object_policy_statistics. Similarly, initialization parameters are renamed too:Parameter _gc_affinity_limit is renamed to _gc_policy_limit; Parameter _gc_affinity_time is renamed to _gc_policy_time; A new view v$policy_history is introduced and all rows with policy_event = ‘initiate_affinity’ are associated with DRM events.
Other details about the blog still holds good, except that by default _gc_policy_limit is lowered to 1500 which would mean that, by theory, 11g may have more DRM events. YMMV.

 1* select * from  v$policy_history
---------- -------------------- -------------- ----------------------  --------------------
         2 glru_on                           0                      1  10/15/2010 10:58:28
         2 glru_on                           0                      1  10/15/2010 11:21:32
         2 initiate_affinity             74809                      1  10/15/2010 13:27:44

[ Many Thanks Arup Nanda and Michael Möller (aka “M2”) for reviewing this blog entry, they contributed heavily to some of my discussion. But, any mistakes in this document is solely of mine. ]
This blog is available in a pdf format from Dynamic_remastering_RAC_orainternals.

Posted in Oracle database internals, Performance tuning, Presentations, RAC | Tagged: , , , , , , , , , , , , , , , , , , | 31 Comments »

RMOUG 2010: My presentations

Posted by Riyaj Shamsudeen on February 17, 2010

It is very disappointing to me that I had to cancel my trip to RMOUG training days. I am sick and was not able to catch the flight due to that.

But, I can always share my presentations here. I had two presentations planned in this training day and can be accessed as below:

Advanced RAC troubleshooting

Why optimizer hates my sql

RMOUG training days audience: Please accept my sincere apologies.

Posted in CBO, EBS11i, Oracle database internals, Performance tuning, Presentations, RAC | Tagged: , , , , , , , , , , | Leave a Comment »


Posted by Riyaj Shamsudeen on December 31, 2009

I will be presenting on two topics in HOTSOS’ 2010, an Oracle performance focused conference, in my home town Dallas, Texas. You can read about my presentation topics in HOTSOS’ 2010 Riyaj . This symposium is a valued conference for performance engineers, DBAs and developers who are interested to know learn about performance. There are many great speakers presenting in this conference and the main page for this conference is HOTSOS ‘2010 . BTW, My friend Alex Gorbachev interviewed Gary Goodman and posted a video in his blog also.

Tanel Poder is conducting the HOTSOS training day this year. You can’t miss his training day and I heard that he is working on a MOTS (Mother Of all Tuning Script) and planning to release that in HOTSOS ‘2010.

On behalf of the Dallasites, I invite you to visit Dallas and attend this great conference.

I wish you a Happy and prosperous New Year ‘2010!

Posted in Performance tuning, Presentations | Tagged: , , | Leave a Comment »

RAC performance tuning: Understanding Global cache performance

Posted by Riyaj Shamsudeen on December 23, 2009

Global cache performance metrics are not correctly measured. It is not understood clearly either. There are even few blogs and web pages disseminating incorrect information. This blog entry is an attempt to offer few methods and scripts to understand global cache performance.

Always review all instances

It is very important to review the performance metrics from all instances in that RAC cluster, not just one instance that you are connected. If you have access to AWR reports, then it is critical to generate AWR reports (or statspack reports) from all instances. But, the problem is that, DBAs tend to generate AWR reports after logging in to each instance iteratively, enter couple of parameters and then reports are generated. Not exactly a convenient practice.

  REM connect to each instance separately, type in the beginning snap_id and ending snap_id for each node etc..
   sqlplus mydba@proddb1 
   sqlplus mydba@proddb2
   sqlplus mydba@proddb3


There are few issues with this approach. It is a cumbersome practice if the instance count is higher. In addition to that, all of AWR reports are, in turn, accessing underlying AWR tables. Physically, rows from all instances are together in the same block and so, by executing these reports connecting to various instances, Global cache traffic is increased. If the database is suffering from Global cache (GC) performance issues then generating reports connecting to various instances is probably not a grand idea.

Keep Reading

Posted in Oracle database internals, Performance tuning, RAC | Tagged: , , , , , , , , | 26 Comments »

Is plan_hash_value a final say?

Posted by Riyaj Shamsudeen on September 13, 2009

I was reviewing a performance issue with a client recently. Problem is that increased global cache waits causing application slowdown affecting few critical business functions. Using one of my script gc_traffic.sql > and graphing the results with Excel spreadsheet, it is established that there is a marked increase in GC traffic today compared to week earlier. Similar jobs runs every day and so comparing two week days is sufficient to show the GC traffic increase. Graph is between total blocks and AWR snap time in 30 minutes interval. [Click the picture below to review the graph clearly.]

Identifying the object creating this increased GC traffic is essential to identify root cause. We were able to quickly determine that this increase in GC traffic was localized around few SQL statements using ADDM and AWR reports. We decided to focus on one SQL with an obvious increase in elapsed time compared to prior week. So, first question asked, is there a change in the plan? plan_hash_value was reviewed and quickly determined that there is no change in the plan_hash_value.

Keep Reading

Posted in CBO, Performance tuning, RAC, Uncategorized | Tagged: , , , , , | 7 Comments »

Shared pool freelists (and durations)

Posted by Riyaj Shamsudeen on August 9, 2009

My earlier blog about shared pool duration got an offline response from one of my reader:
” So, you say that durations aka mini-heaps have been introduced from 10g onwards. I have been using Steve Adams’ script shared_pool_free_lists.sql. Is that not accurate anymore?”


Shared pool free lists

I have a great respect for Steve Adams . In many ways, he has been a great virtual mentor and his insights are so remarkable.

Coming back to the question, I have used Steve’s script before and it is applicable prior to Oracle version 9i. In 9i, sub-heaps were introduced. Further, shared pool durations were introduced in Oracle version 10g. So, his script may not be applicable from version 9i onwards. We will probe this further in this blog.

This is the problem with writing anything about internals stuff, they tend to change from version to version and In many cases, our work can become obsolete in future releases(including this blog!).

Keep Reading

Posted in 11g, Oracle database internals, Performance tuning, shared_pool | Tagged: , , , , , , , , , , , | 2 Comments »

ORA-4031 and Shared Pool Duration

Posted by Riyaj Shamsudeen on August 6, 2009

After reading my earlier post on shared pool A stroll through shared pool heap , one of my client contacted me with an interesting ORA-4031 issue. Client was getting ORA-4031 errors and shared pool size was over 4GB ( in a RAC environment). Client DBA queried v$sgastat to show that there is plenty of free memory in the shared pool. We researched the issue and it is worth blogging. Client DBA was confused as to how there can be ORA-4031 errors when the shared pool free memory is few GBs.

Heapdump Analysis

At this point, it is imperative to take heapdump in level 2 and Level 2 is for the shared pool heap dump. [ Please be warned that it is not advisable to take shared pool heap dumps excessively, as that itself can cause performance issue. During an offline conversation, Tanel Poder said that heapdump can freeze instance as his clients have experienced.]. This will create a trace file in user_dump_dest destination and that trace file is quite useful in analyzing the contents of shared pool heap. Tanel Poder has an excellent script heapdump_analyzer . I modified that script adding code for aggregation at hea, extent and type levels to debug this issue further and it is available as heapdump_dissect.ksh . ( with a special permission from Tanel to publish this script.)

Shared pool review

You can read much more about shared pool in my earlier blog entry posted above. Just as a cursory review, shared pool is split in to multiple sub heaps. In 10g, each of those sub heaps are divided in to even smaller sub heaps, let’s call it mini-heaps. For example, in this specific database, there are three sub heaps. Each of those sub heaps are further split in to four mini-heaps (1,0), (1,1), (1,2) and (1,3) each.

	sga heap(1,0)
	sga heap(1,1)
	sga heap(1,2)
	sga heap(1,3)

	sga heap(2,0)
	sga heap(2,1)
	sga heap(2,2)
	sga heap(2,3)

	sga heap(3,0)
	sga heap(3,1)
	sga heap(3,2)
	sga heap(3,3)	

Keep Reading

Posted in Oracle database internals, Performance tuning, RAC | Tagged: , , , , , , , , , | 16 Comments »

RAC, parallel query and udpsnoop

Posted by Riyaj Shamsudeen on June 20, 2009

I presented about various performance myths in my ‘battle of the nodes’ presentation. One of the myth was that how spawning parallel query slaves across multiple RAC instances can cause major bottleneck in the interconnect. In fact, that myth was direct result of a lessons learnt presentation from a client engagement. Client was suffering from performance issues with enormous global cache waits running in to 30+ms average response time for global cache CR traffic and crippling application performance. Essentially, their data warehouse queries were performing hundreds of parallel queries concurrently with slaves spawning across three node RAC instances.

Of course, I had to hide the client details and simplified using a test case to explain the myth. Looks like either a)my test case is bad or b) some sort of bug I encountered in version c) I made a mistake in my analysis somewhere. Most likely it is the last one :-(. Greg Rahn questioned that example and this topic deserves more research to understand this little bit further. At this point, I don’t have and database is in and so we will test this in


UDP is one of the protocol used for cache fusion traffic in RAC and it is the Oracle recommended protocol. In this article, UDP traffic size must be measured. Measuring Global cache traffic using AWR reports was not precise. So, I decided to use a dtrace tool kit tool:udpsnoop.d to measure the traffic between RAC nodes. There are two RAC nodes in this setup. You can read more about udpsnoop.d. That tool udpsnoop.d can be downloaded from dtrace toolkit . Output of this script is of the form:

PID        LADDR           LPORT           DR         RADDR           RPORT                 SIZE
---------- --------------- --------------- ---------- --------------- --------------- -----------
15393      38395           ->      40449                 8240

Keep Reading

Posted in Oracle database internals, Performance tuning, RAC | Tagged: , , , , , , , , , , | 15 Comments »


Get every new post delivered to your Inbox.

Join 264 other followers