Oracle database internals by Riyaj

December 23, 2009

RAC performance tuning: Understanding Global cache performance

Global cache performance metrics are not correctly measured. It is not understood clearly either. There are even few blogs and web pages disseminating incorrect information. This blog entry is an attempt to offer few methods and scripts to understand global cache performance.

Always review all instances

It is very important to review the performance metrics from all instances in that RAC cluster, not just one instance that you are connected. If you have access to AWR reports, then it is critical to generate AWR reports (or statspack reports) from all instances. But, the problem is that, DBAs tend to generate AWR reports after logging in to each instance iteratively, enter couple of parameters and then reports are generated. Not exactly a convenient practice.

  REM connect to each instance separately, type in the beginning snap_id and ending snap_id for each node etc..
   sqlplus mydba@proddb1
   @$ORACLE_HOME/rdbms/admin/awrrpt.sql
   exit;
   sqlplus mydba@proddb2
   @$ORACLE_HOME/rdbms/admin/awrrpt.sql
   exit;
   sqlplus mydba@proddb3
   @$ORACLE_HOME/rdbms/admin/awrrpt.sql
   exit;

There are few issues with this approach. It is a cumbersome practice if the instance count is higher. In addition to that, all of AWR reports are, in turn, accessing underlying AWR tables. Physically, rows from all instances are together in the same block and so, by executing these reports connecting to various instances, Global cache traffic is increased. If the database is suffering from Global cache (GC) performance issues then generating reports connecting to various instances is probably not a grand idea.

I have created few basic scripts to generate AWR reports from all instances in a single step. Script awrrpt_all_gen.sql queries AWR tables and gv$instance to access last snap_id and various instance_ids. Then this script generates AWR reports for all instances in that cluster in the current directory, for the last AWR snap. This script is quite handy while debugging RAC performance issues. Now, we have AWR reports from all instances in one directory and we can easily grep these files to understand metrics in various instances.

@awrrpt_all_gen.sql

awrrpt_all_gen.sql v1.00 by Riyaj Shamsudeen @orainternals.com

To generate AWR Report from all RAC instances concurrently.
Creates reports using last two snap_ids.

...Generating awrrpt_all.sql script.... Please wait....
...Completed script generation.

Executing awrrpt_all.sql to generate AWR reports.
...Generates AWR reports with file name format awrrpt___.txt for each instance.
...Please wait for few minutes...

...AWR report created for instance 1. Please wait..
...AWR report created for instance 2. Please wait..
...AWR report created for instance 3. Please wait..
...AWR report created for instance 4. Please wait..

AWR reports created.

If you would like ability to choose the snap_ids, then use the script awrrpt_all_range_gen.sql

Use granular approach to GC measurements

It is important to measure Global cache performance metrics correctly, especially those damned averages. Incorrect measurements of averages can lead to faulty analysis. For example, following query is retrieving average global cache CR receive time from gv$sysstat view. Problem with this SQL statement is that this statement is retrieving average global cache receive time from the restart of the instances. Global cache performance is quite sensitive to workload and querying over a wide time range, as in this query, can lead to faulty analysis.


REM This is really a bad idea!!

select b1.inst_id, b2.value "RECEIVED",
b1.value "RECEIVE TIME",
((b1.value / b2.value) * 10) "AVG RECEIVE TIME (ms)"
from gv$sysstat b1, gv$sysstat b2
where b1.name = 'gc current block receive time' and
b2.name = 'gc current blocks received' and b1.inst_id = b2.inst_id;

As an analogy, above query is equivalent to a patient-Doctor conversation:

Patient: Doctor, I don’t feel good. I feel that I am having fever.
Doctor : Nope. Your average body temperature from the beginning of this year is around 98. So, you are doing excellent.

It is essential to review the global cache performance metrics at a granular level. Use AWR reports or statspack reports for a short period of time. Better yet, if you want to quickly review GC performance metrics, you can use my script gc_traffic_print.sql. This script captures various statistics relevant to Global cache performance in to a PL/SQL table, sleeps for N seconds (default N=60) and re-captures the statistics after the sleep. Then the script prints all the relevant details in easily readable tabular format. Very useful in quickly reviewing the Global cache performance metrics.

SQL> @gc_traffic_print.sql

gc_traffic_print.sql v1.20 by Riyaj Shamsudeen @orainternals.com

...Calculating GC Rx and Tx timing and blocks..
...Default collection period is 60 seconds.... Please wait for at least 60 seconds...
Enter value for sleep: 30
---------|--------------|---------|----------------|----------|---------------|---------------|-------------|
Inst     | CR blocks Rx | CR time |  CUR blocks Rx | CUR time |  CR blocks Tx | CUR blocks Tx |Tot blocks   |
---------|--------------|---------|----------------|----------|---------------|---------------|-------------|
1        |         29288|    12.17|          14914|      4.58|          27059|           17439|        88700|
2        |         21201|     5.75|          12835|      5.19|          26831|           16741|        77608|
3        |         22942|     4.64|          11751|      3.82|          23918|            9205|        67816|
4        |         32075|     4.78|          19260|      4.55|          31046|           19331|       101712|
---------|--------------|---------|----------------|----------|---------------|---------------|-------------|

PL/SQL procedure successfully completed.

Column key for the above script is:
CR block RX : GC CR blocks received
CR time : Average GC CR receive time
CUR blocks RX: GC CUR blocks received
CUR time: Average GC CuR receive time
CR blocks TX: GC CR blocks transmitted
CUR blocks TX: GC CUR blocks transmitted
tot blocks: Sum of transmitted + received for both CR and CUR traffic

Breakup: global cache performance

Global cache processing performance can be broken down to its sub components ( excluding wire latency, but that can be calculated). Breakup of this metrics must be done at a granular level also. We can use yet another script gc_traffic_processing.sql to review the performance metrics for the past N seconds. For example, in the output printed below, Global cache CR processing time is broken in to various sub-components of global cache processing. Remember that these performance metrics are in the SEND side of Global cache performance metrics. Meaning, inefficient global cache processing in one instance affects other instances' global cache receive time, but not the self. For example, in the table above, GC CR receive performance was at 12.1ms for instance 1. That is due to another instance (probably node 4 in this case) suffering from high GC processing time (more specifically high 'CR block flush time' in node 4). It is far more accurate to review global cache performance issues in a granular basis. Caution though, try not to specify at least a minute so that these numbers are consistent.

@gc_traffic_procssing.sql
gc_traffic_processing.sql v1.00 by Riyaj Shamsudeen @orainternals.com

...Prints various timing related information for the past N seconds
...Default collection period is 60 seconds.... Please wait for at least 60 seconds...
Enter value for sleep:

---------|-----------|---------|-----------|----------|------------|------------|------------|----------|
Inst     | CR blk Tx | CR bld  | CR fls tm | CR snd tm| CUR blk TX | CUR pin tm | CUR fls tm |CUR snd tm|
---------|-----------|---------|-----------|----------|------------|------------|------------|----------|
1        |      62732|       .2|       1.25|       .21|       48920|        .96|         .22|        .22|
2        |      46037|      .13|        .76|        .2|       42019|        .68|         .17|         .2|
3        |      72237|      .63|        .79|       .19|       22697|        .44|         .29|        .23|
4        |     113816|      .13|       2.04|       .15|       51020|        .87|         .12|        .13|
--------------------------------------------------------------------------------------------------------

PL/SQL procedure successfully completed.

Here is the column definitions for the script output:
CR blk TX: GC CR blocks served
CR bld : Average GC CR build time/CR block served
CR fls tm: Average GC CR flush time/CR block served
CR snd tm: Average GC CR send time/CR block served
CUR blk TX: GC CUR blocks served
CUR pin tm: Average GC Current pin time /CUR block served
CUR fls tm : Average GC Current flush time/CUR block served
CUR snd tm: Average GC current send time/CUR block served

Review histograms for GC events

Averages can be misleading. Revealing performance issues underneath the average is a very important task to do. For example, we see that 'gc cr block receive time' average is high. But, let's review the histogram for the gc events related to Global cache receive time. My script gc_event_histogram.sql is quite handy to review event histograms. This script provides a difference in performance statistics between two samples and prints them in a readable format. In this case, node 1 is has 8578 waits for the event 'gc cr block 2-way' under 2ms and the sessions in instance 1 had 9851 waits between 2-4ms. Only few waits over 16ms.

@gc_event_histogram.sql

gc_event_histogram.sql v1.0 by Riyaj Shamsudeen @orainternals.com

Default collection period is sleep seconds. Please wait..
Enter value for event: gc cr block 2-way
Enter value for sleep: 60
---------|-----------------------|----------------|----------|
Inst id  | Event                 |wait time milli |wait cnt  |
---------|-----------------------|----------------|----------|
1        |gc cr block 2-way      |               1|        98|
1        |gc cr block 2-way      |               2|      8578|
1        |gc cr block 2-way      |               4|      9851|
1        |gc cr block 2-way      |               8|      2291|
1        |gc cr block 2-way      |              16|       117|
1        |gc cr block 2-way      |              32|         6|
1        |gc cr block 2-way      |              64|         1|
1        |gc cr block 2-way      |             128|         2|
1        |gc cr block 2-way      |             256|         0|
1        |gc cr block 2-way      |             512|         0|
1        |gc cr block 2-way      |            1024|         0|
1        |gc cr block 2-way      |            2048|         0|
1        |gc cr block 2-way      |            4096|         0|
1        |gc cr block 2-way      |            8192|         0|
1        |gc cr block 2-way      |           16384|         0|
2        |gc cr block 2-way      |               1|       155|
2        |gc cr block 2-way      |               2|     10792|
2        |gc cr block 2-way      |               4|     14201|
2        |gc cr block 2-way      |               8|      3887|
2        |gc cr block 2-way      |              16|       204|
...

But, we also need to see 3-way waits. Let's review 3-way waits with the above script. We will simply supply 'gc cr block 3-way' as the wait event. In this case, we see that there are 12,062 waits over between 4-8ms. Of course, this must be improved.


@gc_event_histogram.sql

gc_event_histogram.sql v1.0 by Riyaj Shamsudeen @orainternals.com

Default collection period is sleep seconds. Please wait..
Enter value for event: gc cr block 3-way
Enter value for sleep: 60
---------|-----------------------|----------------|----------|
Inst id  | Event                 |wait time milli |wait cnt  |
---------|-----------------------|----------------|----------|
1        |gc cr block 3-way      |               1|         3|
1        |gc cr block 3-way      |               2|       320|
1        |gc cr block 3-way      |               4|     17017|
1        |gc cr block 3-way      |               8|     12062|
1        |gc cr block 3-way      |              16|      1725|
1        |gc cr block 3-way      |              32|       113|
1        |gc cr block 3-way      |              64|         0|
1        |gc cr block 3-way      |             128|         0|
1        |gc cr block 3-way      |             256|         0|
1        |gc cr block 3-way      |             512|         0|
1        |gc cr block 3-way      |            1024|         0|
1        |gc cr block 3-way      |            2048|         0|
1        |gc cr block 3-way      |            4096|         0|
1        |gc cr block 3-way      |            8192|         0|
...

In essence, these four scripts will be useful in debugging Global cache performance issues. Please let me know if you encounter errors in these scripts. We will use these scripts in future discussions to show how we resolved few RAC performance issues.

November 24, 2009

UKOUG 2009

Filed under: Presentations, RAC — orainternals @ 8:46 pm

Well, it is that time of the year !

Few Presentations

I will be presenting two papers in Birmingham, UK Oracle User Group Technology conference . My sessions are :


  1. Tuesday 10:55-11:55: Battle of the nodes: RAC Performance myths .

  2. Wednesday 15:15-16:00: Advanced performance issues: A close encounter with real world performance issues .

I will upload new versions of these presentations in this blog, few days after the conference. See you in Birmingham.

Update: Papers and presentations can be downloaded following the links below:
1. Battle of the nodes RAC performance myths ppt

2. Battle of the nodes RAC performance myths doc

3. Close encounter with real world and odd perf issues ppt

September 13, 2009

Is plan_hash_value a final say?

I was reviewing a performance issue with a client recently. Problem is that increased global cache waits causing application slowdown affecting few critical business functions. Using one of my script gc_traffic.sql > and graphing the results with Excel spreadsheet, it is established that there is a marked increase in GC traffic today compared to week earlier. Similar jobs runs every day and so comparing two week days is sufficient to show the GC traffic increase. Graph is between total blocks and AWR snap time in 30 minutes interval. [Click the picture below to review the graph clearly.]

Identifying the object creating this increased GC traffic is essential to identify root cause. We were able to quickly determine that this increase in GC traffic was localized around few SQL statements using ADDM and AWR reports. We decided to focus on one SQL with an obvious increase in elapsed time compared to prior week. So, first question asked, is there a change in the plan? plan_hash_value was reviewed and quickly determined that there is no change in the plan_hash_value.

Little bit of history, there were few changes performed by the vendor over the weekend as a part of few bug fixes. Vendor’s argument was that since there is no change to the plan_hash_value, SQL access plan did not change and so, this can’t be due to vendor changes. My Client’s argument was that there were no changes to the environment and problem started after the application changes.

There are many different things that can go wrong without changes to the execution plan. We can ignore those conditions for now since (a) there has been no changes to the environment (b) no visible changes to the data (c) no error message and average CR recv time is consistent with prior weeks. Well, Let’s cut to the chase. It boiled down to a question “Can SQL plan change without a change in plan_hash_value?”. What do you think? Please answer in the poll below.

plan_hash_value and hash_value

Hash_value of a SQL statement is generated from the text of an SQL statement and plan_hash_value is generated from the execution plan of that SQL statement[ More accurately, from that child cursors' execution plan and exactly what is involved in generating plan_hash_value is not published]. It is a general belief that plan_hash_value will change even if there is a slightest change in the execution plan. But, that is not always the case!

Test case

We will use a small table to explore this issue.

prompt Test case #1: Initial test case with index on columns (n1, n2).
prompt ==========
create table t1 (n1 number, n2 number, v1 varchar2(100));
create index t1_n1 on t1(n1,n2);

explain plan for select * from t1 where n1=:b1 and n2=:b2;
select * from table (dbms_xplan.display);
prompt plan hash value in this case is 626762252

Plan hash value: 626762252
-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |     1 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("N1"=TO_NUMBER(:B1) AND "N2"=TO_NUMBER(:B2))

So, we got the plan_hash_value as highlighted above. In the following test case we will recreate the index reordering the columns as (n2, n1).

prompt Test case #2: Index with re-ordered columns
prompt ==========
drop index t1_n1;
create index t1_n1 on t1(n2,n1);

Plan hash value: 626762252

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |     9   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |     9   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |  5298 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("N2"=TO_NUMBER(:B2) AND "N1"=TO_NUMBER(:B1))

Notice that in the test result above filter predicate changed reflecting index column reordering. But the plan_hash_value did not change. Point is that execution plan can change without a change in plan_hash_value (due to change in the underlying tables).

Let’s modify that index dropping a column from the index in the next test case.

prompt Test case #3: dropping a column from index t1_n1.

drop index t1_n1;
create index t1_n1 on t1(n1);

explain plan for select * from t1 where n1=:b1 and n2=:b2;

Plan hash value: 626762252

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |    41   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |    41   (3)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |  2119 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N2"=TO_NUMBER(:B2))
   2 - access("N1"=TO_NUMBER(:B1))

We see that predicates changed but the plan_hash_value did not change. In the test case below, we will modify index to be a function based index and test this SQL statement. There are also few more self-evident test cases below.

prompt Test case #3: Index is a function based index. Still, no change in the plan_hash_value.

create index t1_n1 on t1(to_char(n1));
explain plan for select * from t1 where to_char(n1)=:b1 and n2=:b2;

Plan hash value: 626762252
-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |   127   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |   127   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |  2119 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N2"=TO_NUMBER(:B2))
   2 - access(TO_CHAR("N1")=:B1)

prompt Test case #4: Different schema and same SQL. plan_hash_value did not change.

Plan hash value: 626762252
-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |     1 |    78 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID| T1    |     1 |    78 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |     3 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("N2"=TO_NUMBER(:B2))
   2 - access("N1"=TO_NUMBER(:B1))

plan_hash_value is also case sensitive for index names. In the test cases below, we will create case sensitive indices.

prompt Test case #5: Lower case index_name.. plan_hash_value changed.
create index "t1_n1" on t1(n1,n2);
explain plan for select * from t1 where n1=:b1 and n2=:b2;

Plan hash value: 2252949961
-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | t1_n1 |     1 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

prompt Test case #6: Upper case index_name.. plan_hash_value did not change.
create index "T1_N1" on t1(n1,n2);
explain plan for select * from t1 where n1=:b1 and n2=:b2;
Plan hash value: 626762252

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |     1 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

prompt Test case #7: Space in the index name changes plan_hash_value though.
create index "T1_N1 " on t1(n1,n2);
explain plan for select * from t1 where n1=:b1 and n2=:b2;
Plan hash value: 1377541522
--------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |    53 |  4134 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1     |    53 |  4134 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1  |     1 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Summary

In summary, plan_hash_value is a very good indicator to see if the plan changed or not. Don’t get me wrong, I also use plan_hash_value, but in addition to comparing CPU time and elapsed time. Execution plan can change even when there is no change to the plan_hash_value. Salient points from these test cases are:

  1. Plan_hash_value is dependent upon partial execution plan not on complete execution plan.
  2. If a predicate is moved from filter_predicate to access_predicate or vice-versa, it doesn’t affect plan_hash_value.
  3. Changes in the parallelism of the queries does not affect plan_hash_value. For example, if a query used 4 parallel slaves today and 16 parallel slaves yesterday, that change is not visible through plan_hash_value. This behavior is expected as parallel_adaptive_multiuser can allocate slaves depending upon the state of that instance.
  4. Plan_hash_value is case-sensitive to index/table names. Also sensitive to white-space characters.
  5. Plan_hash_value is not sensitive to index types. For example, if the index type is a function based index as in our test case #4, as long as, index name did not change plan_hash_value will remain the same.
  6. Plan_hash_value is not sensitive to schema either. SQL statement accessing different schemas can have same plan_hash_value too.

Back to our problem. One of the index was recreated removing a column and caused optimizer to apply filter predicates at table level increasing number of accesses to table block tremendously, leading to more logical reads, more Global cache waits etc. This problem was amplified since this SQL was executed very frequently and concurrently from all RAC instances.

In my client’s defense, this application change was tested thoroughly. But, alas, test data chosen for this performance test was not probing this specific issue. This performance issue did not show up in development. Essentially, chosen data in the performance benchmark suite was not an excellent choice.

As they say in Gaelic language “Go raimh maith agat” to my client for allowing me to post this blog.
This can be read in traditional format in
plan_hash_value_and_gc
Update 1: Added the document and also corrected a typo.

August 6, 2009

ORA-4031 and Shared Pool Duration

After reading my earlier post on shared pool A stroll through shared pool heap , one of my client contacted me with an interesting ORA-4031 issue. Client was getting ORA-4031 errors and shared pool size was over 4GB ( in a RAC environment). Client DBA queried v$sgastat to show that there is plenty of free memory in the shared pool. We researched the issue and it is worth blogging. Client DBA was confused as to how there can be ORA-4031 errors when the shared pool free memory is few GBs.

Heapdump Analysis

At this point, it is imperative to take heapdump in level 2 and Level 2 is for the shared pool heap dump. [ Please be warned that it is not advisable to take shared pool heap dumps excessively, as that itself can cause performance issue. During an offline conversation, Tanel Poder said that heapdump can freeze instance as his clients have experienced.]. This will create a trace file in user_dump_dest destination and that trace file is quite useful in analyzing the contents of shared pool heap. Tanel Poder has an excellent script heapdump_analyzer . I modified that script adding code for aggregation at heap, extent and type levels to debug this issue further and it is available as heapdump_dissect.ksh . ( with a special permission from Tanel to publish this script.)

Shared pool review

You can read much more about shared pool in my earlier blog entry posted above. Just as a cursory review, shared pool is split in to multiple sub heaps. In 10g, each of those sub heaps are divided in to even smaller sub heaps, let’s call it mini-heaps. For example, in this specific database, there are three sub heaps. Each of those sub heaps are further split in to four mini-heaps (1,0), (1,1), (1,2) and (1,3) each.

	sga heap(1,0)
	sga heap(1,1)
	sga heap(1,2)
	sga heap(1,3)

	sga heap(2,0)
	sga heap(2,1)
	sga heap(2,2)
	sga heap(2,3)

	sga heap(3,0)
	sga heap(3,1)
	sga heap(3,2)
	sga heap(3,3)

One or more extents are allocated to these mini-heaps dynamically as these areas grow over the course of instance life until there is no more free memory to allocate. In this blog entry, let’s focus on the contents of these sub-heaps.
Following shows output of heapdump_dissect script at extent level aggregation. As you can see sga heap (1,0) is allocated with 34 extents of 16M each. [ 16M comes from _ksmg_granule_size.]

Extent level summary
------------------
Sub heap          Extent             Type              Allocation cmt   Size
---------------- -----------------   ---------------   --------------  --------
   sga heap(1,0)          EXTENT 0              free                   1792
   sga heap(1,0)          EXTENT 0              perm   perm            15096064
   sga heap(1,0)          EXTENT 0            R-free                   616
   sga heap(1,0)          EXTENT 0            R-perm   perm            1678560
   sga heap(1,0)          EXTENT 0        R-freeable   reserved stoppe 96
   sga heap(1,0)          EXTENT 1              free                   808
   sga heap(1,0)          EXTENT 1              perm   perm            15097048
   sga heap(1,0)          EXTENT 1            R-free                   1248
   sga heap(1,0)          EXTENT 1            R-perm   perm            1677928
   sga heap(1,0)          EXTENT 1        R-freeable   reserved stoppe 96
   sga heap(1,0)          EXTENT 2              free                   456
   sga heap(1,0)          EXTENT 2              perm   perm            15097400
   sga heap(1,0)          EXTENT 2            R-free                   1679176
   sga heap(1,0)          EXTENT 2        R-freeable   reserved stoppe 96
 ...
   sga heap(1,0)         EXTENT 34              free                   160
   sga heap(1,0)         EXTENT 34              perm   perm            15097696
   sga heap(1,0)         EXTENT 34            R-free                   1679176
   sga heap(1,0)         EXTENT 34        R-freeable   reserved stoppe 96

Duration: A KGH policy

It gets interesting here. Notice that all “sga heap(1,0)” chunks has allocation comment as “perm”. Allocation comment “perm” is passed for permanent chunks. In a nutshell, all permanent chunks are allocated from first mini-heap in each of these sub heaps i.e. sga heap(1,0), sga heap(2,0) and sga heap(3,0) and so on. They are only allocated in the first mini-heap and not in any other mini-heap in these sub-heaps. Another example: PL/SQL DIANA type chunks are allocated only in fourth mini-sub-heap (1,3) (2,3) and (3,3) [in this instance].

   sga heap(1,3)          freeable   PL/SQL DIANA    1318912
...
   sga heap(2,3)          freeable   PL/SQL DIANA    1675296
...
   sga heap(3,3)          freeable   PL/SQL DIANA    1458176

Point is that there is a new KGH policy based upon duration of a chunk. All chunks are classified based upon their type ( or more formally based upon the duration of that chunk) and the chunks with the same duration is allocated from the same shared pool mini-heaps. I think, this is a great idea, especially since only few heaps of shared pool tend to be over worked and that should not unnecessarily flush other parts of the sub-heap. Another important thing to consider is that shared pool extents can be de-allocated and allocated to other parts of SGA. For example, shared pool extents can be deallocated from shared pool and allocated to buffer cache to increase buffer cache size [ These reallocated chunks will be marked with a comment KGH:NO ACCESS ]. By keeping the perm chunks in the first mini-heap, extents can be deallocated quite easily, without the need to move the perm chunks to different areas.

So, What’s the problem?

Drawback is that these chunks are contained in that mini-heap. For example, let’s say that chunks with permanent duration must be allocated and there is no free space in the first mini-heap ( say sga heap(1,0) ) and if a new extent can’t be added to that mini-heap then ORA-4031 error is thrown even if there is plenty of free space in some other mini-heap ( say sga heap(1,1), sga heap(1,2) etc). So permanent chunks will be allocated only from sga heap(N, 0) [ where N is sub-heap id 1, 2,3... ] and if that mini-heap runs out of space, ORA-4031 will be thrown.

This is why client encountered ORA-4031 errors. Even though there was plenty of free memory in other pools, simply permanent chunks can not be allocated in the first mini-heap leading to ORA-4031 errors. Of course, other chunks in that mini-heap can not be deallocated either since those chunks are also permanent chunks. Chances of these errors occurring in other sub-heaps such as sga heap(N,1), sga heap(N,2) etc are less since recreatable/freeable chunks can be flushed to accommodate incoming requests. Quick resolution was to increase shared pool_size temporarily (until we can reduce perm chunk usage due to another issue, which will be resolved soon).

Parameter:_enable_shared_pool_durations

Above undocumented parameter controls this specific KGH policy and default is true. [Note that setting an undocumented parameter needs Oracle support blessing in a production database]. I tested this in my test database setting parameter _enable_shared_pool_durations=false and mini-heaps disappeared. If this parameter is false, then duration of a chunk is not considered. Following example shows that EXTENT 0 in sga heap(1,0) contains non-perm allocations if this parameter is set to FALSE.

EXTENT 0 addr=3CC00000
...
  Chunk 3cfdf23c sz=     4096    freeable  "sql area       "  ds=3CFE4E0C
  Chunk 3cfe023c sz=      540    recreate  "KQR PO         "  latch=41AE8FD4
...

Summary

In essence, duration enabled chunk allocation policy has been introduced in 10g [ I don't know exact version]. This is usually beneficial and ORA-4031 errors are possible even if there is plenty of free space in other mini-heaps.

This can be read in pdf format from Investigations:ORA-4031 and shared pool duration.

Also, there is a patent documents “duration based memory management” available in the web (patent US 6,253,226). Those documents provides in-depth analysis.

Update 1: Looks like these mini-heaps are formally called duration from an Oracle internal resource. There are exactly 4 durations in each sub-heap and each duration has its own free list. x$ksmsp.ksmchdur column indicates the duration of that chunk.

June 20, 2009

RAC, parallel query and udpsnoop

I presented about various performance myths in my ‘battle of the nodes’ presentation. One of the myth was that how spawning parallel query slaves across multiple RAC instances can cause major bottleneck in the interconnect. In fact, that myth was direct result of a lessons learnt presentation from a client engagement. Client was suffering from performance issues with enormous global cache waits running in to 30+ms average response time for global cache CR traffic and crippling application performance. Essentially, their data warehouse queries were performing hundreds of parallel queries concurrently with slaves spawning across three node RAC instances.

Of course, I had to hide the client details and simplified using a test case to explain the myth. Looks like either a)my test case is bad or b) some sort of bug I encountered in 9.2.0.5 version c) I made a mistake in my analysis somewhere. Most likely it is the last one :-( . Greg Rahn questioned that example and this topic deserves more research to understand this little bit further. At this point, I don’t have 9.2.0.5 and database is in 10.2.0.4 and so we will test this in 10.2.0.4.

udpsnoop

UDP is one of the protocol used for cache fusion traffic in RAC and it is the Oracle recommended protocol. In this article, UDP traffic size must be measured. Measuring Global cache traffic using AWR reports was not precise. So, I decided to use a dtrace tool kit tool:udpsnoop.d to measure the traffic between RAC nodes. There are two RAC nodes in this setup. You can read more about udpsnoop.d. That tool udpsnoop.d can be downloaded from dtrace toolkit . Output of this script is of the form:

PID        LADDR           LPORT           DR         RADDR           RPORT                 SIZE
---------- --------------- --------------- ---------- --------------- --------------- -----------
15393      1.1.59.192      38395           ->         2.1.59.192      40449                 8240
...

In the output above, PID 15393 sent an UDP packet of size 8240 from IP address 192.59.1.1 to 192.59.1.2 with local port as 38395 and remote port as 40449. As UDP traffic is flying between nodes, udpsnoop.d will print UDP traffic in to a file. So, I start collecting udpsnoop output before and end that collection immediately after our script is complete. Of course, we need to aggregate this data and play with it little bit, and so, will create an external table based upon this output file too.

- This is to read the file as an external table
drop table  external_udpsnoop;

create table external_udpsnoop
(
  c_uid varchar2(10),
  pid varchar2(10),
  laddr varchar2(15),
  lport varchar2(15),
  dr   varchar2(10),
  raddr varchar2(15),
  rport varchar2(15),
  c_size   varchar2(10),
  cmd  varchar2(15)
)
organization external (
  type oracle_loader
  default directory UTL_FILE_DIR
  access parameters (
      records delimited by newline
     badfile APPS_DATA_FILE_DIR:'xtern_rpt.bad'
      logfile APPS_DATA_FILE_DIR:'xtern_rpt.log'
      discardfile APPS_DATA_FILE_DIR:'xtern_rpt.dsc'
      fields terminated by whitespace
      missing field values are null
   )
  location ('udpsnoop_ra_join_2_8th_iter.lst')
 )
  reject limit 1000
/
REM Reject limit is high since there are few packets with some junk outputs, might be due unstable fbt in dtrace.

Test case #1: Hash join – slaves from all instances

First, let’s test for an hash join to show how UDP traffic is flowing between these ports. In this test case below, we use a big table and join a 10 Million rows to another 10 Million rows table. rownum is used so that script will complete in decent time, other wise, this ran for few hours before running in to errors. This selects few non-indexed columns so that SQL must do full table scan. SQL also has hint for 16 slaves from 2 instances.

Both instances will participate in this PQ operation as parallel_instance_group is set to ALL at session level.

alter session set parallel_instance_Group='ALL';
select /*+ parallel ( t1, 8,2)   parallel (t2, 8, 2)  */
min (t1.CUSTOMER_TRX_LINE_ID +t2.CUSTOMER_TRX_LINE_ID ) , max ( t1.SET_OF_BOOKS_ID+t2.set_of_books_id  ), avg(t1.SET_OF_BOOKS_ID +t2.set_of_books_id),
	avg( t1.QUANTITY_ORDERED + t2.QUANTITY_ORDERED ), max(t1.ATTRIBUTE_CATEGORY  ), max(t2.attribute1) , max(t1.attribute2)
from
  (select * from BIG_TABLE where rownum <=100000000)t1 ,
  (select * from BIG_TABLE where rownum <=100000000)t2
where t1.CUSTOMER_TRX_LINE_ID = t2.CUSTOMER_TRX_LINE_ID
;

PQ in operation

We will also use yet another script to make sure that SQL is indeed getting 8 slaves in each instance. I don’t remember, where I got this SQL to pull slaves information (may be Doug burns, Thanks!), but anyway, I modified that little bit for RAC.

   select
      s.inst_id,
      decode(px.qcinst_id,NULL,s.username,
            ' - '||lower(substr(s.program,length(s.program)-4,4) ) ) "Username",
      decode(px.qcinst_id,NULL, 'QC', '(Slave)') "QC/Slave" ,
      to_char( px.server_set) "Slave Set",
      to_char(s.sid) "SID",
      decode(px.qcinst_id, NULL ,to_char(s.sid) ,px.qcsid) "QC SID",
      px.req_degree "Requested DOP",
     px.degree "Actual DOP", p.spid
   from
     gv$px_session px,
     gv$session s, gv$process p
   where
     px.sid=s.sid (+) and
     px.serial#=s.serial# and
     px.inst_id = s.inst_id
     and p.inst_id = s.inst_id
     and p.addr=s.paddr
  order by 5 , 1 desc
SQL> /
   INST_ID Username     QC/Slave   Slave Set  SID    QC SID Requested DOP Actual DOP SPID
---------- ------------ ---------- ---------- ------ ------ ------------- ---------- ------------
         1 SYS          QC                    10931  10931                           7366
         1  - p000      (Slave)    1          10925  10931             16         16 24762
         1  - p001      (Slave)    1          10956  10931             16         16 24764
         1  - p002      (Slave)    1          10955  10931             16         16 24766
         1  - p003      (Slave)    1          10918  10931             16         16 24768
         1  - p004      (Slave)    1          10941  10931             16         16 24778
         1  - p005      (Slave)    1          10916  10931             16         16 24781
         1  - p006      (Slave)    1          10945  10931             16         16 24787
         1  - p007      (Slave)    1          10922  10931             16         16 24795
         2  - p000      (Slave)    1          10943  10931             16         16 16920
         2  - p001      (Slave)    1          10961  10931             16         16 16923
         2  - p002      (Slave)    1          10920  10931             16         16 16970
         2  - p003      (Slave)    1          10946  10931             16         16 16972
         2  - p004      (Slave)    1          10935  10931             16         16 16974
         2  - p005      (Slave)    1          10934  10931             16         16 16976
         2  - p006      (Slave)    1          10899  10931             16         16 16988
         2  - p007      (Slave)    1          10940  10931             16         16 16991

         1 SYS          QC                    10927  10927                           9476
         2  - pz99      (Slave)    1          10890  10927              2          2 17723
         1  - pz99      (Slave)    1          10912  10927              2          2 25875
20 rows selected.

From the output above 8 slaves are from instance 1 and 8 are from instance 2 allocated, with query coordinator ( 7366) running instance 1. Above sample also captured my own session accessing gv$ views ( Notice pz99, slaves for gv$ access use different PQ slave naming conventions from 10.2 onwards).

Results

We have established that slaves were allocated from multiple instances and udpsnoop is capturing UDP packet size between these instances. We also have external table mapping to that udpsnoop output file so as to query this data. Script completed in 1500 seconds. I mapped output of px slaves query above with UDP external table and here is the table I put together to show PQ slaves and their UDP size.

  INST Username     QC/Slave   Slave SID   QC SID Req. Actual   SPID  LADDR      RADDR      RPORT SIZE
                                 Set              DOP  DOP
------ ------------ ---------- ----- ----  ------ ---  ------- ------ ---------- ---------- ----- ---------
     1 SYS          QC               10931  10931               7366
     1  - p000      (Slave)    1     10925  10931  16       16 24762
     1  - p001      (Slave)    1     10956  10931  16       16 24764
     1  - p002      (Slave)    1     10955  10931  16       16 24766
     1  - p003      (Slave)    1     10918  10931  16       16 24768
     1  - p004      (Slave)    1     10941  10931  16       16 24778
     1  - p005      (Slave)    1     10916  10931  16       16 24781
     1  - p006      (Slave)    1     10945  10931  16       16 24787
     1  - p007      (Slave)    1     10922  10931  16       16 24795
     2  - p000      (Slave)    1     10943  10931  16       16 16920  2.1.59.192 2.1.59.192 62783 127068484
     2  - p001      (Slave)    1     10961  10931  16       16 16923  2.1.59.192 2.1.59.192 62783 126904080
     2  - p002      (Slave)    1     10920  10931  16       16 16970  2.1.59.192 2.1.59.192 62783 127767353
     2  - p003      (Slave)    1     10946  10931  16       16 16972  2.1.59.192 2.1.59.192 62783 128154145
     2  - p004      (Slave)    1     10935  10931  16       16 16974  2.1.59.192 2.1.59.192 62783 128096875
     2  - p005      (Slave)    1     10934  10931  16       16 16976  2.1.59.192 2.1.59.192 62783 126057311
     2  - p006      (Slave)    1     10899  10931  16       16 16988  2.1.59.192 2.1.59.192 62783 128228830
     2  - p007      (Slave)    1     10940  10931  16       16 16991  2.1.59.192 2.1.59.192 62783 127471579

Few important points to make here:

  1. In this case, all slaves running in node 2 were talking to one UDP port in node 1 (Port 62783). lsof shows that PID 7366 (Query co-ordinator) was listening on that UDP port. Point is that these slaves from node 2 were sending packats to the co-ordinator.
  2. Interestingly, there is no UDP traffic from instance 1 to 2. I think, that looks due to the nature of aggregation in the SQL.

Few minutes later..

Interestingly, few minutes later, while I was watching UDP traffic, few other processes kicked in and started generating UDP traffic. Re-queried the database again to see what these processes are. Query has allocated 16 more slaves, 8 more running from node 1 and 8 more running in node 2 [processes p008-p015 below]. These slaves were talking to a different UDP port 62789 which was also listened by coordinator process 7366 in node 1.

  INST Username     QC/Slave   Slave SID   QC SID Req. Actual   SPID  LADDR      RADDR      RPORT SIZE
                                 Set              DOP  DOP
------ ------------ ---------- ----- ----  ------ ---  ------- ------ ---------- ---------- ----- ---------
     1 SYS          QC               10931  10931               7366
     1  - p000      (Slave)    1     10925  10931  16       16 24762
     1  - p001      (Slave)    1     10956  10931  16       16 24764
     1  - p002      (Slave)    1     10955  10931  16       16 24766
     1  - p003      (Slave)    1     10918  10931  16       16 24768
     1  - p004      (Slave)    1     10941  10931  16       16 24778
     1  - p005      (Slave)    1     10916  10931  16       16 24781
     1  - p006      (Slave)    1     10945  10931  16       16 24787
     1  - p007      (Slave)    1     10922  10931  16       16 24795

     1  - p008      (Slave)    1     10958  10931  16       16 24798
     1  - p009      (Slave)    1     10938  10931  16       16 24818
     1  - p010      (Slave)    1     10965  10931  16       16 24836
     1  - p011      (Slave)    1     10953  10931  16       16 24838
     1  - p012      (Slave)    1     10946  10931  16       16 24841
     1  - p013      (Slave)    1     10929  10931  16       16 24843
     1  - p014      (Slave)    1     10919  10931  16       16 24853
     1  - p015      (Slave)    1     10942  10931  16       16 24855

     2  - p000      (Slave)    1     10943  10931  16       16 16920  2.1.59.192 2.1.59.192 62783 127068484
     2  - p001      (Slave)    1     10961  10931  16       16 16923  2.1.59.192 2.1.59.192 62783 126904080
     2  - p002      (Slave)    1     10920  10931  16       16 16970  2.1.59.192 2.1.59.192 62783 127767353
     2  - p003      (Slave)    1     10946  10931  16       16 16972  2.1.59.192 2.1.59.192 62783 128154145
     2  - p004      (Slave)    1     10935  10931  16       16 16974  2.1.59.192 2.1.59.192 62783 128096875
     2  - p005      (Slave)    1     10934  10931  16       16 16976  2.1.59.192 2.1.59.192 62783 126057311
     2  - p006      (Slave)    1     10899  10931  16       16 16988  2.1.59.192 2.1.59.192 62783 128228830
     2  - p007      (Slave)    1     10940  10931  16       16 16991  2.1.59.192 2.1.59.192 62783 127471579

     2  - p008      (Slave)    1     10911  10931  16       16 16993  2.1.59.192 2.1.59.192 62989 182053370
     2  - p009      (Slave)    1     10949  10931  16       16 16995  2.1.59.192 2.1.59.192 62989 182490908
     2  - p010      (Slave)    1     10951  10931  16       16 17000  2.1.59.192 2.1.59.192 62989 181899025
     2  - p011      (Slave)    1     10890  10931  16       16 17007  2.1.59.192 2.1.59.192 62989 181858294
     2  - p012      (Slave)    1     10972  10931  16       16 17009  2.1.59.192 2.1.59.192 62989 182104499
     2  - p013      (Slave)    1     10950  10931  16       16 17011  2.1.59.192 2.1.59.192 62989 182334705
     2  - p014      (Slave)    1     10902  10931  16       16 17013  2.1.59.192 2.1.59.192 62989 181611641
     2  - p015      (Slave)    1     10955  10931  16       16 17023  2.1.59.192 2.1.59.192 62989 181816693

In real life..

Summing this up, approximately, 2.4GB of UDP traffic was generated with one parallel query. Can you imagine what will happen if this inter-instance parallelism is allowed in data warehouse queries scanning many tables and partitions with many hash joins? Obviously, this has the effect of saturating Interconnect quickly and so performance will suffer. Our solution was to disallow parallel queries spawning multiple instances. All of them will be running within an instance boundary and effect of this change was immediately visible in the client environment. Back to our test, parallel_execution_message_size was set to 8192. Increasing this parameter will decrease elapsed time little bit, but we are worried about saturating interconnect traffic not just elapsed time of that query.

Further, I ran this query with parallel_instance_group set to one instance and then all instances, few times. Spawning across multiple instances, in fact, increases elapsed time too.

Parallel_instance_group :ALL
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.20       0.26          0         76          0           0
Fetch        2   1481.76    1509.95     701158         76          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4   1481.96    1510.22     701158        152          0           1

parallel_instance_group :INST1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.20       0.23          0         76          0           0
Fetch        2   1321.05    1331.67     701344         76          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4   1321.25    1331.90     701344        152          0           1

What about tha original example ?

Of course, let’s talk about that original example also. In this example, there was just one table and data was aggregated.

select /*+ parallel ( t1, 8,2)  */
min (t1.CUSTOMER_TRX_LINE_ID +t1.CUSTOMER_TRX_LINE_ID ) , max ( t1.SET_OF_BOOKS_ID+t1.set_of_books_id  ), avg(t1.SET_OF_BOOKS_ID +t1.set_of_books_id),
        avg( t1.QUANTITY_ORDERED + t1.QUANTITY_ORDERED ), max(t1.ATTRIBUTE_CATEGORY  ), max(t1.attribute1) , max(t1.attribute2)
from
  BIG_TABLE t1
;

Measuring, UDP traffic, It is visible that for this huge table, Interconnect traffic is kept minimal. It looks like, there are some optimization techniques for this single table aggregation query minimizing cache fusion traffic to a minimal level, just 2152. This convinces that, just the SQL in that myth is a bad example, but that myth is still a myth. I should have used original SQL joining multiple tables with hash join for my presentation, but as a consultant, I have a responsibility to keep clients information confidential and protect. At the end of the day, they pay for my bread.

   INST_ID Username     QC/Slave   Slave Set  SID    QC SID Requested DOP Actual DOP SPID		  Size
---------- ------------ ---------- ---------- ------ ------ ------------- ---------- ------------ -------
         1 SYS          QC                    10933  10933                           3314
         1  - p000      (Slave)    1          10958  10933             16         16 24762
         1  - p001      (Slave)    1          10948  10933             16         16 24764
         1  - p002      (Slave)    1          10953  10933             16         16 24766
         1  - p003      (Slave)    1          10925  10933             16         16 24768
         1  - p004      (Slave)    1          10916  10933             16         16 24778
         1  - p005      (Slave)    1          10938  10933             16         16 24781
         1  - p006      (Slave)    1          10951  10933             16         16 24787
         1  - p007      (Slave)    1          10946  10933             16         16 24795

         2  - p000      (Slave)    1          10949  10933             16         16 16920            2152
         2  - p001      (Slave)    1          10937  10933             16         16 16923            2152
         2  - p002      (Slave)    1          10946  10933             16         16 16970            2152
         2  - p003      (Slave)    1          10956  10933             16         16 16972            2152
         2  - p004      (Slave)    1          10902  10933             16         16 16974            2152
         2  - p005      (Slave)    1          10981  10933             16         16 16976            2152
         2  - p006      (Slave)    1          10899  10933             16         16 16988            2152
         2  - p007      (Slave)    1          10927  10933             16         16 16991            2152

         1 SYS          QC                    10945  10945                           3527
         1  - pz99      (Slave)    1          10942  10945              2          2 25875
         2  - pz99      (Slave)    1          10962  10945              2          2 17723           72344

Summary

In summary, having too many parallel query slaves spawning across multiple instances can cripple interconnect. There are some optimization techniques that seems to help in the case of single table aggregation and of course, that must be considered as an exception. I have modified the presentation little bit below, but will do a second and complete update on this presentation later:

Battle of the nodes RAC performance myths doc
Battle of the nodes RAC performance myths ppt

Blog at WordPress.com.