Oracle database internals by Riyaj

September 13, 2009

Is plan_hash_value a final say?

I was reviewing a performance issue with a client recently. Problem is that increased global cache waits causing application slowdown affecting few critical business functions. Using one of my script gc_traffic.sql > and graphing the results with Excel spreadsheet, it is established that there is a marked increase in GC traffic today compared to week earlier. Similar jobs runs every day and so comparing two week days is sufficient to show the GC traffic increase. Graph is between total blocks and AWR snap time in 30 minutes interval. [Click the picture below to review the graph clearly.]

Identifying the object creating this increased GC traffic is essential to identify root cause. We were able to quickly determine that this increase in GC traffic was localized around few SQL statements using ADDM and AWR reports. We decided to focus on one SQL with an obvious increase in elapsed time compared to prior week. So, first question asked, is there a change in the plan? plan_hash_value was reviewed and quickly determined that there is no change in the plan_hash_value.

Little bit of history, there were few changes performed by the vendor over the weekend as a part of few bug fixes. Vendor’s argument was that since there is no change to the plan_hash_value, SQL access plan did not change and so, this can’t be due to vendor changes. My Client’s argument was that there were no changes to the environment and problem started after the application changes.

There are many different things that can go wrong without changes to the execution plan. We can ignore those conditions for now since (a) there has been no changes to the environment (b) no visible changes to the data (c) no error message and average CR recv time is consistent with prior weeks. Well, Let’s cut to the chase. It boiled down to a question “Can SQL plan change without a change in plan_hash_value?”. What do you think? Please answer in the poll below.

plan_hash_value and hash_value

Hash_value of a SQL statement is generated from the text of an SQL statement and plan_hash_value is generated from the execution plan of that SQL statement[ More accurately, from that child cursors' execution plan and exactly what is involved in generating plan_hash_value is not published]. It is a general belief that plan_hash_value will change even if there is a slightest change in the execution plan. But, that is not always the case!

Test case

We will use a small table to explore this issue.

prompt Test case #1: Initial test case with index on columns (n1, n2).
prompt ==========
create table t1 (n1 number, n2 number, v1 varchar2(100));
create index t1_n1 on t1(n1,n2);

explain plan for select * from t1 where n1=:b1 and n2=:b2;
select * from table (dbms_xplan.display);
prompt plan hash value in this case is 626762252

Plan hash value: 626762252
-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |     1 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("N1"=TO_NUMBER(:B1) AND "N2"=TO_NUMBER(:B2))

So, we got the plan_hash_value as highlighted above. In the following test case we will recreate the index reordering the columns as (n2, n1).

prompt Test case #2: Index with re-ordered columns
prompt ==========
drop index t1_n1;
create index t1_n1 on t1(n2,n1);

Plan hash value: 626762252

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |     9   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |     9   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |  5298 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("N2"=TO_NUMBER(:B2) AND "N1"=TO_NUMBER(:B1))

Notice that in the test result above filter predicate changed reflecting index column reordering. But the plan_hash_value did not change. Point is that execution plan can change without a change in plan_hash_value (due to change in the underlying tables).

Let’s modify that index dropping a column from the index in the next test case.

prompt Test case #3: dropping a column from index t1_n1.

drop index t1_n1;
create index t1_n1 on t1(n1);

explain plan for select * from t1 where n1=:b1 and n2=:b2;

Plan hash value: 626762252

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |    41   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |    41   (3)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |  2119 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N2"=TO_NUMBER(:B2))
   2 - access("N1"=TO_NUMBER(:B1))

We see that predicates changed but the plan_hash_value did not change. In the test case below, we will modify index to be a function based index and test this SQL statement. There are also few more self-evident test cases below.

prompt Test case #3: Index is a function based index. Still, no change in the plan_hash_value.

create index t1_n1 on t1(to_char(n1));
explain plan for select * from t1 where to_char(n1)=:b1 and n2=:b2;

Plan hash value: 626762252
-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |   127   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |   127   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |  2119 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N2"=TO_NUMBER(:B2))
   2 - access(TO_CHAR("N1")=:B1)

prompt Test case #4: Different schema and same SQL. plan_hash_value did not change.

Plan hash value: 626762252
-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |     1 |    78 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID| T1    |     1 |    78 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |     3 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("N2"=TO_NUMBER(:B2))
   2 - access("N1"=TO_NUMBER(:B1))

plan_hash_value is also case sensitive for index names. In the test cases below, we will create case sensitive indices.

prompt Test case #5: Lower case index_name.. plan_hash_value changed.
create index "t1_n1" on t1(n1,n2);
explain plan for select * from t1 where n1=:b1 and n2=:b2;

Plan hash value: 2252949961
-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | t1_n1 |     1 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

prompt Test case #6: Upper case index_name.. plan_hash_value did not change.
create index "T1_N1" on t1(n1,n2);
explain plan for select * from t1 where n1=:b1 and n2=:b2;
Plan hash value: 626762252

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    53 |  4134 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    53 |  4134 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1 |     1 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

prompt Test case #7: Space in the index name changes plan_hash_value though.
create index "T1_N1 " on t1(n1,n2);
explain plan for select * from t1 where n1=:b1 and n2=:b2;
Plan hash value: 1377541522
--------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |    53 |  4134 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1     |    53 |  4134 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_N1  |     1 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Summary

In summary, plan_hash_value is a very good indicator to see if the plan changed or not. Don’t get me wrong, I also use plan_hash_value, but in addition to comparing CPU time and elapsed time. Execution plan can change even when there is no change to the plan_hash_value. Salient points from these test cases are:

  1. Plan_hash_value is dependent upon partial execution plan not on complete execution plan.
  2. If a predicate is moved from filter_predicate to access_predicate or vice-versa, it doesn’t affect plan_hash_value.
  3. Changes in the parallelism of the queries does not affect plan_hash_value. For example, if a query used 4 parallel slaves today and 16 parallel slaves yesterday, that change is not visible through plan_hash_value. This behavior is expected as parallel_adaptive_multiuser can allocate slaves depending upon the state of that instance.
  4. Plan_hash_value is case-sensitive to index/table names. Also sensitive to white-space characters.
  5. Plan_hash_value is not sensitive to index types. For example, if the index type is a function based index as in our test case #4, as long as, index name did not change plan_hash_value will remain the same.
  6. Plan_hash_value is not sensitive to schema either. SQL statement accessing different schemas can have same plan_hash_value too.

Back to our problem. One of the index was recreated removing a column and caused optimizer to apply filter predicates at table level increasing number of accesses to table block tremendously, leading to more logical reads, more Global cache waits etc. This problem was amplified since this SQL was executed very frequently and concurrently from all RAC instances.

In my client’s defense, this application change was tested thoroughly. But, alas, test data chosen for this performance test was not probing this specific issue. This performance issue did not show up in development. Essentially, chosen data in the performance benchmark suite was not an excellent choice.

As they say in Gaelic language “Go raimh maith agat” to my client for allowing me to post this blog.
This can be read in traditional format in
plan_hash_value_and_gc
Update 1: Added the document and also corrected a typo.

August 6, 2009

ORA-4031 and Shared Pool Duration

After reading my earlier post on shared pool A stroll through shared pool heap , one of my client contacted me with an interesting ORA-4031 issue. Client was getting ORA-4031 errors and shared pool size was over 4GB ( in a RAC environment). Client DBA queried v$sgastat to show that there is plenty of free memory in the shared pool. We researched the issue and it is worth blogging. Client DBA was confused as to how there can be ORA-4031 errors when the shared pool free memory is few GBs.

Heapdump Analysis

At this point, it is imperative to take heapdump in level 2 and Level 2 is for the shared pool heap dump. [ Please be warned that it is not advisable to take shared pool heap dumps excessively, as that itself can cause performance issue. During an offline conversation, Tanel Poder said that heapdump can freeze instance as his clients have experienced.]. This will create a trace file in user_dump_dest destination and that trace file is quite useful in analyzing the contents of shared pool heap. Tanel Poder has an excellent script heapdump_analyzer . I modified that script adding code for aggregation at heap, extent and type levels to debug this issue further and it is available as heapdump_dissect.ksh . ( with a special permission from Tanel to publish this script.)

Shared pool review

You can read much more about shared pool in my earlier blog entry posted above. Just as a cursory review, shared pool is split in to multiple sub heaps. In 10g, each of those sub heaps are divided in to even smaller sub heaps, let’s call it mini-heaps. For example, in this specific database, there are three sub heaps. Each of those sub heaps are further split in to four mini-heaps (1,0), (1,1), (1,2) and (1,3) each.

	sga heap(1,0)
	sga heap(1,1)
	sga heap(1,2)
	sga heap(1,3)

	sga heap(2,0)
	sga heap(2,1)
	sga heap(2,2)
	sga heap(2,3)

	sga heap(3,0)
	sga heap(3,1)
	sga heap(3,2)
	sga heap(3,3)

One or more extents are allocated to these mini-heaps dynamically as these areas grow over the course of instance life until there is no more free memory to allocate. In this blog entry, let’s focus on the contents of these sub-heaps.
Following shows output of heapdump_dissect script at extent level aggregation. As you can see sga heap (1,0) is allocated with 34 extents of 16M each. [ 16M comes from _ksmg_granule_size.]

Extent level summary
------------------
Sub heap          Extent             Type              Allocation cmt   Size
---------------- -----------------   ---------------   --------------  --------
   sga heap(1,0)          EXTENT 0              free                   1792
   sga heap(1,0)          EXTENT 0              perm   perm            15096064
   sga heap(1,0)          EXTENT 0            R-free                   616
   sga heap(1,0)          EXTENT 0            R-perm   perm            1678560
   sga heap(1,0)          EXTENT 0        R-freeable   reserved stoppe 96
   sga heap(1,0)          EXTENT 1              free                   808
   sga heap(1,0)          EXTENT 1              perm   perm            15097048
   sga heap(1,0)          EXTENT 1            R-free                   1248
   sga heap(1,0)          EXTENT 1            R-perm   perm            1677928
   sga heap(1,0)          EXTENT 1        R-freeable   reserved stoppe 96
   sga heap(1,0)          EXTENT 2              free                   456
   sga heap(1,0)          EXTENT 2              perm   perm            15097400
   sga heap(1,0)          EXTENT 2            R-free                   1679176
   sga heap(1,0)          EXTENT 2        R-freeable   reserved stoppe 96
 ...
   sga heap(1,0)         EXTENT 34              free                   160
   sga heap(1,0)         EXTENT 34              perm   perm            15097696
   sga heap(1,0)         EXTENT 34            R-free                   1679176
   sga heap(1,0)         EXTENT 34        R-freeable   reserved stoppe 96

Duration: A KGH policy

It gets interesting here. Notice that all “sga heap(1,0)” chunks has allocation comment as “perm”. Allocation comment “perm” is passed for permanent chunks. In a nutshell, all permanent chunks are allocated from first mini-heap in each of these sub heaps i.e. sga heap(1,0), sga heap(2,0) and sga heap(3,0) and so on. They are only allocated in the first mini-heap and not in any other mini-heap in these sub-heaps. Another example: PL/SQL DIANA type chunks are allocated only in fourth mini-sub-heap (1,3) (2,3) and (3,3) [in this instance].

   sga heap(1,3)          freeable   PL/SQL DIANA    1318912
...
   sga heap(2,3)          freeable   PL/SQL DIANA    1675296
...
   sga heap(3,3)          freeable   PL/SQL DIANA    1458176

Point is that there is a new KGH policy based upon duration of a chunk. All chunks are classified based upon their type ( or more formally based upon the duration of that chunk) and the chunks with the same duration is allocated from the same shared pool mini-heaps. I think, this is a great idea, especially since only few heaps of shared pool tend to be over worked and that should not unnecessarily flush other parts of the sub-heap. Another important thing to consider is that shared pool extents can be de-allocated and allocated to other parts of SGA. For example, shared pool extents can be deallocated from shared pool and allocated to buffer cache to increase buffer cache size [ These reallocated chunks will be marked with a comment KGH:NO ACCESS ]. By keeping the perm chunks in the first mini-heap, extents can be deallocated quite easily, without the need to move the perm chunks to different areas.

So, What’s the problem?

Drawback is that these chunks are contained in that mini-heap. For example, let’s say that chunks with permanent duration must be allocated and there is no free space in the first mini-heap ( say sga heap(1,0) ) and if a new extent can’t be added to that mini-heap then ORA-4031 error is thrown even if there is plenty of free space in some other mini-heap ( say sga heap(1,1), sga heap(1,2) etc). So permanent chunks will be allocated only from sga heap(N, 0) [ where N is sub-heap id 1, 2,3... ] and if that mini-heap runs out of space, ORA-4031 will be thrown.

This is why client encountered ORA-4031 errors. Even though there was plenty of free memory in other pools, simply permanent chunks can not be allocated in the first mini-heap leading to ORA-4031 errors. Of course, other chunks in that mini-heap can not be deallocated either since those chunks are also permanent chunks. Chances of these errors occurring in other sub-heaps such as sga heap(N,1), sga heap(N,2) etc are less since recreatable/freeable chunks can be flushed to accommodate incoming requests. Quick resolution was to increase shared pool_size temporarily (until we can reduce perm chunk usage due to another issue, which will be resolved soon).

Parameter:_enable_shared_pool_durations

Above undocumented parameter controls this specific KGH policy and default is true. [Note that setting an undocumented parameter needs Oracle support blessing in a production database]. I tested this in my test database setting parameter _enable_shared_pool_durations=false and mini-heaps disappeared. If this parameter is false, then duration of a chunk is not considered. Following example shows that EXTENT 0 in sga heap(1,0) contains non-perm allocations if this parameter is set to FALSE.

EXTENT 0 addr=3CC00000
...
  Chunk 3cfdf23c sz=     4096    freeable  "sql area       "  ds=3CFE4E0C
  Chunk 3cfe023c sz=      540    recreate  "KQR PO         "  latch=41AE8FD4
...

Summary

In essence, duration enabled chunk allocation policy has been introduced in 10g [ I don't know exact version]. This is usually beneficial and ORA-4031 errors are possible even if there is plenty of free space in other mini-heaps.

This can be read in pdf format from Investigations:ORA-4031 and shared pool duration.

Also, there is a patent documents “duration based memory management” available in the web (patent US 6,253,226). Those documents provides in-depth analysis.

Update 1: Looks like these mini-heaps are formally called duration from an Oracle internal resource. There are exactly 4 durations in each sub-heap and each duration has its own free list. x$ksmsp.ksmchdur column indicates the duration of that chunk.

June 20, 2009

RAC, parallel query and udpsnoop

I presented about various performance myths in my ‘battle of the nodes’ presentation. One of the myth was that how spawning parallel query slaves across multiple RAC instances can cause major bottleneck in the interconnect. In fact, that myth was direct result of a lessons learnt presentation from a client engagement. Client was suffering from performance issues with enormous global cache waits running in to 30+ms average response time for global cache CR traffic and crippling application performance. Essentially, their data warehouse queries were performing hundreds of parallel queries concurrently with slaves spawning across three node RAC instances.

Of course, I had to hide the client details and simplified using a test case to explain the myth. Looks like either a)my test case is bad or b) some sort of bug I encountered in 9.2.0.5 version c) I made a mistake in my analysis somewhere. Most likely it is the last one :-( . Greg Rahn questioned that example and this topic deserves more research to understand this little bit further. At this point, I don’t have 9.2.0.5 and database is in 10.2.0.4 and so we will test this in 10.2.0.4.

udpsnoop

UDP is one of the protocol used for cache fusion traffic in RAC and it is the Oracle recommended protocol. In this article, UDP traffic size must be measured. Measuring Global cache traffic using AWR reports was not precise. So, I decided to use a dtrace tool kit tool:udpsnoop.d to measure the traffic between RAC nodes. There are two RAC nodes in this setup. You can read more about udpsnoop.d. That tool udpsnoop.d can be downloaded from dtrace toolkit . Output of this script is of the form:

PID        LADDR           LPORT           DR         RADDR           RPORT                 SIZE
---------- --------------- --------------- ---------- --------------- --------------- -----------
15393      1.1.59.192      38395           ->         2.1.59.192      40449                 8240
...

In the output above, PID 15393 sent an UDP packet of size 8240 from IP address 192.59.1.1 to 192.59.1.2 with local port as 38395 and remote port as 40449. As UDP traffic is flying between nodes, udpsnoop.d will print UDP traffic in to a file. So, I start collecting udpsnoop output before and end that collection immediately after our script is complete. Of course, we need to aggregate this data and play with it little bit, and so, will create an external table based upon this output file too.

- This is to read the file as an external table
drop table  external_udpsnoop;

create table external_udpsnoop
(
  c_uid varchar2(10),
  pid varchar2(10),
  laddr varchar2(15),
  lport varchar2(15),
  dr   varchar2(10),
  raddr varchar2(15),
  rport varchar2(15),
  c_size   varchar2(10),
  cmd  varchar2(15)
)
organization external (
  type oracle_loader
  default directory UTL_FILE_DIR
  access parameters (
      records delimited by newline
     badfile APPS_DATA_FILE_DIR:'xtern_rpt.bad'
      logfile APPS_DATA_FILE_DIR:'xtern_rpt.log'
      discardfile APPS_DATA_FILE_DIR:'xtern_rpt.dsc'
      fields terminated by whitespace
      missing field values are null
   )
  location ('udpsnoop_ra_join_2_8th_iter.lst')
 )
  reject limit 1000
/
REM Reject limit is high since there are few packets with some junk outputs, might be due unstable fbt in dtrace.

Test case #1: Hash join – slaves from all instances

First, let’s test for an hash join to show how UDP traffic is flowing between these ports. In this test case below, we use a big table and join a 10 Million rows to another 10 Million rows table. rownum is used so that script will complete in decent time, other wise, this ran for few hours before running in to errors. This selects few non-indexed columns so that SQL must do full table scan. SQL also has hint for 16 slaves from 2 instances.

Both instances will participate in this PQ operation as parallel_instance_group is set to ALL at session level.

alter session set parallel_instance_Group='ALL';
select /*+ parallel ( t1, 8,2)   parallel (t2, 8, 2)  */
min (t1.CUSTOMER_TRX_LINE_ID +t2.CUSTOMER_TRX_LINE_ID ) , max ( t1.SET_OF_BOOKS_ID+t2.set_of_books_id  ), avg(t1.SET_OF_BOOKS_ID +t2.set_of_books_id),
	avg( t1.QUANTITY_ORDERED + t2.QUANTITY_ORDERED ), max(t1.ATTRIBUTE_CATEGORY  ), max(t2.attribute1) , max(t1.attribute2)
from
  (select * from BIG_TABLE where rownum <=100000000)t1 ,
  (select * from BIG_TABLE where rownum <=100000000)t2
where t1.CUSTOMER_TRX_LINE_ID = t2.CUSTOMER_TRX_LINE_ID
;

PQ in operation

We will also use yet another script to make sure that SQL is indeed getting 8 slaves in each instance. I don’t remember, where I got this SQL to pull slaves information (may be Doug burns, Thanks!), but anyway, I modified that little bit for RAC.

   select
      s.inst_id,
      decode(px.qcinst_id,NULL,s.username,
            ' - '||lower(substr(s.program,length(s.program)-4,4) ) ) "Username",
      decode(px.qcinst_id,NULL, 'QC', '(Slave)') "QC/Slave" ,
      to_char( px.server_set) "Slave Set",
      to_char(s.sid) "SID",
      decode(px.qcinst_id, NULL ,to_char(s.sid) ,px.qcsid) "QC SID",
      px.req_degree "Requested DOP",
     px.degree "Actual DOP", p.spid
   from
     gv$px_session px,
     gv$session s, gv$process p
   where
     px.sid=s.sid (+) and
     px.serial#=s.serial# and
     px.inst_id = s.inst_id
     and p.inst_id = s.inst_id
     and p.addr=s.paddr
  order by 5 , 1 desc
SQL> /
   INST_ID Username     QC/Slave   Slave Set  SID    QC SID Requested DOP Actual DOP SPID
---------- ------------ ---------- ---------- ------ ------ ------------- ---------- ------------
         1 SYS          QC                    10931  10931                           7366
         1  - p000      (Slave)    1          10925  10931             16         16 24762
         1  - p001      (Slave)    1          10956  10931             16         16 24764
         1  - p002      (Slave)    1          10955  10931             16         16 24766
         1  - p003      (Slave)    1          10918  10931             16         16 24768
         1  - p004      (Slave)    1          10941  10931             16         16 24778
         1  - p005      (Slave)    1          10916  10931             16         16 24781
         1  - p006      (Slave)    1          10945  10931             16         16 24787
         1  - p007      (Slave)    1          10922  10931             16         16 24795
         2  - p000      (Slave)    1          10943  10931             16         16 16920
         2  - p001      (Slave)    1          10961  10931             16         16 16923
         2  - p002      (Slave)    1          10920  10931             16         16 16970
         2  - p003      (Slave)    1          10946  10931             16         16 16972
         2  - p004      (Slave)    1          10935  10931             16         16 16974
         2  - p005      (Slave)    1          10934  10931             16         16 16976
         2  - p006      (Slave)    1          10899  10931             16         16 16988
         2  - p007      (Slave)    1          10940  10931             16         16 16991

         1 SYS          QC                    10927  10927                           9476
         2  - pz99      (Slave)    1          10890  10927              2          2 17723
         1  - pz99      (Slave)    1          10912  10927              2          2 25875
20 rows selected.

From the output above 8 slaves are from instance 1 and 8 are from instance 2 allocated, with query coordinator ( 7366) running instance 1. Above sample also captured my own session accessing gv$ views ( Notice pz99, slaves for gv$ access use different PQ slave naming conventions from 10.2 onwards).

Results

We have established that slaves were allocated from multiple instances and udpsnoop is capturing UDP packet size between these instances. We also have external table mapping to that udpsnoop output file so as to query this data. Script completed in 1500 seconds. I mapped output of px slaves query above with UDP external table and here is the table I put together to show PQ slaves and their UDP size.

  INST Username     QC/Slave   Slave SID   QC SID Req. Actual   SPID  LADDR      RADDR      RPORT SIZE
                                 Set              DOP  DOP
------ ------------ ---------- ----- ----  ------ ---  ------- ------ ---------- ---------- ----- ---------
     1 SYS          QC               10931  10931               7366
     1  - p000      (Slave)    1     10925  10931  16       16 24762
     1  - p001      (Slave)    1     10956  10931  16       16 24764
     1  - p002      (Slave)    1     10955  10931  16       16 24766
     1  - p003      (Slave)    1     10918  10931  16       16 24768
     1  - p004      (Slave)    1     10941  10931  16       16 24778
     1  - p005      (Slave)    1     10916  10931  16       16 24781
     1  - p006      (Slave)    1     10945  10931  16       16 24787
     1  - p007      (Slave)    1     10922  10931  16       16 24795
     2  - p000      (Slave)    1     10943  10931  16       16 16920  2.1.59.192 2.1.59.192 62783 127068484
     2  - p001      (Slave)    1     10961  10931  16       16 16923  2.1.59.192 2.1.59.192 62783 126904080
     2  - p002      (Slave)    1     10920  10931  16       16 16970  2.1.59.192 2.1.59.192 62783 127767353
     2  - p003      (Slave)    1     10946  10931  16       16 16972  2.1.59.192 2.1.59.192 62783 128154145
     2  - p004      (Slave)    1     10935  10931  16       16 16974  2.1.59.192 2.1.59.192 62783 128096875
     2  - p005      (Slave)    1     10934  10931  16       16 16976  2.1.59.192 2.1.59.192 62783 126057311
     2  - p006      (Slave)    1     10899  10931  16       16 16988  2.1.59.192 2.1.59.192 62783 128228830
     2  - p007      (Slave)    1     10940  10931  16       16 16991  2.1.59.192 2.1.59.192 62783 127471579

Few important points to make here:

  1. In this case, all slaves running in node 2 were talking to one UDP port in node 1 (Port 62783). lsof shows that PID 7366 (Query co-ordinator) was listening on that UDP port. Point is that these slaves from node 2 were sending packats to the co-ordinator.
  2. Interestingly, there is no UDP traffic from instance 1 to 2. I think, that looks due to the nature of aggregation in the SQL.

Few minutes later..

Interestingly, few minutes later, while I was watching UDP traffic, few other processes kicked in and started generating UDP traffic. Re-queried the database again to see what these processes are. Query has allocated 16 more slaves, 8 more running from node 1 and 8 more running in node 2 [processes p008-p015 below]. These slaves were talking to a different UDP port 62789 which was also listened by coordinator process 7366 in node 1.

  INST Username     QC/Slave   Slave SID   QC SID Req. Actual   SPID  LADDR      RADDR      RPORT SIZE
                                 Set              DOP  DOP
------ ------------ ---------- ----- ----  ------ ---  ------- ------ ---------- ---------- ----- ---------
     1 SYS          QC               10931  10931               7366
     1  - p000      (Slave)    1     10925  10931  16       16 24762
     1  - p001      (Slave)    1     10956  10931  16       16 24764
     1  - p002      (Slave)    1     10955  10931  16       16 24766
     1  - p003      (Slave)    1     10918  10931  16       16 24768
     1  - p004      (Slave)    1     10941  10931  16       16 24778
     1  - p005      (Slave)    1     10916  10931  16       16 24781
     1  - p006      (Slave)    1     10945  10931  16       16 24787
     1  - p007      (Slave)    1     10922  10931  16       16 24795

     1  - p008      (Slave)    1     10958  10931  16       16 24798
     1  - p009      (Slave)    1     10938  10931  16       16 24818
     1  - p010      (Slave)    1     10965  10931  16       16 24836
     1  - p011      (Slave)    1     10953  10931  16       16 24838
     1  - p012      (Slave)    1     10946  10931  16       16 24841
     1  - p013      (Slave)    1     10929  10931  16       16 24843
     1  - p014      (Slave)    1     10919  10931  16       16 24853
     1  - p015      (Slave)    1     10942  10931  16       16 24855

     2  - p000      (Slave)    1     10943  10931  16       16 16920  2.1.59.192 2.1.59.192 62783 127068484
     2  - p001      (Slave)    1     10961  10931  16       16 16923  2.1.59.192 2.1.59.192 62783 126904080
     2  - p002      (Slave)    1     10920  10931  16       16 16970  2.1.59.192 2.1.59.192 62783 127767353
     2  - p003      (Slave)    1     10946  10931  16       16 16972  2.1.59.192 2.1.59.192 62783 128154145
     2  - p004      (Slave)    1     10935  10931  16       16 16974  2.1.59.192 2.1.59.192 62783 128096875
     2  - p005      (Slave)    1     10934  10931  16       16 16976  2.1.59.192 2.1.59.192 62783 126057311
     2  - p006      (Slave)    1     10899  10931  16       16 16988  2.1.59.192 2.1.59.192 62783 128228830
     2  - p007      (Slave)    1     10940  10931  16       16 16991  2.1.59.192 2.1.59.192 62783 127471579

     2  - p008      (Slave)    1     10911  10931  16       16 16993  2.1.59.192 2.1.59.192 62989 182053370
     2  - p009      (Slave)    1     10949  10931  16       16 16995  2.1.59.192 2.1.59.192 62989 182490908
     2  - p010      (Slave)    1     10951  10931  16       16 17000  2.1.59.192 2.1.59.192 62989 181899025
     2  - p011      (Slave)    1     10890  10931  16       16 17007  2.1.59.192 2.1.59.192 62989 181858294
     2  - p012      (Slave)    1     10972  10931  16       16 17009  2.1.59.192 2.1.59.192 62989 182104499
     2  - p013      (Slave)    1     10950  10931  16       16 17011  2.1.59.192 2.1.59.192 62989 182334705
     2  - p014      (Slave)    1     10902  10931  16       16 17013  2.1.59.192 2.1.59.192 62989 181611641
     2  - p015      (Slave)    1     10955  10931  16       16 17023  2.1.59.192 2.1.59.192 62989 181816693

In real life..

Summing this up, approximately, 2.4GB of UDP traffic was generated with one parallel query. Can you imagine what will happen if this inter-instance parallelism is allowed in data warehouse queries scanning many tables and partitions with many hash joins? Obviously, this has the effect of saturating Interconnect quickly and so performance will suffer. Our solution was to disallow parallel queries spawning multiple instances. All of them will be running within an instance boundary and effect of this change was immediately visible in the client environment. Back to our test, parallel_execution_message_size was set to 8192. Increasing this parameter will decrease elapsed time little bit, but we are worried about saturating interconnect traffic not just elapsed time of that query.

Further, I ran this query with parallel_instance_group set to one instance and then all instances, few times. Spawning across multiple instances, in fact, increases elapsed time too.

Parallel_instance_group :ALL
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.20       0.26          0         76          0           0
Fetch        2   1481.76    1509.95     701158         76          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4   1481.96    1510.22     701158        152          0           1

parallel_instance_group :INST1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.20       0.23          0         76          0           0
Fetch        2   1321.05    1331.67     701344         76          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4   1321.25    1331.90     701344        152          0           1

What about tha original example ?

Of course, let’s talk about that original example also. In this example, there was just one table and data was aggregated.

select /*+ parallel ( t1, 8,2)  */
min (t1.CUSTOMER_TRX_LINE_ID +t1.CUSTOMER_TRX_LINE_ID ) , max ( t1.SET_OF_BOOKS_ID+t1.set_of_books_id  ), avg(t1.SET_OF_BOOKS_ID +t1.set_of_books_id),
        avg( t1.QUANTITY_ORDERED + t1.QUANTITY_ORDERED ), max(t1.ATTRIBUTE_CATEGORY  ), max(t1.attribute1) , max(t1.attribute2)
from
  BIG_TABLE t1
;

Measuring, UDP traffic, It is visible that for this huge table, Interconnect traffic is kept minimal. It looks like, there are some optimization techniques for this single table aggregation query minimizing cache fusion traffic to a minimal level, just 2152. This convinces that, just the SQL in that myth is a bad example, but that myth is still a myth. I should have used original SQL joining multiple tables with hash join for my presentation, but as a consultant, I have a responsibility to keep clients information confidential and protect. At the end of the day, they pay for my bread.

   INST_ID Username     QC/Slave   Slave Set  SID    QC SID Requested DOP Actual DOP SPID		  Size
---------- ------------ ---------- ---------- ------ ------ ------------- ---------- ------------ -------
         1 SYS          QC                    10933  10933                           3314
         1  - p000      (Slave)    1          10958  10933             16         16 24762
         1  - p001      (Slave)    1          10948  10933             16         16 24764
         1  - p002      (Slave)    1          10953  10933             16         16 24766
         1  - p003      (Slave)    1          10925  10933             16         16 24768
         1  - p004      (Slave)    1          10916  10933             16         16 24778
         1  - p005      (Slave)    1          10938  10933             16         16 24781
         1  - p006      (Slave)    1          10951  10933             16         16 24787
         1  - p007      (Slave)    1          10946  10933             16         16 24795

         2  - p000      (Slave)    1          10949  10933             16         16 16920            2152
         2  - p001      (Slave)    1          10937  10933             16         16 16923            2152
         2  - p002      (Slave)    1          10946  10933             16         16 16970            2152
         2  - p003      (Slave)    1          10956  10933             16         16 16972            2152
         2  - p004      (Slave)    1          10902  10933             16         16 16974            2152
         2  - p005      (Slave)    1          10981  10933             16         16 16976            2152
         2  - p006      (Slave)    1          10899  10933             16         16 16988            2152
         2  - p007      (Slave)    1          10927  10933             16         16 16991            2152

         1 SYS          QC                    10945  10945                           3527
         1  - pz99      (Slave)    1          10942  10945              2          2 25875
         2  - pz99      (Slave)    1          10962  10945              2          2 17723           72344

Summary

In summary, having too many parallel query slaves spawning across multiple instances can cripple interconnect. There are some optimization techniques that seems to help in the case of single table aggregation and of course, that must be considered as an exception. I have modified the presentation little bit below, but will do a second and complete update on this presentation later:

Battle of the nodes RAC performance myths doc
Battle of the nodes RAC performance myths ppt

Blog at WordPress.com.