Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internals & E-Business Suite.

Cardinality feedback to resolve a Cache buffers chains latch contention issue

Posted by Riyaj Shamsudeen on January 2, 2009

I blogged about resolving cache buffers chains latch contention in an earlier entry, in which the root cause was excessive index access due to a Nested Loops join. Recently, we resolved another, similar issue.

Problem

CPU usage was very high on the production database server, mostly in user (%usr) mode. The dynamic performance view v$session_wait indicated excessive waits for latch contention. Output from the script wait_details.sql (the RAC-aware version, wait_details_rac.sql, is available at the end of this post) shows that many sessions were waiting for the ‘latch free’ event. Also, the latch address is the same for all of these waits, meaning all of these sessions are trying to acquire one child latch.


SQL> @wait_details

   SID PID     EVENT         USERNAME  STATE               WAIT_TIME   WIS P1_P2_P3_TEXT
------ ------- ------------- --------- ---------- ------------------- --------- ----- -----------------------------------
    91  24242  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   101   4884  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   116  23899  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   187  19499  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   108  23498  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 3
   194  23701  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   202  26254  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 4
   220  23274  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   227  23643  latch free    CSTMOP    WAITED KNOWN TIME           2     0 address 69476807024-number 98-tries 0
   331  26519  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   297  23934  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 3
....

We can identify the SQL causing latch contention by querying v$session_wait joined to v$session. From the output below, the SQL with hash value 1509082258 is suspicious, since many sessions executing that SQL are waiting, or waited recently, for the ‘latch free’ event.


select substr(w.event, 1, 28 ) event, sql_hash_value, count(*)
from v$session_wait w, v$session s, v$process p
where s.sid=w.sid
and p.addr = s.paddr
and s.username is not null
and event not like '%pipe%'
and event not like 'SQL*%'
group by substr(w.event, 1, 28), sql_hash_value
;

EVENT                          SQL_HASH_VALUE   COUNT(*)
------------------------------ -------------- ----------
enqueue                               3740270          1
enqueue                             747790152          1
enqueue                            1192921796          1
latch free                          622474477          3
latch free                         1509082258         58 <---
latch free                         1807800540          1
global cache null to x                3740270          1
global cache null to x             1473456670          1
global cache null to x             3094935671          1
db file sequential read             109444956          1

Mapping to object_name

We need to map the child latch at address 69476807024 to an object. Fortunately, using latch_cbc_to_buf.sql (a script written earlier; included at the end of this post), we were able to do that mapping quickly. The script also prints the touch count for the buffers protected by each latch.
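A side note on formats: v$session_wait reports the latch address (p1) in decimal, while v$latch_children.addr and x$bh.hladdr are raw values displayed in hex. A quick conversion (a sketch) confirms that the decimal address above and the HLADDR below refer to the same child latch:

REM 69476807024 (decimal, from the wait output above) converted to hex:
select to_char(69476807024, 'XXXXXXXXXXXXXXXX') hex_addr from dual;
REM --> 102D23F170, matching HLADDR 000000102D23F170 below.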


REM Not all columns are shown below.
SQL>@latch_cbc_to_buf.sql
HLADDR                  TCH OWNER                          OBJECT_NAME              OBJECT_TYPE        
---------------- ---------- ------------------------------ ------------------------ ------------------ 
000000102D23F170        336 CCWINV                         CUS_MTL_MATERIAL_TXNS_C3 INDEX             
000000102D23F170         51 APPLSYS                        FND_CONCURRENT_REQUESTS  TABLE              
000000102D23F170         47 AR                             HZ_PARTY_SITES           TABLE              
...

From the output above, we know that the CUS_MTL_MATERIAL_TXNS_C3 index is at the heart of this latch contention issue, since that object has a much higher touch count (x$bh.tch, a rough measure of how frequently a buffer is accessed) than the other objects protected by that child latch.

SQL and execution plan

Querying v$sql, we retrieved the SQL associated with this hash value. The execution plan for this SQL is very long and has many branches joined by a ‘union all’ operation. Searching for the index CUS_MTL_MATERIAL_TXNS_C3 in the execution plan shows that this index is used in the last two branches of the plan. For clarity, only part of the plan is printed below. [Note: v$sql_plan also confirmed this execution plan.]
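For reference, the full SQL text can be pulled from v$sqltext with the hash value identified above (a minimal sketch; in this 9i database statements are keyed by hash_value, and v$sql.sql_text alone would truncate a long statement to its first 1000 characters):

select sql_text
  from v$sqltext
 where hash_value = 1509082258
 order by piece;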


explain plan for sql_here ;
select * from table(dbms_xplan.display);

| 122 |        VIEW                                 |                             |     1 |    15 |            |
| 123 |         SORT GROUP BY                       |                             |     1 |    32 |    22  (10)|
| 124 |          VIEW                               |                             |     1 |    32 |            |
| 125 |           SORT UNIQUE                       |                             |     1 |   120 |    22  (10)|
|*126 |            TABLE ACCESS BY INDEX ROWID      | MTL_MATERIAL_TRANSACTIONS   |     1 |    29 |     5  (20)|
| 127 |             NESTED LOOPS                    |                             |     1 |   120 |    21   (5)|
| 128 |              MERGE JOIN CARTESIAN           |                             |     1 |    91 |    17   (6)|
| 129 |               NESTED LOOPS                  |                             |     1 |    59 |     4  (25)|
| 130 |                TABLE ACCESS BY INDEX ROWID  | RCV_TRANSACTIONS_INTERFACE  |    39 |   507 |     3  (34)|
|*131 |                 INDEX FULL SCAN             | CUS_RCV_TXNS_INTERFACE_C3   |    39 |       |     2  (50)|
|*132 |                TABLE ACCESS BY INDEX ROWID  | RCV_SHIPMENT_HEADERS        |     1 |    46 |     2  (50)|
|*133 |                 INDEX UNIQUE SCAN           | RCV_SHIPMENT_HEADERS_U1     |     1 |       |     2  (50)|
| 134 |               BUFFER SORT                   |                             |    71 |  2272 |    15   (0)|
| 135 |                INLIST ITERATOR              |                             |       |       |            |
| 136 |                 TABLE ACCESS BY INDEX ROWID | CUS_INV_RTL_DOCUMENTS       |    71 |  2272 |    14   (8)|
|*137 |                  INDEX RANGE SCAN           | CUS_INV_RTL_DOCUMENTS_N4    |    71 |       |     4  (25)|
| 138 |              INLIST ITERATOR                |                             |       |       |            |
|*139 |               INDEX RANGE SCAN              | CUS_MTL_MATERIAL_TXNS_C3    |     1 |       |     4  (25)|
--------------------------------------------------------------

Step 128 is a key indicator of the problem. Rows from steps 129 and 134 are joined using the merge join cartesian method! Obviously, a cartesian join will generate a huge number of rows, since there is no join condition between those two row sources [similar to a cartesian product]. The resulting rows of this cartesian join are then joined to MTL_MATERIAL_TRANSACTIONS using the nested loops join method, through the index CUS_MTL_MATERIAL_TXNS_C3. The reason the CBO chose a cartesian join is that the cardinality estimate at step 129 is 1, which is incorrect [but that is a different topic altogether].

So far, we know why the blocks of that index are accessed so frequently: it is a side effect of the cartesian merge join producing an enormous number of rows. If this SQL is executed concurrently from many different sessions, the latch contention on the index root block is magnified.
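To put rough numbers on it (an illustration with assumed row counts, since actual run-time cardinalities were not captured here): if the cartesian join at step 128 really produces 1,000 rows rather than the estimated 1, each of those rows drives the range scan of CUS_MTL_MATERIAL_TXNS_C3 at step 139 via the nested loops join at step 127, and every range scan begins at the index root block. With the 58 concurrent sessions seen earlier doing the same, that is on the order of 58,000 gets on the same root block buffer, each of them requiring the same cache buffers chains child latch.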

What changed?

This is an existing application and was working fine until a few hours earlier. So, what changed?

Statistics. As a process, we collect statistics in a cloned copy of the production database and then import those statistics into the production database. A few other reorgs were performed over the weekend, but those do not appear to have had any negative effect. We were fortunate enough to have another development environment with one-month-old data and statistics. Comparing the execution plan for that branch of the SQL in the development instance reveals something peculiar and interesting.

--------------------------------------------------------------------------------------------------------
| Id  | Operation                           |  Name                       | Rows  | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                             |     1 |    33 |   313   (1)|
|   1 |  SORT GROUP BY                      |                             |     1 |    33 |   313   (1)|
|   2 |   VIEW                              |                             |     1 |    33 |            |
|   3 |    SORT UNIQUE                      |                             |     1 |   122 |   313   (1)|
|   4 |     TABLE ACCESS BY INDEX ROWID     | RCV_TRANSACTIONS_INTERFACE  |     1 |    14 |     3  (34)|
|   5 |      NESTED LOOPS                   |                             |     1 |   122 |   312   (1)|
|   6 |       NESTED LOOPS                  |                             |     1 |   108 |   310   (1)|
|   7 |        NESTED LOOPS                 |                             |     1 |    62 |   307   (1)|
|   8 |         INLIST ITERATOR             |                             |       |       |            |
|   9 |          TABLE ACCESS BY INDEX ROWID| CUS_INV_RTL_DOCUMENTS       |    73 |  2336 |    14   (8)|
|* 10 |           INDEX RANGE SCAN          | CUS_INV_RTL_DOCUMENTS_N4    |    73 |       |     4  (25)|
|  11 |         INLIST ITERATOR             |                             |       |       |            |
|* 12 |          TABLE ACCESS BY INDEX ROWID| MTL_MATERIAL_TRANSACTIONS   |     1 |    30 |     5  (20)|
|* 13 |           INDEX RANGE SCAN          | CUS_MTL_MATERIAL_TXNS_C3    |     1 |       |     4  (25)|
|* 14 |        TABLE ACCESS BY INDEX ROWID  | RCV_SHIPMENT_HEADERS        |     1 |    46 |     4  (25)|
|* 15 |         INDEX RANGE SCAN            | RCV_SHIPMENT_HEADERS_N2     |     1 |       |     3  (34)|
|* 16 |       INDEX RANGE SCAN              | CUS_RCV_TXNS_INTERFACE_C3   |     5 |       |     2  (50)|
--------------------------------------------------------------------------------------------------------

Predicate information:
---------------------
...
  16 - access("RT"."SHIPMENT_HEADER_ID"="RSH"."SHIPMENT_HEADER_ID")
       filter("RT"."SHIPMENT_HEADER_ID" IS NOT NULL)
...

The cardinality estimates for RCV_TRANSACTIONS_INTERFACE, for identical predicates, are 5 (step 16) in the efficient plan (development database) and 39 (steps 130 and 131) in the inefficient plan (production database). This increase in cardinality caused the optimizer to choose a completely different plan. Interestingly, RCV_TRANSACTIONS_INTERFACE is an interface table, and while statistics were being collected on this table in the pre-production environment, we had a special-case transaction in flight. That invalid state of the table generated not-so-good statistics, which were then transferred to production.
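A quick way to sanity-check such estimates is to compare them with the actual row count for the same predicate (a sketch; the PO schema owner is an assumption, and the predicate is simplified from the predicate section above):

REM Actual rows vs. the optimizer's estimates (39 in production, 5 in development):
select count(*)
  from po.rcv_transactions_interface rt
 where rt.shipment_header_id is not null;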

Easily enough, recollecting statistics on the RCV_TRANSACTIONS_INTERFACE table reverted the execution plan back to the older, efficient plan.
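For completeness, here is a hedged sketch of both the statistics transfer described earlier and the corrective regather (the PO schema owner, the stat-table name, and the options shown are assumptions, not the exact commands we ran; E-Business Suite sites would typically use the fnd_stats wrapper rather than raw dbms_stats):

REM Transfer path: run in the cloned copy, then ship the stat table to production.
exec dbms_stats.create_stat_table(ownname => 'SYSTEM', stattab => 'PROD_STATS');
exec dbms_stats.export_table_stats(ownname => 'PO', tabname => 'RCV_TRANSACTIONS_INTERFACE', -
       stattab => 'PROD_STATS', statown => 'SYSTEM');
REM The fix: regather statistics while the interface table is in a
REM representative (near-empty) state; cascade => true covers its indexes.
exec dbms_stats.gather_table_stats(ownname => 'PO', tabname => 'RCV_TRANSACTIONS_INTERFACE', -
       cascade => true);

Had the plan not reverted, importing the previously saved statistics with dbms_stats.import_table_stats would have been the fallback.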

Summary

In summary, we were able to pinpoint the object under contention through the cardinality feedback method. With a few scripts, we identified the object and resolved the root cause of this performance issue.

Scripts

--------------------------wait_details_rac.sql -------------------------------
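REM Functionality: Prints wait details (event, state, wait time, and the
REM                p1/p2/p3 wait parameters) for connected foreground
REM                sessions across all instances, using gv$ views.
REM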
column event format A30
column PID format A10
column P1_P2_P3_TEXT format A50
column username format A10
column osuser format A10
column sid format 99999
column wait_time format 99
column wis format 9999
column inst_id format 99
set lines 170
set pages 100
select w.inst_id, w.sid, to_char(p.spid,'99999') PID,
substr(w.event, 1, 28) event,
substr(s.username,1,20) username,
substr(s.osuser, 1,10) osuser,
w.state,
w.wait_time, w.seconds_in_wait wis,
substr(w.p1text||' '||to_char(w.P1)||'-'||
w.p2text||' '||to_char(w.P2)||'-'||
w.p3text||' '||to_char(w.P3), 1, 45)
P1_P2_P3_TEXT --, module
from gv$session_wait w, gv$session s, gv$process p
where s.sid=w.sid
and p.addr = s.paddr
and s.username is not null
and w.event not like 'SQL*%'
and w.event not like '%pipe%'
and w.inst_id = s.inst_id
and s.inst_id =p.inst_id
order by w.inst_id
/
----------------------latch_cbc_to_buf.sql----------------------------------------------------------
REM Author: Riyaj Shamsudeen @OraInternals, LLC
REM         www.orainternals.com
REM
REM Functionality: This script prints the objects protected by the child latches that processes are currently waiting for.
REM **************
REM
REM This quickly helps us to identify object_names.
REM
REM Note : 1. This SQL can be modified to use GV$ views, but beware that can saturate the cluster interconnect.
REM        2. Keep window 160 columns for better visibility.
REM
REM Execution type: Execute from sqlplus or any other tool.
REM
REM
REM No implied or explicit warranty
REM
REM Please send me an email to rshamsud@orainternals.com, if you enhance this script :-)
REM --------------------------------------------------------------------------------------------------

with bh_lc as
(select /*+ ORDERED */
 lc.addr, lc.child#, lc.gets, lc.misses, lc.immediate_gets, lc.immediate_misses, lc.spin_gets, lc.sleeps,
 bh.hladdr, bh.tch tch, bh.file#, bh.dbablk, bh.class, bh.state, bh.obj
  from
   x$kslld  ld,
   v$session_wait sw,
   v$latch_children lc,
   x$bh bh
  where lc.addr =sw.p1raw
  and sw.p2= ld.indx
  and ld.kslldnam='cache buffers chains'
  and lower(sw.event) like '%latch%'
     --     and state='WAITING'
  and bh.hladdr=lc.addr
)
select  bh_lc.hladdr, bh_lc.tch, o.owner, o.object_name, o.object_type, bh_lc.child#, bh_lc.gets, bh_lc.misses, bh_lc.immediate_gets,
bh_lc.immediate_misses, spin_gets, sleeps
 from
  bh_lc, dba_objects o
 where bh_lc.obj = o.object_id(+)
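-- x$bh.obj may match either dba_objects.object_id or data_object_id,
-- so the two branches probe both and combine the results with a union.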
union
select   bh_lc.hladdr, bh_lc.tch, o.owner, o.object_name, o.object_type, bh_lc.child#, bh_lc.gets, bh_lc.misses, bh_lc.immediate_gets,
bh_lc.immediate_misses, spin_gets, sleeps
from
  bh_lc, dba_objects o
 where bh_lc.obj = o.data_object_id(+)
order by 1,2 desc
/

Environment: Oracle version 9.2.0.8 on the Solaris platform.
[ To read more about cardinality feedback, refer to Wolfgang Breitling’s excellent presentation. ]
This blog entry can also be read in a traditional document format in the investigations section: Cardinality feedback for latch contention resolution.

7 Responses to “Cardinality feedback to resolve a Cache buffers chains latch contention issue”

  1. Well, the key to the problem seems to be the use of the old development environment.

    If you were not so lucky… the work would be a bit harder. You would have to check the estimated cardinality (from the gathered statistics) against the real cardinality of all the tables in the query.

    Am I right?

    Another problem could have been a short table that became a “big” one, in which case you would have to rework the query or rethink some Oracle parameters.

    Am I right?

  2. Hello lascoltodelvenerdi

    Thanks for reading my blog.

    The core of this blog entry is root cause analysis for the latch contention. The SQL plan change is the root cause, and cardinality feedback comes into the picture to answer why the CBO chose a different plan.

    Yes, we were fortunate to have the older plan for a quick comparison. Of course, there are other ways to achieve this result, e.g. reviewing a 10053 trace, exporting/importing older statistics to development, reviewing statspack plans, comparing real cardinality against estimates, etc.
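    [A quick aside: the 10053 optimizer trace mentioned above is enabled with: alter session set events '10053 trace name context forever, level 1'; the trace file is written to user_dump_dest.]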

    Hypothetically speaking, if a short table had grown into a bigger one [and caused this issue], then we would have resolved it (a) by importing older statistics [thereby relieving the immediate pressure upon us] and (b) then analyzing this SQL further in a copy of production for a permanent solution.

    Cheers
    Riyaj

  3. Alexwebmaster said

    Hello webmaster
    I would like to share with you a link to your site
    write me here preonrelt@mail.ru

  4. http://mensengagementrings.ca said

    I am glad to be one of the visitors to this outstanding web site (:, thank you for putting it up.

  5. Ganesh said

    Hi Riyaj

    Great article. I’m a follower of your blog. If I get any issue related to locking or contention, I first go through your blog and try to understand the real issue. The way you present every article is excellent.

    But I’m not able to download the scripts in this blog. Making them downloadable would be helpful for every follower of your blog, like me. Thanks

    • Hello Ganesh,
      Thanks for pointing out the script issue. I updated the blog entry. I lost all of my orainternals.com content due to a hacking incident, and I have been too busy (or too lazy?) to bring that site back up. If you find more script issues, let me know and I will upload the scripts.

      • Ganesh said

        Hi Riyaj

        Great!!! Thanks so much for the reply and for uploading the scripts. We got golden eggs 🙂 If you upload the related scripts for every article and presentation, it will be really helpful for us. Always appreciated and thankful.

        Regards
        Ganesh
