Dynamic_plan_table, x$kqlfxpl and extreme library cache latch contention
Posted by Riyaj Shamsudeen on March 13, 2009
We had an interesting latch contention issue with a client database that is worth sharing. The client complained that they were seeing increased library cache latch contention in production with no logical explanation (yet), and asked us to find the root cause. The performance issue seemed to have started recently, occurred intermittently, and also resolved by itself. The client was planning to upgrade to 10g in a four-week time frame, and this issue needed to be resolved to avoid any delays in the 10g upgrade.
From statspack analysis and various statistics, it was visible that there was an increase in library cache latch contention, but nothing jumped out as a problem. Of course, this is a very active application [a 5-node RAC application] with a very high number of literal SQLs, but that is nothing new; it has always been this way. No new dynamic SQLs, no DDLs invalidating cursors, or anything of that sort. I started watching the production instance, in real time, waiting for the re-occurrence of this issue. It occurred!
Latchprofx to the rescue
At this very juncture, the instance was suffering from latch contention and it was visible that sessions were starting to queue up for latches.
I was ready with the necessary tools. Tanel Poder has written a great tool, latchprofx, to sample v$latch_holder at high frequency [actually, it uses x$ tables, but you get the idea]. In this case, high-frequency sampling is a must since this performance problem disappears within a few minutes. I executed the latchprofx script for library cache latches with 10,000 samples, as printed below.
SQL> @/tmp/latchprofx sid,name % library 10000

-- LatchProfX 1.06 by Tanel Poder ( http://www.tanelpoder.com )

   SID NAME                                      Held  Held %     Held ms
------ ----------------------------------- ---------- ------- -----------
   140 library cache                             9555   95.55   25167.870
   185 library cache                              198    1.98     521.532
   245 library cache                               97     .97     255.498
   185 library cache pin                           69     .69     181.746
   245 library cache pin                           39     .39     102.726
   240 library cache                               33     .33      86.922
   240 library cache pin                            6     .06      15.804
   227 library cache                                2     .02       5.268
   227 library cache pin                            1     .01       2.634
Yikes. We can see that one session was holding a library cache latch in 95.5% of the samples. That seems a bit odd.
Of course, there are many latch children for the library cache latch. Is that session repeatedly acquiring just one latch child? Or is it acquiring many different latch children? More importantly, what is that session executing?
Latchprofx to the rescue again. Latchprofx can also print the latch children's addresses, so I re-executed the latchprofx script with a modified argument list. The only difference between these two calls is that the first argument to latchprofx now includes laddr.
@/tmp/latchprofx sid,name,laddr % library 10000

   SID NAME                                LADDR                  Held  Held %     Held ms
------ ----------------------------------- ---------------- ---------- ------- -----------
   140 library cache                       0000000990FEFA18        298    2.98     783.442
   140 library cache                       0000000990FED2B8        270    2.70     709.830
   140 library cache                       0000000990FEF298        254    2.54     667.766
   140 library cache                       0000000990FEF658        254    2.54     667.766
   140 library cache                       0000000990FEECF8        240    2.40     630.960
   140 library cache                       0000000990FF0198        236    2.36     620.444
   140 library cache                       0000000990FEE488        225    2.25     591.525
   140 library cache                       0000000990FEFCE8        223    2.23     586.267
   140 library cache                       0000000990FEEED8        222    2.22     583.638
   140 library cache                       0000000990FEF388        222    2.22     583.638
   140 library cache                       0000000990FEDEE8        220    2.20     578.380
   140 library cache                       0000000990FF00A8        216    2.16     567.864
   140 library cache                       0000000990FEE758        215    2.15     565.235
   140 library cache                       0000000990FEEFC8        212    2.12     557.348
   140 library cache                       0000000990FEE938        211    2.11     554.719
   140 library cache                       0000000990FEE668        205    2.05     538.945
   140 library cache                       0000000990FEDB28        201    2.01     528.429
....
Whoa! This session was acquiring nearly all latch children. Notice that the latch addresses are different in every row of the output, meaning that successive latchprofx samples saw this session holding a different latch child each time. In a nutshell, this session was acquiring and releasing ALL library cache latches (aka children) in a quasi-sequential fashion and causing a massive performance issue. There are 53 latch children for the library cache latch in this database. The Held % column is approximately 2% for each child, indicating that all library cache latch children were acquired and released in a quasi-uniform (roughly 2% each across ~50 children), quasi-cyclical fashion! The same pattern repeats if I reduce or increase the number of samples.
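As a quick sanity check, the library cache latch child count quoted above can be confirmed directly with a one-liner along these lines (run as a privileged user):

REM Count the library cache latch children in this instance.
select count(*)
  from v$latch_children
 where name = 'library cache';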
Latch greedy session
Initially, I thought this was a bug. I had not seen a case where all library cache latch children were needed to execute a single SQL, or a process causing massive latch contention this way. [Okay, I confess, I have simulated that kind of issue in my test database by holding library cache latch children with oradebug kslgetl, but that is a different topic altogether.] Session 140 was executing the SQL printed below. This SQL queries the execution plan from the shared pool for a specific hash_value in this 9i database (equivalent to dbms_xplan.display_cursor in 10g).
select plan_table_output
from TABLE( dbms_xplan.display
        ( 'backup.dynamic_plan_table',
          (select rawtohex(address)||'_'||child_number x
             from v$sql
            where hash_value = &hv),
          'serial' ) )
/
I turned on sqltrace for that session, and tkprof showed that a recursive SQL generated by the dbms_xplan package was the costly SQL.
REM Removing few lines to improve readability.

SELECT /* EXEC_FROM_DBMS_XPLAN */
       id, position, level, operation, options, object_name,
       cardinality, bytes, temp_space, cost, io_cost, cpu_cost,
       decode(partition_start, 'ROW LOCATION', ....
  from BACKUP.DYNAMIC_PLAN_TABLE
 start with id = 0
   and timestamp >= (select max(timestamp)
                       from BACKUP.DYNAMIC_PLAN_TABLE
                      where id = 0
                        and statement_id = '00000009BB9B6D50_0'
                        and nvl(statement_id, ' ') not like 'SYS_LE%')
   and nvl(statement_id, ' ') not like 'SYS_LE%'
   and statement_id = '00000009BB9B6D50_0'
 connect by (prior id = parent_id
             and prior nvl(statement_id, ' ') = nvl(statement_id, ' ')
             and prior timestamp <= timestamp)
         or (prior nvl(object_name, ' ') like 'SYS_LE%'
             and prior nvl(object_name, ' ') = nvl(statement_id, ' ')
             and id = 0
             and prior timestamp <= timestamp)
 order siblings by id
/

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.03       0.02          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch       30     53.79      55.90        423     375837          0          29
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       32     53.82      55.93        423     375837          0          29
X$KQLFXPL
55 seconds to execute this SQL? That seems a bit excessive. This SQL is the only time-consuming recursive SQL generated from the select statement, so, somehow, it must be causing the latch contention. I will print only the relevant lines from the plan output.
....
|   4 |     NESTED LOOPS OUTER            |           |     1 |  4479 |    21  (5)|
|   5 |      NESTED LOOPS OUTER           |           |     1 |  4475 |    20  (5)|
|*  6 |       FIXED TABLE FULL            | X$KQLFXPL |     1 |  4444 |           |  <---- FTS
|   7 |       TABLE ACCESS BY INDEX ROWID | OBJ$      |     1 |    31 |     3 (34)|
|*  8 |        INDEX UNIQUE SCAN          | I_OBJ1    |     1 |       |     2 (50)|
|   9 |      TABLE ACCESS CLUSTER         | USER$     |     1 |     4 |     2 (50)|
|* 10 |       INDEX UNIQUE SCAN           | I_USER#   |     1 |       |           |
....
   6 - filter("P"."INST_ID"=:B1 AND "P"."KQLFXPL_HADD"<>"P"."KQLFXPL_PHAD")
Line #6 is the problem. The table x$kqlfxpl is accessed using a full table scan, and it looks like x$kqlfxpl cannot be accessed without holding library cache latches. Since there is an FTS on this fixed table, almost every library cache object needs to be touched, which potentially means all library cache objects must be inspected. Further, accessing this x$ table means the library cache buckets (the _kgl_bucket_count controlled area) need to be walked, with latches acquired and released before each memory area is inspected. Makes sense!
Repeated calls to kqlfxp, kglic, and kglic0 were also visible in the stack output. These calls suggest that library cache latches were being acquired and released. Systemstate dumps confirm that too.
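For reference, one way to sample the call stack of the latch-holding session is oradebug; the OS pid below is hypothetical, and systemstate dump levels vary by version:

SQL> oradebug setospid 12345        -- server process of the latch-holding session (SID 140 here)
SQL> oradebug short_stack           -- repeat a few times; look for kqlfxp/kglic/kglic0 frames
SQL> oradebug dump systemstate 10   -- heavier option: full systemstate dump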
I was also able to reproduce this issue by querying x$kqlfxpl in a cloned database.
SQL> select /*+ full(a) */ count(*) from x$kqlfxpl a;

  COUNT(*)
----------
     48926

-------------------------------------------------------------------------
| Id  | Operation            |  Name       | Rows  | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |             |     1 |       |    18   (6)|
|   1 |  SORT AGGREGATE      |             |     1 |       |            |
|   2 |   FIXED TABLE FULL   | X$KQLFXPL   |   100 |       |            |
-------------------------------------------------------------------------
Running the latchprofx script from another session concurrently shows that accessing this x$ table results in excessive library cache latching activity. Note that the Rows estimate for this table is 100, which is the CBO's default cardinality estimate when there are no statistics on a table.
@/tmp/latchprofx sid,name,laddr % library 10000

   SID NAME                                LADDR                  Held  Held %     Held ms
------ ----------------------------------- ---------------- ---------- ------- -----------
   127 library cache                       0000000990FED2B8         26     .26      68.120
   127 library cache                       0000000990FEFEC8         24     .24      62.880
   127 library cache                       0000000990FEF658         23     .23      60.260
   127 library cache                       0000000990FEE668         22     .22      57.640
   127 library cache                       0000000990FEFA18         22     .22      57.640
...
Background
Now, a little bit of background is necessary. What started all this? As part of 10g upgrade planning, the client DBAs had set up a job to capture costly SQLs and store their execution plans from the shared pool using the dynamic_plan_table script, borrowed from Tom Kyte's dynamic_plan_table discussion. Excellent idea and wonderfully written SQL, Tom.
If you don’t know it already, Tom’s method creates the following view. The client script passes the hash_value of costly SQLs from v$sql to this view.
create or replace view dynamic_plan_table
as
select rawtohex(address) || '_' || child_number statement_id,
       sysdate timestamp, operation, options, object_node, object_owner,
       object_name, 0 object_instance, optimizer, search_columns, id,
       parent_id, position, cost, cardinality, bytes, other_tag,
       partition_start, partition_stop, partition_id, other, distribution,
       cpu_cost, io_cost, temp_space, access_predicates, filter_predicates
  from v$sql_plan
;

REM There is a wrapper script calling the following SQL, in a loop, for the top 20 SQL statements at a stretch.
REM Of course, there is more logic in that script.

select plan_table_output
from TABLE( dbms_xplan.display
        ( 'backup.dynamic_plan_table',
          (select rawtohex(address)||'_'||child_number x
             from v$sql
            where hash_value = &hv),
          'serial' ) )
/
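The client's wrapper script itself isn't reproduced here; a minimal sketch of what such a loop might look like (the top-20-by-buffer-gets ranking and the backup schema name are assumptions) is:

declare
    l_stmt_id  varchar2(64);
begin
    -- loop over the 20 most expensive statements by buffer gets (illustrative ranking)
    for r in (select hash_value
                from (select hash_value from v$sql order by buffer_gets desc)
               where rownum <= 20)
    loop
        -- build the statement_id exactly as the view does: address_childnumber
        select rawtohex(address) || '_' || child_number
          into l_stmt_id
          from v$sql
         where hash_value = r.hash_value
           and rownum = 1;

        -- print the plan captured in the shared pool for this cursor
        for p in (select plan_table_output
                    from table(dbms_xplan.display('backup.dynamic_plan_table',
                                                  l_stmt_id, 'serial')))
        loop
            dbms_output.put_line(p.plan_table_output);
        end loop;
    end loop;
end;
/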
Unfortunately, such an awesome method has side effects in version 9.2.0.8, as explained above.
At this point, I am almost sure that we need to collect statistics on fixed tables to resolve this issue. There is a fixed index on x$kqlfxpl (KQLFXPL_HADD, KQLFXPL_PHAD, KQLFXPL_HASH), and if that index is used, the effect of this problem should disappear. Since there is a plan to upgrade to 10g soon, the client didn't want to collect fixed table statistics.
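The fixed indexes defined on this x$ table can be verified from the v$indexed_fixed_column view; a quick check along these lines shows the indexed columns:

select table_name, index_number, column_name, column_position
  from v$indexed_fixed_column
 where table_name = 'X$KQLFXPL'
 order by index_number, column_position;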
So, I worked around the issue. It is not an elegant solution, but it avoids the problem by recreating the view with a hardcoded hash value for every loop iteration.
create or replace view backup.dynamic_plan_table
as
select rawtohex(address) || '_' || child_number statement_id,
       sysdate timestamp, operation, options, object_node, object_owner,
       object_name, 0 object_instance, optimizer, search_columns, id,
       parent_id, position, cost, cardinality, bytes, other_tag,
       partition_start, partition_stop, partition_id, other, distribution,
       cpu_cost, io_cost, temp_space, access_predicates, filter_predicates
  from v$sql_plan
 where hash_value = &&hv   -- Added this line..
;

select plan_table_output
from TABLE( dbms_xplan.display
        ( 'backup.dynamic_plan_table',
          (select rawtohex(address)||'_'||child_number x
             from v$sql
            where hash_value = &&hv),
          'serial' ) )
/
Now the query runs in 0.03 seconds and uses the fixed index, as the row source statistics below show.
    51  NESTED LOOPS OUTER (cr=261 r=0 w=0 time=1327 us)
    51   NESTED LOOPS OUTER (cr=131 r=0 w=0 time=892 us)
    51    FIXED TABLE FIXED INDEX X$KQLFXPL (ind:3) (cr=0 r=0 w=0 time=496 us)
    27    TABLE ACCESS BY INDEX ROWID OBJ$ (cr=131 r=0 w=0 time=327 us)
    27     INDEX UNIQUE SCAN I_OBJ1 (cr=104 r=0 w=0 time=212 us)(object id 33)
    27   TABLE ACCESS CLUSTER USER$ (cr=130 r=0 w=0 time=359 us)
    27    INDEX UNIQUE SCAN I_USER# (cr=28 r=0 w=0 time=78 us)(object id 11)
Finally
The 10g upgrade guide recommends collecting fixed table statistics, so this issue should completely disappear in 10g.
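For completeness, in 10g gathering fixed object statistics is a one-liner (run as SYS or a suitably privileged user, ideally during representative load):

SQL> exec dbms_stats.gather_fixed_objects_stats;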
radino said
I use an application context to parametrize this view:
…
where hash_value = sys_context('your_ctx', 'hash');
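A minimal sketch of this context-based approach (names such as your_ctx and plan_ctx_pkg are illustrative, not from radino's actual code) would be:

REM A context plus a trusted package that is allowed to set it.
create or replace context your_ctx using plan_ctx_pkg;

create or replace package plan_ctx_pkg as
  procedure set_hash (p_hash in number);
end plan_ctx_pkg;
/
create or replace package body plan_ctx_pkg as
  procedure set_hash (p_hash in number) is
  begin
    -- context attributes are varchar2, so store the hash value as text
    dbms_session.set_context('your_ctx', 'hash', to_char(p_hash));
  end set_hash;
end plan_ctx_pkg;
/

The view keeps the same column list as above, with the predicate changed to where hash_value = sys_context('your_ctx', 'hash'), and the wrapper calls plan_ctx_pkg.set_hash with the hash value of interest before each dbms_xplan.display call.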
orainternals said
Hi Radino
Thanks for visiting this blog. Yes, the use of a context variable would have been better than recreating the view.
Still, my preference is to resolve the root cause so that the optimizer can choose first_rows.
Cheers
Riyaj
Dion Cho said
Very interesting case.
Thanks for sharing it.
Heavy reads on the library cache related v$ views. « Dion Cho - Oracle Performance Storyteller said
[…] leave a comment » Riyaji posted a very interesting case here. […]
Martin Decker said
Hi Riyaj,
great analysis! I have used the technique before (http://www.ora-solutions.net/web/2008/10/19/formatted-sql-plan-display-in-oracle-9i/) but was not aware of its drawbacks.
Regards,
Martin
orainternals said
Thanks Martin. It was nice to meet you at hotsos 2009 dinner.
Cheers
Riyaj
Pythian Group - Blog said
[…] Shamsudeen also was sleuthing—into dynamic_plan_table, x$kqlfxpl and extreme library cache latch contention. Dion Cho, the Oracle Performance Storyteller, responded with an item about a similar experience of […]
Tanel Poder said
Hi Riyaj,
Very cool article! (and using the right approach too 😉
I’m happy to see that people find my tools useful!
Btw, you probably know this (but I didn’t see it mentioned) that you can see the function names responsible for these latch holds from LatchProfX output too, also the latch hold mode if dealing with a shared latch:
SQL> @latchprofx sid,name,laddr,ksllwnam,ksllwlbl,hmode 159 % 100000
-- LatchProfX 1.08 by Tanel Poder ( http://www.tanelpoder.com )
SID NAME LADDR KSLLWNAM KSLLWLBL HMODE Held Gets Held % Held ms Avg hold ms
---- -------------- ---------------- --------- --------- ---------- ------ ----- ------- -------- -----------
159 library cache 00000003A93513E0 kglic child exclusive 10044 9314 10.04 301.320 .032
159 library cache 00000003A9351340 kglic child exclusive 9895 8458 9.90 296.850 .035
159 library cache 00000003A9351660 kglic child exclusive 9761 8440 9.76 292.830 .035
159 library cache 00000003A9351700 kglic child exclusive 9737 8924 9.74 292.110 .033
159 library cache 00000003A9351480 kglic child exclusive 8999 7765 9.00 269.970 .035
159 library cache 00000003A93515C0 kglic child exclusive 8553 7832 8.55 256.590 .033
159 library cache 00000003A9351520 kglic child exclusive 6852 6828 6.85 205.560 .030
7 rows selected.
orainternals said
Tanel
Thanks for visiting my blog. Your tools are excellent and thanks for sharing them.
Cheers
Riyaj
Another LatchProfX use case | Tanel Poder's blog: Core IT for Geeks and Pros said
[…] Shamsudeen wrote an excellent article about systematic latch contention […]