Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internal & E-business suite.

Posts Tagged ‘oracle’

Shared pool freelists (and durations)

Posted by Riyaj Shamsudeen on August 9, 2009

My earlier blog about shared pool duration got an offline response from one of my reader:
” So, you say that durations aka mini-heaps have been introduced from 10g onwards. I have been using Steve Adams’ script shared_pool_free_lists.sql. Is that not accurate anymore?”


Shared pool free lists

I have a great respect for Steve Adams . In many ways, he has been a great virtual mentor and his insights are so remarkable.

Coming back to the question, I have used Steve’s script before and it is applicable prior to Oracle version 9i. In 9i, sub-heaps were introduced. Further, shared pool durations were introduced in Oracle version 10g. So, his script may not be applicable from version 9i onwards. We will probe this further in this blog.

This is the problem with writing anything about internals stuff, they tend to change from version to version and In many cases, our work can become obsolete in future releases(including this blog!).

Keep Reading

Posted in 11g, Oracle database internals, Performance tuning, shared_pool | Tagged: , , , , , , , , , , , | 2 Comments »

RAC, parallel query and udpsnoop

Posted by Riyaj Shamsudeen on June 20, 2009

I presented about various performance myths in my ‘battle of the nodes’ presentation. One of the myth was that how spawning parallel query slaves across multiple RAC instances can cause major bottleneck in the interconnect. In fact, that myth was direct result of a lessons learnt presentation from a client engagement. Client was suffering from performance issues with enormous global cache waits running in to 30+ms average response time for global cache CR traffic and crippling application performance. Essentially, their data warehouse queries were performing hundreds of parallel queries concurrently with slaves spawning across three node RAC instances.

Of course, I had to hide the client details and simplified using a test case to explain the myth. Looks like either a)my test case is bad or b) some sort of bug I encountered in version c) I made a mistake in my analysis somewhere. Most likely it is the last one :-(. Greg Rahn questioned that example and this topic deserves more research to understand this little bit further. At this point, I don’t have and database is in and so we will test this in


UDP is one of the protocol used for cache fusion traffic in RAC and it is the Oracle recommended protocol. In this article, UDP traffic size must be measured. Measuring Global cache traffic using AWR reports was not precise. So, I decided to use a dtrace tool kit tool:udpsnoop.d to measure the traffic between RAC nodes. There are two RAC nodes in this setup. You can read more about udpsnoop.d. That tool udpsnoop.d can be downloaded from dtrace toolkit . Output of this script is of the form:

PID        LADDR           LPORT           DR         RADDR           RPORT                 SIZE
---------- --------------- --------------- ---------- --------------- --------------- -----------
15393      38395           ->      40449                 8240

Keep Reading

Posted in Oracle database internals, Performance tuning, RAC | Tagged: , , , , , , , , , , | 15 Comments »

Dynamic_plan_table, x$kqlfxpl and extreme library cache latch contention

Posted by Riyaj Shamsudeen on March 13, 2009

We had an interesting latch contention issue with a client database worth sharing. Client complained that they are seeing increased library cache latch contention in production with no logical explanation (yet). So, client wanted us to find the root cause. This performance issue seems to have started recently and occurs intermittently and also resolves by itself. Client is planning to upgrade to 10g in 4 weeks time frame and this issue need to be resolved to avoid any delays in 10g upgrade.

From statspack analysis and various statistics analysis, it is visible that there is an increase in library cache latch contention, but nothing jumps up as a problem. Of course, this is a very active application [ 5 node RAC application] with very high number of literal SQLs but that is nothing new, it is always been this way. No new dynamic SQLs, no DDLs invalidating cursors or anything of that sort. I started watching production instance, in real time, waiting for the re-occurrence of this issue. It occurred!

latchprofx for rescue

At this very juncture, instance was suffering from latch contention and It was visible that sessions were starting to queue up for latches.

I was ready with necessary tools. Tanel Poder has written a great tool latchprofx to sample v$latch_holder with high frequency [ Actually, it uses x$ tables but you get the idea ]. In this case, high frequency sampling is a must since this performance problem disappears in few minutes or so. Executed latcprofx script for library cache latches and 10000 samples as printed below.

SQL>  @/tmp/latchprofx  sid,name % library 10000

-- LatchProfX 1.06 by Tanel Poder ( )

  SID NAME                                      Held  Held %     Held ms
------ ----------------------------------- ---------- ------- -----------
   140 library cache                             9555   95.55   25167.870
   185 library cache                              198    1.98     521.532
   245 library cache                               97     .97     255.498
   185 library cache pin                           69     .69     181.746
   245 library cache pin                           39     .39     102.726
   240 library cache                               33     .33      86.922
   240 library cache pin                            6     .06      15.804
   227 library cache                                2     .02       5.268
   227 library cache pin                            1     .01       2.634

Yikes. We can see that one session holding library cache latch in 95.5% of samples. That seems bit odd.

Of course, there are many latch children for library cache latches. Is that session repeatedly acquiring just one latch children? Or is it acquiring many different latch children? More importantly what is that session executing ?

Latchprofx to the rescue again. Latchprofx can also print latch children address. Re-executing latcprofx script with modified second argument to the call. Only difference between these two calls is that first argument to latchprofx includes laddr now.

@/tmp/latchprofx  sid,name,laddr % library 10000

   SID NAME                                LADDR                  Held  Held %     Held ms
------ ----------------------------------- ---------------- ---------- ------- -----------
   140 library cache                       0000000990FEFA18        298    2.98     783.442
   140 library cache                       0000000990FED2B8        270    2.70     709.830
   140 library cache                       0000000990FEF298        254    2.54     667.766
   140 library cache                       0000000990FEF658        254    2.54     667.766
   140 library cache                       0000000990FEECF8        240    2.40     630.960
   140 library cache                       0000000990FF0198        236    2.36     620.444
   140 library cache                       0000000990FEE488        225    2.25     591.525
   140 library cache                       0000000990FEFCE8        223    2.23     586.267
   140 library cache                       0000000990FEEED8        222    2.22     583.638
   140 library cache                       0000000990FEF388        222    2.22     583.638
   140 library cache                       0000000990FEDEE8        220    2.20     578.380
   140 library cache                       0000000990FF00A8        216    2.16     567.864
   140 library cache                       0000000990FEE758        215    2.15     565.235
   140 library cache                       0000000990FEEFC8        212    2.12     557.348
   140 library cache                       0000000990FEE938        211    2.11     554.719
   140 library cache                       0000000990FEE668        205    2.05     538.945
   140 library cache                       0000000990FEDB28        201    2.01     528.429

Whoa! This session was acquring nearly all latch children. Notice that latch addresses are different in every row output. Meaning, various samples of latchprofx saw that this session is holding a different latch children. In a nutshell, this session was acquiring and releasing ALL library cache latches ( aka children) in quasi sequential fashion and causing massive performance issue. There are 53 latch children for library cache latch (in this database). Column held% is approximately 2%, indicating all library cache latch children were acquired and released in a quasi-uniform distributed(2% times 50), quasi-cyclical fashion! Same pattern repeates if I reduce or increase samples.

Latch greedy session

Initially, I thought, this is a bug. I have not seen a case where all library cache latch children were needed to execute a SQL or a process causing massive latch contention. [ Okay, I confess, I have simulated that kind of issue in my test database holding library cache latch children using oradebug kslgetl stuff, but that is a different topic altogether]. Session 140 was executing a SQL printed below. This SQL is querying execution plan from shared_pool for a specific hash_value in this 9i database( Equivalent to dbms_xplan.display_cursor stuff in 10g.]

select plan_table_output from TABLE( dbms_xplan.display ( 'backup.dynamic_plan_table',
                (select rawtohex(address)||'_'||child_number x from v$sql where hash_value=&hv ), 'serial' ) )

Turned on sqltrace on that session and tkprof showed that recursive SQL generated from dbms_xplan package as a costly SQL.

REM Removing few lines to improve readability.
SELECT /* EXEC_FROM_DBMS_XPLAN */ id, position, level , operation, options, object_name , cardinality, bytes, temp_space,
        cost, io_cost, cpu_cost ,decode(partition_start, 'ROW LOCATION',
    		from BACKUP.DYNAMIC_PLAN_TABLE start  with id = 0
            and  timestamp >= 
		(select max(timestamp) from BACKUP.DYNAMIC_PLAN_TABLE where id=0  and 
			statement_id = '00000009BB9B6D50_0' and nvl(statement_id, ' ') not like 'SYS_LE%')
            and  nvl(statement_id, ' ') not like 'SYS_LE%' and statement_id = '00000009BB9B6D50_0' connect by (prior id = parent_id
                      and prior nvl(statement_id, ' ') =
                                nvl(statement_id, ' ')
                      and prior timestamp <= timestamp)
                  or (prior nvl(object_name, ' ') like 'SYS_LE%'
                      and  prior nvl(object_name, ' ') =
                                 nvl(statement_id, ' ')
                      and id = 0 and prior timestamp <= timestamp)
          order siblings by id

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.03       0.02          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch       30     53.79      55.90        423     375837          0          29
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       32     53.82      55.93        423     375837          0          29


55 seconds to execute this SQL? That seems bit excessive. This SQL is the only time-consuming recursive SQL generated from the select statement. So, somehow, this must be causing latch contention. I will print only relevent lines from the plan output.

|   4 |     NESTED LOOPS OUTER           |             |     1 |  4479 |    21   (5)|
|   5 |      NESTED LOOPS OUTER          |             |     1 |  4475 |    20   (5)|
|*  6 |       FIXED TABLE FULL           | X$KQLFXPL   |     1 |  4444 |            | <---- FTS
|   7 |       TABLE ACCESS BY INDEX ROWID| OBJ$        |     1 |    31 |     3  (34)|
|*  8 |        INDEX UNIQUE SCAN         | I_OBJ1      |     1 |       |     2  (50)|
|   9 |      TABLE ACCESS CLUSTER        | USER$       |     1 |     4 |     2  (50)|
|* 10 |       INDEX UNIQUE SCAN          | I_USER#     |     1 |       |            |

   6 - filter("P"."INST_ID"=:B1 AND "P"."KQLFXPL_HADD""P"."KQLFXPL_PHAD")

Line #6 is the problem. This table x$kqlfxpl is accessed using Full table scan. Looks like, x$kqlfxpl can NOT be accessed without holding library cache latches. Since there is a FTS on this fixed table, almost every library cache object need to be touched which potentially means all library cache objects must be inspected. Further, accessing this x$ table means library cache buckets (_kgl_bucket_count controlled area) need to be walked, latches acquired and released before that memory area is inspected. Makes sense!

Repeated calls to kqlfxp, kglic, kglic0 also visible in the stack output. These calls suggests that library cache latches were acquired and released. Systemstate dumps confirms that too.

I was also able to reproduce this issue querying x$kqlfxpl in a cloned database.

SQL>  select /*+ full(a) */  count(*) from x$kqlfxpl a;


| Id  | Operation            |  Name       | Rows  | Bytes | Cost (%CPU)|
|   0 | SELECT STATEMENT     |             |     1 |       |    18   (6)|
|   1 |  SORT AGGREGATE      |             |     1 |       |            |
|   2 |   FIXED TABLE FULL   | X$KQLFXPL   |   100 |       |            |

Running latchprofx script from another session concurrently shows that accessing this x$ table will result in excessive library cache latching activity. Row estimates column for this table shows 100 which is a CBO default cardinality estimate if there are no statistics on a table.

@/tmp/latchprofx  sid,name,laddr % library 10000

   SID NAME                                LADDR                  Held  Held %     Held ms
------ ----------------------------------- ---------------- ---------- ------- -----------
   127 library cache                       0000000990FED2B8         26     .26      68.120
   127 library cache                       0000000990FEFEC8         24     .24      62.880
   127 library cache                       0000000990FEF658         23     .23      60.260
   127 library cache                       0000000990FEE668         22     .22      57.640
   127 library cache                       0000000990FEFA18         22     .22      57.640


Now, a little bit background is necessary. What started all this? As a part of 10g upgrade planning, Client DBAs had setup a job to capture costly SQLs and store execution plans from shared_pool using dynamic_plan_table script, borrowed from Tom kyte dynamic_plan_table discussion. Excellent idea and wonderfully written sql, Tom.

If you don’t know already, Tom’s method creates following view. Client script passes the hash_value of costly SQLs from v$sql to this view.

create or replace view dynamic_plan_table
     rawtohex(address) || '_' || child_number statement_id,
     sysdate timestamp, operation, options, object_node,
     object_owner, object_name, 0 object_instance,
     optimizer,  search_columns, id, parent_id, position,
     cost, cardinality, bytes, other_tag, partition_start,
     partition_stop, partition_id, other, distribution,
   cpu_cost, io_cost, temp_space, access_predicates,
    from v$sql_plan

REM there is a wrapper script calling following SQL, in a loop, for top 20 SQL statements at a stretch. 
REM Of course, there is more logic in that script.

select plan_table_output from TABLE( dbms_xplan.display ( 'backup.dynamic_plan_table',
                (select rawtohex(address)||'_'||child_number x from v$sql where hash_value=&hv ), 'serial' ) )

Unfortunately, such an awesome method has side effects in version, as explained above.

At this point, I am almost sure that we need to collect statistics on fixed tables to resolve this issue. There is a fixed index on x$kqlfxpl (KQLFXPL_HADD, KQLFXPL_PHAD,KQLFXPL_HASH) and if that index is used, effect of this problem should disappear. Since there is a plan to upgrade to 10g soon, client didn’t want to collect fixed statistics.

So, I worked around the issue. Not an elegant solution, but it avoids problem by recreating the view with hardcoded hash value for every loop.

create or replace view backup.dynamic_plan_table
     rawtohex(address) || '_' || child_number statement_id,
     sysdate timestamp, operation, options, object_node,
     object_owner, object_name, 0 object_instance,
     optimizer,  search_columns, id, parent_id, position,
     cost, cardinality, bytes, other_tag, partition_start,
     partition_stop, partition_id, other, distribution,
   cpu_cost, io_cost, temp_space, access_predicates,
    from v$sql_plan 
   where hash_value=&&hv    -- Added this line..

select plan_table_output from TABLE( dbms_xplan.display ( 'backup.dynamic_plan_table',
                (select rawtohex(address)||'_'||child_number x from v$sql where hash_value=&&hv ), 'serial' ) )

Now, it is evident that query runs in 0.03 seconds and also used fixed index.

     51     NESTED LOOPS OUTER (cr=261 r=0 w=0 time=1327 us)
     51      NESTED LOOPS OUTER (cr=131 r=0 w=0 time=892 us)
     51       FIXED TABLE FIXED INDEX X$KQLFXPL (ind:3) (cr=0 r=0 w=0 time=496 us)
     27       TABLE ACCESS BY INDEX ROWID OBJ$ (cr=131 r=0 w=0 time=327 us)
     27        INDEX UNIQUE SCAN I_OBJ1 (cr=104 r=0 w=0 time=212 us)(object id 33)
     27      TABLE ACCESS CLUSTER USER$ (cr=130 r=0 w=0 time=359 us)
     27       INDEX UNIQUE SCAN I_USER# (cr=28 r=0 w=0 time=78 us)(object id 11)


10g upgrade guide recommends collecting fixed table statistics. So, this issue should completely disappear in 10g.

Posted in CBO, Oracle database internals, Performance tuning | Tagged: , , , , , , , , | 11 Comments »

Systemstate dump analysis: Nocache on high intensive sequences in RAC

Posted by Riyaj Shamsudeen on March 10, 2009

Sequence values are cached in instance memory and by default 20 values are cached. As instance cache is transient, loss of an instance can result in loss of cached sequence values. Permanent record of highest possible value from any instance is kept track in SEQ$ table.

SQLs accessing this sequence within an instance will access instance SGA. As each instance caches its own sequence values it is highly likely that SQLs accessing this sequence from different instance will create gaps in sequence values.

Common knee-jerk reaction to this issue is to set nocache or cache 1 for these sequences. In a single instance environment, this approach will backfire due to massive updates to SEQ$ tables, buffer busy waits, latch free waits etc. In a RAC environment, this issue is magnified and almost hangs the instance. I had the privilege of working with a client to resolve one of their performance issues.


Instances were hung. It was not possible to login to the database. Many existing connections are working fine though. We were lucky enough that one of the DBAs had active connection to the database. So, we took systemstate dump from that session to see whey there is an hang (or slowness).

Alert log was printing following lines at this time frame. So, we know that there is a problem with row cache enqueue.

Wed Feb 18 06:31:31 2008
Wed Feb 18 08:59:31 2008
Wed Feb 18 08:59:31 2008
Wed Feb 18 09:01:00 2008
Wed Feb 18 09:01:00 2008

systemstate dump

We took a systemstate dump with following command.

alter session set events 'immediate trace name systemstate level 4';

Continue Reading

Posted in Oracle database internals, Performance tuning | Tagged: , , , , | 18 Comments »

High global cache waits on tab$ ?

Posted by Riyaj Shamsudeen on February 25, 2009

Yes, you read it correct. That shocked me too.

I was trying to understand global cache waits for a client. Naturally, I queried statspack tables and analyzed the data for just one day using a script. Surprisingly, tab$ came as top consumer of global cache waits. I was shocked by the revelations and couldn’t believe it! If something doesn’t make sense, look more closely, right?

Global cache waits

Database version is 9i. Statspack table stats$seg_stat is the source for this script and that table is populated from v$segment_stats. So, these column values (global_cache_cu_blocks_served and global_cache_cr_blocks_served) are cumulative. To find statistics for a specific period, we need to subtract column value of prior row from current row. Analytic function lag can be useful for this.

In this analytic function printed below, partitioning clause uses instance_number, startup_time and dataobj#. All rows with same value for these three columns will be considered in one data partition and then rows ordered by snap_id within that data partition. Lag will pull the row from prior snap_id in that partition. Then, we subtract from current value to get the difference. Please refer to this paper: Performance tuning with SQL new features – paper for more information about analytic functions.

   global_cache_cu_blocks_served -
   lag(global_cache_cu_blocks_served,1,0) over (partition by instance_number,startup_time, dataobj#, obj#
              order by  snap_id ) global_cache_cu_blocks_served,

Script and output

Complete script printed below and running that against a database.

Continue Reading

Posted in EBS11i, Oracle database internals, Performance tuning | Tagged: , , , , | 6 Comments »

Lock table followed by DDL?

Posted by Riyaj Shamsudeen on January 8, 2009

My client DBA wrote a small script: To drop a column to a busy table. This application is mostly an OLTP application and table is modified heavily. It is hard to get a lock on the table, even though DDL to drop the column will be quick completing within sub-seconds. Following script was tried in the middle of night so as to minimize production impact.

set serveroutput on size 100000
      -- lock table in exclusive mode 
      execute immediate 'lock table t1 in exclusive mode';
      dbms_output.put_line ('**** Table locked. Dropping column ****' ) ;
      -- If we lock it, drop the column
      execute immediate 'alter table t1 drop column n5 ';
  when others then
     dbms_output.put_line(' No luck ');
     dbms_output.put_line(' SQL code'||sqlcode||','|| 'error '||sqlerrm );

Logic of the above script is a) wait for 5 minutes to lock the table in exclusive mode, before timing out. b) If we acquire exclusive lock on that table, then add a column.

Do you see any issues with this script?

So, did we add that column ?
Keep Reading

Posted in 11g, Performance tuning | Tagged: , , , , , , | 16 Comments »

Cardinality feedback to resolve a Cache buffers chains latch contention issue

Posted by Riyaj Shamsudeen on January 2, 2009

Earlier, I blogged about resolving cache buffers chains latch contention in my earlier entry , in which, root cause was excessive index access due to Nested Loops join. Recently, we resolved another similar issue.


CPU usage was very high in production database server in user% mode. Dynamic performance view v$session_wait indicated excessive waits for latch contention. Output from a script wait_details.sql (scripts available at the end of this post) shows that many sessions were waiting for ‘latch free’ event. Also, address for these latch children are the same, meaning all these sessions are trying to access one latch children.

SQL> @wait_details

   SID PID     EVENT         USERNAME  STATE               WAIT_TIME   WIS P1_P2_P3_TEXT
------ ------- ------------- --------- ---------- ------------------- --------- ----- -----------------------------------
    91  24242  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   101   4884  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   116  23899  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   187  19499  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   108  23498  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 3
   194  23701  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   202  26254  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 4
   220  23274  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   227  23643  latch free    CSTMOP    WAITED KNOWN TIME           2     0 address 69476807024-number 98-tries 0
   331  26519  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 0
   297  23934  latch free    CSTMOP    WAITING                     0     0 address 69476807024-number 98-tries 3

We can identify SQL causing latch contention querying v$session_wait. From the output below, SQL with hash_value 1509082258 is suspicious since there are many sessions executing that SQL and waiting / waited recently for ‘latch free’ event.

select substr(w.event, 1, 28 ) event, sql_hash_value, count(*)
from v$session_wait w, v$session s, v$process p
where s.sid=w.sid
and p.addr = s.paddr
and s.username is not null
and event not like '%pipe%'
and event not like 'SQL*%'
group by substr(w.event, 1, 28), sql_hash_value

EVENT                          SQL_HASH_VALUE   COUNT(*)
------------------------------ -------------- ----------
enqueue                               3740270          1
enqueue                             747790152          1
enqueue                            1192921796          1
latch free                          622474477          3
latch free                         1509082258         58 <---
latch free                         1807800540          1
global cache null to x                3740270          1
global cache null to x             1473456670          1
global cache null to x             3094935671          1
db file sequential read             109444956          1

Mapping to object_name
Keep Reading

Posted in CBO, EBS11i, Performance tuning | Tagged: , , , , , , | 7 Comments »

Log file synch tuning #2

Posted by Riyaj Shamsudeen on November 24, 2008

After reading my earlier blog about log file sync tuning a reader sent an interesting issue, worth blogging about it.


Excessive wait time for ‘log file sync’ event while wait time for ‘log file parallel write’ is minimal. See statspack top events below


Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time                                                       18,435    31.06
PL/SQL lock timer                                   4,370      12,137    20.45
log file sync                                      57,939       8,565    14.43  <-----
db file sequential read                         1,303,477       7,030    11.85
db file scattered read                            505,147       3,992     6.73

                                                     Total Wait   wait    Waits
Event                               Waits   Timeouts   Time (s)   (ms)     /txn
---------------------------- ------------ ---------- ---------- ------ --------
log file sync                      57,939      6,571      8,565    148      0.5
log file parallel write           215,112          0      1,138      5      1.9

Average wait time for ‘log file sync’ event is 148ms and it is 5ms for ‘log file parallel write’ event. There is an oddity here. Wait time for ‘log file sync’ event seems bit excessive compared to ‘log file parallel write’ event. Ratio between number of waits for ‘log file sync’ and ‘log file parallel write’ is approximately 4. In an average case scenario, if each commit resulted in 4 log writes, then I would expect an Average wait time of 20ms for ‘log file sync’ event. Wait time of 148ms? Needs further analysis.

pstack, pmap and more

Database is running in a Solaris 10 server and database version Since this is a Solaris operating environment, we can use proc utilities to debug this further. We used a script to dump pstack and pmap of LGWR in a loop. This wasn’t much help, but we could guess that LGWR was waiting for I/O since there were many lines in the stack with kaio and aio wait calls. Few lines from pstack output printed below.

6924:	ora_lgwr_ERPPRD2
-----------------  lwp# 1 / thread# 1  --------------------
 ffffffff7ddd4838 semsys   (4, 3, ffffffff7fffd5cc, 1, ffffffff7fffd5b8)
 00000001005b2694 ksliwat (0, 3c0, 77e8a7790, 1037c0b30, 3, 0) + 274
-----------------  lwp# 2 / thread# 2  --------------------
 ffffffff7ddd3a34 kaio     (6, 0, ffffffff7def7f80, 10, ffffffff7def9c18, ffffffff7f700a00)
-----------------  lwp# 3 / thread# 3  --------------------
 ffffffff7ddd2f1c lwp_park (0, 0, 0)
 ffffffff7ddcc748 cond_wait_queue (1038e5a70, 1038e5a80, 0, 0, ffffffff7def9d08, 0) + 28
 ffffffff7ddccca8 cond_wait (1038e5a70, 1038e5a80, 0, 0, 1c00, 0) + 10
 ffffffff7db03a60 _aio_idle (1038e5a00, 1, 1038e5a80, 0, ffffffff7def9c18, 0) + 28
 ffffffff7db034fc _aio_send_sigev (1038e5a00, 0, 104b60, ffffffff7ddd2cc8, ffffffff7db03498, ffffffff7dc08000) + 64


I/O statistics did not provide much clarity either. We need to find if LGWR is suffering from a performance issue. To see if LGWR is suffering from any OS related issues, we need to trace system calls from LGWR and Truss utility provides such a facility. Suffering from truss related paranoia, we didn’t want to run truss against LGWR since that can affect performance, more importantly database stability.

Fortunately, dtruss came handy. dtruss is based upon dtrace utility and by design, dtrace is safe in Solaris Sparc platform.

Keep Reading

Posted in Oracle database internals, Performance tuning | Tagged: , , , , , | 14 Comments »

Performance issue: High Kernel mode CPU usage

Posted by Riyaj Shamsudeen on October 31, 2008

Recently, we resolved a performance issue with one of our esteemed clients. Problem was interesting and worth blogging about it.

An application makes 300 static connections to database DB1 in database server, say CLNTDB server. Application relies on database link and over a period of time, each session executes SQL through a database link, creating a connection in central database PROD1. So, there are 300 connections in PROD1 database coming from DB1 database through database links. Performance is fine during normal operation.

Problem starts when the application is shutdown. Shutting down the application in DB1, creates massive CPU consumption in PROD1DB server. Unfortunately, this spike in CPU usage lasts for 5-10 seconds and causes ASM heartbeat to fail. Considering that PROD1 is a central database and this has a global effect on application(s) functionality. See presentation below for graphical representation of this problem.

Looking at symptoms in detail, it is visible that CPU usage in sys mode. It is not uncommon to see high CPU usage during a storm of disconnects. But, this specific CPU usage is much higher and uses all CPUs in kernel mode. If this is a problem due to process destruction, then this will show up in CLNTDB server too as there are 300 disconnects there too. But, this problem manifests only in PROD1DB server.

mpstat output from PROD1DB server shows that CPU usage in %sys mode. Notice the numbers below under sys column. almost all CPUs are used in sys mode. We need to drill down further to understand why this much CPUs are used in sys mode. It is also visible that no other columns have any abnormal higher values.

CPU ..intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    554  237  651  219   87  491    0  4349    9  91   0   0
  1    911    0 2412  591  353  630    0 15210   30  63   0   7
  2    313    0  613  106  190  105    0  3562    8  90   0   2
  3    255    0  492   92  161  530    0  2914    6  92   0   2
  4     86    1    2    3    1   63    0     8    0 100   0   0
  5    662    0 1269  153  326  211    0  6753   13  77   0  10
  6    349    0  589   54  170 1534    0  3002    7  88   0   5
  7    317   81  411  105   51   61    0  2484    4  93   0   3
  8   6423 6321  697   36  148  546    0  3663    6  86   0   9
  9    363    0  639   41  154 1893    0  3214    9  85   0   6
 10    456    0  964  194  193   77    0  5797   10  81   0   8
 11    104    0   42    3    9  183    0   207    0  95   0   5
 12    195    0  279  110   31   80    0  1590    3  97   0   0
 13    449    0  844  117  158  127    0  4486    7  85   0   8

Test case
Let’s analyze this issue for a single connection to simplify the problem. Created a test user in DB1 database, created a private database link from DB1 to PROD1 and then executed a select statement over that database link.

In db1:

create user test1 identified by test1;
grant connect, resource to test1;
conn test1/test1
create dtabase ink proddb
connect to test1 identified by test1
using 'proddb';

select * from dual@proddb;

At this point, a database link connection has been created in PROD1 database. Querying v$session in PROD1 database to find session created for database link. SID 4306 is the session created for our test connection from db1 database.

select sid, serial#, LOGON_TIME,LAST_CALL_ET from v$session where
  logon_time > sysdate-(1/24)*(1/60) and machine='machine_name_here'
---------- ---------- -------------------- ------------
      4306      51273 12-SEP-2008 20:47:29            1

Associating this session to v$process, found UNIX process for this session. Now, we will start truss on this UNIX process( Solaris platform).

truss -p -d -o /tmp/truss.log

In DB1 executed: exit

This will disconnect session in PROD1 database.

Truss output
Truss output contains activity generated for disconnect in PROD1 server. Reading through truss output, we can see that system call (shmdt) consumed CPU time. Each shmdt call consumes approximately 24ms from output below (18.5053-18.4807=0.0242). We are using -d flag as above to print time spent in that call. Of course, this is a system call and so this CPU usage will be in kernel mode.

18.4630 close(10)                                       = 0
18.4807 shmdt(0x380000000)                              = 0
18.5053 shmdt(0x440000000)                              = 0
18.5295 shmdt(0x640000000)                              = 0
18.5541 shmdt(0x840000000)                              = 0
18.5784 shmdt(0xA40000000)                              = 0
18.6026 shmdt(0xC40000000)                              = 0
18.6273 shmdt(0xE40000000)                              = 0
18.6512 shmdt(0x1040000000)                             = 0
18.6752 shmdt(0x1240000000)                             = 0
18.6753 shmdt(0x1440000000)                             = 0

So, one disconnect executes 10 system calls and consumes approximately 0.24 CPU seconds in kernel mode. Shmdt calls are used to detach from shared memory segment. Since there are 10 shared memory segments ( as visible in ipcs -ma), 10 shmdt calls are executed per session disconnect.

Projecting this calculation for 300 connections, CPU consumption will be for approximately 72 seconds in total. With 16 CPUs, at least, 5-6 seconds will be used in kernel mode, assuming linear projections[ But, in practice, due to mutex calls and such this may not be linear and will be over 72 seconds]. This is matching with our observation: 5-10 seconds of CPU consumption in kernel mode. First thing, we need to do is reduce # of shmdt calls. One way is to reduce # of shared memory segments.

Shared memory segments
We thought that SHMMAX kernel parameter is limiting since SGA size was bigger than SHMMAX size. After changing SHMMAX parameter, restarting server and such, still many shared memory segments were created.

That’s interesting! TRUSSing instance startup to understand why multiple shared memory segments are created.

Few lines from truss output shows calls to system calls _lgrpsys and pset_bind.

4.5957 munmap(0xFFFFFD7FFDAE0000, 32768)               = 0
4.5958 lgrp_version(1, )                               = 1
4.5958 _lgrpsys(1, 0, )                                = 42
4.5958 _lgrpsys(3, 0x00000000, 0x00000000)             = 19108
4.5959 _lgrpsys(3, 0x00004AA4, 0x06399D60)             = 19108
4.5959 _lgrpsys(1, 0, )                                = 42
4.5960 pset_bind(PS_QUERY, P_LWPID, 4294967295, 0xFFFFFD7FFFDFB11C) = 0
4.5960 pset_info(PS_MYID, 0x00000000, 0xFFFFFD7FFFDFB0D4, 0x00000000) = 0
4.5961 pset_info(PS_MYID, 0x00000000, 0xFFFFFD7FFFDFB0D4, 0x061AA2B0) = 0

_lgrpsys calls indicates that there is some form of NUMA activity going on here. pset_bind is used to bind a thread or process to a specific processor set.

Enter NUMA.

NUMA or Locality groups

Keep Reading

Posted in Oracle database internals, Performance tuning | Tagged: , , , , , , , | 13 Comments »

Tuning ‘log file sync’ wait events

Posted by Riyaj Shamsudeen on July 7, 2008

In this blog entry, we will discuss strategies and techniques to resolve ‘log file sync’ waits. This entry is intended to show an approach based upon scientific principes, not necessarily a step by step guide. Let’s understand how LGWR is inherent in implementing commit mechanism first.

Commit mechanism and LGWR internals

At commit time, process creates a redo record [ containing commit opcodes] and copies that redo record in to log buffer. Then that process signlas LGWR to write contents of log buffer. LGWR writes from log buffer to log file and signals user process back completing a commit. Commit is considered successful after LGWR write is successful.

Of course, there are minor deviation from this general concept such as latching, commits from plsql block or IMU based commit generation etc. But general philosophy still remains the same.

Signals, semaphore and LGWR

Following section introduces internal workings of commit and LGWR interation in unix platform. There are minor implementation
differences between few unix flavors or platform like NT/XP such as use of post wait drivers instead of semaphores etc. This section is to introduce internals, not necessarily dive deep in to internals. Truss is used to trace LGWR and user process to explain here.

truss command used: truss -rall -wall -fall -vall -d -o /tmp/truss.log -p 22459

[ Word of caution, don’t truss LGWR or any background process unless it is absolutely necessary. You can
accidentally cause performance issues, worse yet, shutdown database.]

1. Intiailly, LGWR is sleeping on semaphore using semtimedop or semop call.

22459/1: semtimedop(9, 0xFFFFFD7FFFDFE648, 1, 0xFFFFFD7FFFDFE488) EAGAIN
22459/1:         semnum=15    semop=-1    semflg=0
22459/1:         timeout: 2.060000000 sec
    In the above call,
     9 is semaphore set id visible through ipcs command and semnum=15 is the semaphore for LGWR process in that set.
     next argument is a structure sembuf
	{ unsigned short sem_num;  /* semaphore number */
	  short          sem_op;   /* semaphore operation */
	  short          sem_flg;  /* operation flags */
	third argument is # of semaphores

2. When a session commits, a redo record created and copied in to log buffer. Then that process posts LGWR semaphore
using a semctl call, if LGWR is not active already. Then, process goes to sleep with semtimedop call, in its own semaphore.
Semaphore set id is 9, but semnum is 118 which is for the user process I was tracing.

First semctl calls is posting LGWR. Then process is sleeping on semtimedop call.

27396/1: semctl(9, 15, SETVAL, 1)        = 0
27396/1:semtimedop(9, 0xFFFFFD7FFFDFC128, 1, 0xFFFFFD7FFFDFBF68)=0
27396/1:        semnum=118   semop=-1    semflg=0
27396/1:        timeout: 1.000000000 sec

3. Waiting log writer gets a 0 return code from semtimedop and writes redo records to current redo log file. kaio calls are kernalized asynchronous I/O calls in Solaris platform.

22459/7: pwrite(262, "01 "9E0E i ?".., 1024, 1915904) = 1024
22459/9: pwrite(263, "01 "9E0E i ?".., 1024, 1915904) = 1024
22459/1: kaio(AIOWAIT, 0xFFFFFD7FFFDFE310)               = 1
22459/1:         timeout: 600.000000 sec
22459/9: kaio(AIONOTIFY, 0)                         = 0
22459/7: kaio(AIONOTIFY, 0)                         = 0

4. After successful completion of write(s), LGWR Posts semaphore of waiting process using semctl command.

22459/1: semctl(9, 118, SETVAL, 1)                   = 0

5. User process/Session continues after recieving a return code from semtimedop call, reprinted below.


So, what exactly is ‘log file sync’ wait ?

Commit is not complete until LGWR writes log buffers including commit redo recods to log files. In a nutshell, after posting LGWR to write, user or background processes waits for LGWR to signal back with 1 sec timeout. User process charges this wait time as ‘log file sync’ event.

In the prior section, ‘log file sync’ waits starts at step 2 after semctl call and completes after step 5 above.

Root causes of ‘log file sync’ waits

Root causes of ‘log file sync’, essentially boils down to few scenarios and following is not an exhaustive list, by any means!

1. LGWR is unable to complete writes fast enough for one of the following reasons:
Keep Reading

Posted in Performance tuning | Tagged: , , , | 61 Comments »