Recently, I had an opportunity to tune latch contention on cache buffers chains (CBC) latches. The problem statement was high CPU usage combined with poor application performance. A quick review of a 15-minute statspack report showed latch free as the top wait event, consuming approximately 3600 seconds on an 8-CPU server. CPU usage was also quite high, which is a typical symptom of latch contention because of the spinning involved. v$session_wait showed that hundreds of sessions were waiting on the latch free event.
SQL> @waits10g

   SID     PID EVENT        P1_P2_P3_TEXT
------ ------- ------------ --------------------------------------
   294   17189 latch free   address 15873156640-number 127-tries 0
   628   17187 latch free   address 15873156640-number 127-tries 0
....
   343   17191 latch free   address 15873156640-number 127-tries 0
   599   17199 latch: cache address 17748373096-number 122-tries 0
               buffers chains
   337   17214 latch: cache address 17748373096-number 122-tries 0
               buffers chains
.....
   695   17228 latch: cache address 17748373096-number 122-tries 0
               buffers chains
....
   276   15153 latch: cache address 19878655176-number 122-tries 1
               buffers chains
We will use a two-pronged approach to find the root cause scientifically. First, we will find the SQL suffering from latch contention and the objects associated with the access plan of that SQL. Next, we will find the buffers involved in the latch contention and map them back to objects. Finally, we will match the results of these two techniques to pinpoint the root cause.
Before we go any further, let’s quickly summarize the internals of latch operations.
Brief Introduction to CBC latches and not-so-brief reason why this is a complicated topic to discuss briefly
Latches are internal memory structures that coordinate access to shared resources. Locks, aka enqueues, are different from latches. The key difference is that enqueues, as the name suggests, provide a FIFO queueing mechanism, whereas latches do not. In addition, latches are held very briefly, while locks are usually held longer.
In the Oracle SGA, the buffer cache is the memory area into which data blocks are read. [If ASMM (Automatic Shared Memory Management) is in use, then part of the shared pool can be tagged as KGH: NO ACCESS and remapped to the buffer cache area too.]
Each buffer in the buffer cache has an associated element in the buffer header array, externalized as x$bh. Buffer headers keep track of various attributes and the state of the buffers in the buffer cache. This buffer header array is allocated in the shared pool. The buffer headers are chained together in doubly linked lists, each of which is linked to a hash bucket. There are many hash buckets (the number of buckets is derived and governed by the _db_block_hash_buckets parameter). Access (both inspection and change) to these hash chains is protected by cache buffers chains latches.
Further, buffer headers can be linked to and delinked from hash buckets dynamically.
A simple algorithm to access a buffer is as follows (I have deliberately cut out details so as not to deviate too much from our primary discussion):
1. Hash the data block address (DBA: a combination of tablespace, file_id and block_id) to find the hash bucket.
2. Get the latch protecting that hash bucket.
3. If the latch get succeeds, walk the hash chain, reading buffer headers to see whether a specific version of the block is already in the chain.
4. If the latch get does not succeed, spin for spin_count times and go to step 2.
5. If the latch is still not acquired after spinning, sleep with an exponentially increasing back-off time and go to step 2.
6. If the block is found during the walk in step 3, access the buffer in the buffer cache, under the protection of buffer pin/unpin actions.
7. If the block is not found, find a free buffer in the buffer cache, unlink the buffer header for that buffer from its current chain, link that buffer header to this hash chain, release the latch, and read the block into that free buffer with the buffer header pinned.
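The steps above can be sketched as a toy model in Python. This is only an illustration under loud assumptions, not Oracle's implementation: the class ToyBufferCache, the constants SPIN_COUNT and N_BUCKETS, and the read_block callback are all hypothetical names; one threading.Lock per bucket stands in for a CBC latch; buffer pinning is left out; and, unlike the description above, the toy reads the block while still holding the latch to keep the code short.

```python
import threading
import time

SPIN_COUNT = 2000   # hypothetical stand-in for the spin count
N_BUCKETS = 8       # toy size; a real instance has far more buckets


class ToyBufferCache:
    """Toy model: one lock ("latch") and one chain per hash bucket."""

    def __init__(self, n_buckets=N_BUCKETS):
        self.latches = [threading.Lock() for _ in range(n_buckets)]
        self.chains = [dict() for _ in range(n_buckets)]  # dba -> buffer
        self.sleeps = 0  # rough counter, not thread-safe, for illustration

    def bucket_for(self, dba):
        # Step 1: hash the data block address to find the bucket.
        return hash(dba) % len(self.chains)

    def get_latch(self, latch):
        # Steps 2, 4 and 5: try the latch, spin, then sleep with back-off.
        sleep_time = 0.0001
        while True:
            for _ in range(SPIN_COUNT):
                if latch.acquire(blocking=False):
                    return
            self.sleeps += 1
            time.sleep(sleep_time)
            sleep_time = min(sleep_time * 2, 0.05)  # exponential back-off

    def get_buffer(self, dba, read_block):
        bucket = self.bucket_for(dba)
        latch = self.latches[bucket]
        self.get_latch(latch)
        try:
            chain = self.chains[bucket]
            # Step 3: look for the block in the chain.
            if dba not in chain:
                # Not found: link a new entry (simplified, read under latch).
                chain[dba] = read_block(dba)
            return chain[dba]
        finally:
            latch.release()
```

Calling get_buffer twice for the same address reads the block only once; the second call finds it on the chain, but it still has to get the latch first, which is why a handful of very popular buffers can drive up latch activity.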
Clearly, latches play a crucial role in controlling access to critical resources such as hash chains. My point is that repeated access to a few buffers can increase latch activity.
There are many CBC latch children. The parameter _db_block_hash_latches controls the number of latches, and its value is derived from the buffer cache size. Further, in Oracle 10g, sharable latches are used, and inspecting a hash chain requires acquiring the latch in share mode, which is compatible with other share-mode operations. Note that the default values of these undocumented parameters are usually sufficient, and any change to them must be approved by Oracle Support.
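To illustrate why repeated access to a few buffers concentrates latch activity, here is a small counting sketch. Everything in it is an assumption made for illustration: the bucket and latch counts are invented, the hash is a toy function rather than Oracle's, and buckets are assumed to be spread over a smaller number of latch children by simple modulo.

```python
from collections import Counter

N_BUCKETS = 1024   # hypothetical bucket count
N_LATCHES = 32     # hypothetical latch child count (fewer than buckets)


def bucket_of(file_id, block_id):
    # Toy hash of the data block address; not Oracle's hash function.
    return (file_id * 1_000_003 + block_id) % N_BUCKETS


def latch_child_of(bucket):
    # Assumed mapping: buckets are spread over latch children by modulo.
    return bucket % N_LATCHES


def latch_gets(accesses):
    """Count latch gets per child for a sequence of (file_id, block_id)."""
    gets = Counter()
    for file_id, block_id in accesses:
        gets[latch_child_of(bucket_of(file_id, block_id))] += 1
    return gets


# 1000 accesses spread over 1000 different blocks ...
spread = latch_gets((4, block) for block in range(1000))
# ... versus 1000 accesses hammering the same three blocks.
hot = latch_gets((4, block % 3) for block in range(1000))

print("busiest child, spread access:", max(spread.values()))
print("busiest child, hot blocks:   ", max(hot.values()))
```

With the same number of accesses, the hot-block case puts all of its gets on at most three latch children, while the spread case distributes them across all children.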
Back to our problem…
Let’s revisit our problem at hand. The wait listing printed above shows that this latch contention involves two types of latches: latch #127 is the simulator lru latch and latch #122 is the cache buffers chains latch.
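The listing prints latch addresses in decimal, whereas latch views such as v$latch_children expose the address as a RAW (hexadecimal) value. As a minimal sketch, the following Python snippet parses the P1_P2_P3_TEXT column, converts the address to hex, and counts waiters per latch. The function name parse_wait and the 16-digit padding (which assumes a 64-bit platform) are my own choices for illustration, and the sample rows are just the ones visible in the listing.

```python
import re
from collections import Counter

# Matches the P1_P2_P3_TEXT column as printed in the listing.
WAIT_TEXT = re.compile(r"address (\d+)-number (\d+)-tries (\d+)")


def parse_wait(text):
    """Split one wait text into address (decimal and hex), latch# and tries."""
    match = WAIT_TEXT.search(text)
    if match is None:
        raise ValueError(f"unrecognized wait text: {text!r}")
    address, latch_no, tries = (int(g) for g in match.groups())
    return {
        "address": address,
        "addr_hex": format(address, "016X"),  # assumes 64-bit addresses
        "latch_no": latch_no,
        "tries": tries,
    }


# Only the rows visible in the listing; the elided rows are not guessed.
sample = [
    "address 15873156640-number 127-tries 0",
    "address 15873156640-number 127-tries 0",
    "address 15873156640-number 127-tries 0",
    "address 17748373096-number 122-tries 0",
    "address 17748373096-number 122-tries 0",
    "address 17748373096-number 122-tries 0",
    "address 19878655176-number 122-tries 1",
]

waiters = Counter(
    (w["latch_no"], w["addr_hex"]) for w in map(parse_wait, sample)
)
for (latch_no, addr_hex), count in waiters.most_common():
    print(latch_no, addr_hex, count)
```

Grouping by address this way makes it easy to see whether the waiters pile up on one or two child latches or are spread across many.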