Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internal & E-business suite.

Posts Tagged ‘gc buffer busy’

RMOUG Training Days 2015

Posted by Riyaj Shamsudeen on January 19, 2015

I will be talking in Rocky Mountain Oracle User Group Training Days 2015(, with live demos (hopefully there will be no failures in the demo). My topics are:

Feb 17: Deep dive: 3:15PM to 5:15PM – RAC 12c optimization: I will discuss RAC global cache layer in detail with a few demos. You probably can’t find these deep Global Cache layer details anywhere else 🙂

Feb 19: Wednesday: 2:45PM to 3:45PM – Advanced UNIX tools: I will discuss both Solaris and Linux advanced tools to debug deep performance issues.

Feb 19: Wednesday: 12:15PM – 1:15PM – Exadata SIG panel with Alex Fatkulin.

Come to Denver. Come on, it won’t be cold ( I think 🙂 )

Uploading presentation and scripts of the presentation files: RMOUG_2015_Riyaj_RAC_12c_optim files

Presentation files for Unix tools demos: RMOUG_2015_Riyaj_Unix_tools

Posted in 11g, 12c, Presentations, RAC | Tagged: , , , | 1 Comment »

Golden rules of RAC performance diagnostics

Posted by Riyaj Shamsudeen on March 20, 2014

After collaborating with many performance engineers in a RAC database, I have come to realize that there are common pattern among the (mis)diagnosis. This blog about discussing those issues. I talked about this in Hotsos 2014 conference also.

Golden rules

Here are the golden rules of RAC performance diagnostics. These rules may not apply general RAC configuration issues though.

  1. Beware of top event tunnel vision
  2. Eliminate infrastructure as an issue
  3. Identify problem-inducing instance
  4. Review send-side metrics also
  5. Use histograms, not just averages

Looks like, this may be better read as a document. So, please use the pdf files of the presentation and a paper. Presentation slide #10 shows indepth coverage on gc buffer busy* wait events. I will try to blog about that slide later (hopefully).

Golden rules of RAC diagnostics paper

Golden rules of rac diagnostics ppt

Scripts mentioned in the presentation can be downloaded here.


Posted in 11g, Performance tuning, Presentations, RAC | Tagged: , , , , , | 3 Comments »

Dynamic Resource Mastering in 12c

Posted by Riyaj Shamsudeen on February 28, 2014

I blogged about Dynamic Resource Mastering (DRM) in RAC here . DRM freezes the global resources during the reconfiguration event and no new resources can be allocated during the reconfiguration. This freeze has a dramatic effect of inducing huge amount of waits for gc buffer busy [acquire|release] events and other gcs drm freeze release, gcs remaster events. In database version 12c, DRM has been improved further.

A major improvement I see is that not all resources are frozen at any time. Essentially, resources are broken down in to partitions and only a resource partition is frozen. This improvement should decrease the impact of DRM related waits tremendously.

LMON Trace file

Following shows the snippet from the LMON trace file. As you see, only one resource partition is frozen, at-a-time. Resources in the first partition is frozen, completes the resource remastering tasks, and unfreezes that resource partition. Then freezes next resource partition and continues until all resources are remastered.

Read the rest of this entry »

Posted in 12c, Performance tuning, RAC | Tagged: , , , | Leave a Comment »

gc buffer busy acquire vs release

Posted by Riyaj Shamsudeen on April 19, 2012

Last week (March 2012), I was conducting Advanced RAC Training online. During the class, I was recreating a ‘gc buffer busy’ waits to explain the concepts and methods to troubleshoot the issue.


Let’s define these events first. Event ‘gc buffer busy’ event means that a session is trying to access a buffer,but there is an open request for Global cache lock for that block already, and so, the session must wait for the GC lock request to complete before proceeding. This wait is instrumented as ‘gc buffer busy’ event.

From 11g onwards, this wait event is split in to ‘gc buffer busy acquire’ and ‘gc buffer busy release’. An attendee asked me to show the differentiation between these two wait events. Fortunately, we had a problem with LGWR writes and we were able to inspect the waits with much clarity during the class.

Remember that Global cache enqueues are considered to be owned by an instance. From 11g onwards, gc buffer busy event differentiated between two cases:

  1. If existing GC open request originated from the local instance, then current session will wait for ‘gc buffer busy acquire’. Essentially, current process is waiting for another process in the local instance to acquire GC lock, on behalf of the local instance. Once GC lock is acquired, current process can access that buffer without additional GC processing (if the lock is acquired in a compatible mode).
  2. If existing GC open request originated from a remote instance, then current session will wait for ‘gc buffer busy release’ event. In this case session is waiting for another remote session (hence another instance) to release the GC lock, so that local instance can acquire buffer.


Following output should show the differentiation with much clarity.

Read the rest of this entry »

Posted in 11g, Oracle database internals, Performance tuning, RAC | Tagged: , , , , | 23 Comments »

gc buffer busy waits

Posted by Riyaj Shamsudeen on September 27, 2010

If you have the opportunity to work in a RAC environment, you probably encountered (or you will encounter soon 🙂 ) this wait event: ‘GC Buffer busy’. We will explore issues leading to excessive waits for this wait events and how to resolve the issue effectively.

What is a GC buffer busy wait?

In a simple sense, GC buffer busy means that the buffer in the buffer cache, that the session is trying to access is already involved in another ongoing global cache operation. Until that global cache operation completes, session must wait. I will explain this with an example: Let’s say that session #1 is trying to access a block of file #7 block ID 420. That block is in the remote cache and so, session #1 opened a BL lock on that block, requested the remote LMS process to send the block, and waiting for the block shipping to complete. Session #2 comes along shortly thereafter and tries to access the same buffer. But, the block is already involved in a global cache operation and so, session #2 must wait for the session #1 to complete GC (Global Cache) activity before proceeding. In this case, Session #2 will wait for ‘gc buffer busy’ wait event with a time-out and repeatedly tries to access that buffer in a loop.

Consider the scenario if the block is a hot block such as segment header, index branch block or transaction table header block etc. In this case, you can see that many such sessions waiting for the ‘Gc buffer busy’ wait event. This can lead to complex wait scenario quickly as few background processes also can wait for ‘gc buffer busy’ event leading to an eventual database hang situation. If you kill the processes, then pmon might need to access that block to do a rollback, which means that pmon can get stuck waiting for ‘gc buffer busy’ waits too.

Few Scenarios

This wait event can occur for many different reasons, including bugs. For example, I encountered a bug in which the index branch block split can cause excessive ‘gc buffer busy’ waits. So, It is not possible to document all scenarios that can lead to gc buffer busy waits. But, it is worth exploring few most common scenarios, and then discuss a mitigation plans for those scenarios. The methods discussed here will be helpful to understand which types of blocks are involved in this issue too.

1. Right hand growth indexes

Typically, applications generate surrogate keys using a sequence based key generation, an example would be employee_id column in the employee table. These types of unique or primary key columns are usually populated using a sequence generated value with a unique constraint on the column. A unique index may be created to support the unique constraint. Although, it is possible to create a non-unique index to support the unique constraint, and that non-unique index also will suffer from the issues similar to its unique counterparts. This problem is related to more about uniqueness of data and locality of new rows, rather than the type of index.

Indexes store sorted (key[s], ROWID) pair, meaning values in the index are arranged in an ascending or descending key column order. ROWIDs in the (key[s], ROWID) pair points to a specific row in the table segment with that index column value. Also, Indexes are implemented as B-Tree indexes. In the case of unique indexes, on columns populated by sequence based key values, recent entries will be in the right most leaf block of the B-Tree. All new rows will be stored in the right most leaf block of the index. As more and more sessions insert rows in to the table, that right most leaf block will be full. Oracle will split that right most leaf block in to two leaf blocks: One block with all rows except one row and a new block with just one row. (This split, aka Index 90-10 split, needs to modify branch block also ). Now that new leaf block becomes the right most leaf block and the concurrency moves to the new leaf block. Simply put, you can see concurrency issues moving from one block to another block in an orderly fashion.

As you can imagine, all sessions inserting in to the table will insert rows in to the current right most leaf block of the index. This type of index growth termed as “Right Hand Growth” Indexes. As sessions inserts in to the right most leaf block of the index, that index becomes hot block, and concurrency on that leaf block leads to performance issues.

In RAC, this problem is magnified. Sequence cache is instance specific and if the cache is small (defaults to 20), then the right most leaf block becomes hot block, not just in one instance, but across all instances. That hot – right most – leaf block will be transferred back and forth between the instances. If the block is considered busy, then LMS process might induce more delays in transferring the blocks between the instances. While the block is in transit, then the sessions accessing that block must wait on ‘GC buffer busy’ waits and this quickly leads to excessive GC buffer busy waits. Also, immediate branch block of those right most leaf block will play a role in the waits during leaf block splits.

So, how bad can it get? It can be very bad. A complete database hang situation is a possibility. Notice below that over 1000 sessions were waiting for ‘gc buffer busy’ events across the cluster. Application is completely down.

   INST_ID SQL_ID        EVENT              STATE    COUNT(*)
---------- ------------- ------------------ -------- --------
         4 4jtbgawt37mcd gc cr request     WAITING    9
         3 4jtbgawt37mcd gc cr request     WAITING    9
         3 a1bp5ytvpfj48 gc buffer busy    WAITING   11
         4 a1bp5ytvpfj48 gc buffer busy    WAITING   17
         4 14t0wadn1t0us gc buffer busy    WAITING   33
         4 gt1rdqk2ub851 gc buffer busy    WAITING   34
         4 a1bp5ytvpfj48 buffer busy waits WAITING   35
         2 a1bp5ytvpfj48 gc buffer busy    WAITING   65
         1 a1bp5ytvpfj48 gc buffer busy    WAITING  102
         2 7xzqcrdrnyw1j gc buffer busy    WAITING  106
         2 7xzqcrdrnyw1j enq: TX - index c WAITING  173
         1 7xzqcrdrnyw1j gc buffer busy    WAITING  198
         3 7xzqcrdrnyw1j gc buffer busy    WAITING  247
         4 7xzqcrdrnyw1j gc buffer busy    WAITING  247

How do you analyze the problem with right hand key indexes?

First, we need to verify that problem is due to right hand indexes. If you have access to ASH data, it is easy. For all sessions waiting for ‘gc buffer busy’ event query the current_obj#. Following query on ASH can provide you with the object_id involved in these waits. Also, make sure the statement is UPDATE or INSERT statements, not SELECT statement[ SELECT statements are discussed below].

select sample_time,  sql_id, event, current_obj#,sum (cnt)  from  gv$active_session_history
   where sample_time between  to_date ('24-SEP-2010 14:28:00','DD-MON-YYYY HH24:MI:SS') and
     to_date ('24-SEP-2010 14:29:59','DD-MON-YYYY HH24:MI:SS')
    group by  sample_time,  sql_id, event, current_obj#
   order by sample_time

SAMPLE_TIME                             |SQL_ID              |EVENT                                   |CURRENT_OBJ#|  COUNT(*)
26-AUG-10 PM               |14t0wadn1t0us       |gc buffer busy                          |       8366|        33
select owner, object_name, object_type from dba_objects where object_id=8366 or data_object_id=8366;

In this example, current_obj# is 8366, which we can query the dba_objects to find the correct object_id. If this object is an unique index or almost unique index, then you might be running in to a right hand growth indexes.

If you don’t have access to ASH then, you need to sample gv$session_wait (or gv$session from 10g), group by p1, p2 to identify the blocks inducing ‘gc buffer busy’ waits. Then, map those blocks to objects suffering from the issue.

select event,    p1, p2, count(*) from  gv$session s
event ='gc buffer busy' and state='WAITING'
group by event, p1, p2
order by 4

You can also use my script segment_stats_delta.sql to see the objects suffering from ‘gc buffer busy’ waits. See below for an example use:

segment_stats_delta.sql v1.01 by Riyaj Shamsudeen

...Prints Change in segment statistics in the past N seconds.
...Default collection period is 60 seconds.... Please wait for at least 60 seconds...

!!! NOTICE !!! This scripts drops and recreates two types: segment_stats_tbl_type and segment_stats_type.

Following are the available statistics:
Pay close attention to sampled column below. Statistics which are sampled are not exact and so, it might not reflect immediately.

NAME                                                             SAM
---------------------------------------------------------------- ---
logical reads                                                    YES
buffer busy waits                                                NO
gc buffer busy                                                   NO
db block changes                                                 YES
gc cr blocks received                                            NO
Enter value for statistic_name: gc buffer busy
Enter value for sleep_duration: 60
WSH                            | WSH_DELIVERY_DETAILS_U1        | 34
APPLSYS                        | WF_ITEM_ATTRIBUTE_VALUES_PK    | 2
WSH                            | WSH_DELIVERY_DETAILS_U1        | 2
INV                            | MTL_MATERIAL_TRANSACTIONS_U1   | 1
ONT                            | OE_ORDER_LINES_ALL             | 1

PL/SQL procedure successfully completed.

How do you resolve ‘GC buffer busy’ waits due to right hand growth index?

In a simplistic sense, You need to reduce the concurrency on the right most leaf block. There are few options to reduce the concurrency, and hash partitioning that unique index (or almost unique index) is a better solution of all. For example, if we convert the unique index as hash partitioned index with 32 partitions, then you are reducing the concurrency on that right most leaf block by 32 fold. Why? In hash partitioning scheme with 32 partitions, there are 32 index trees and inserts will be spread across 32 right most leaf blocks. In contrast, there is just 1 index tree in the case of non-partitioned index. Essentially, Each partition gets its own index tree. Given an ID column value, that row is always inserted in to a specific partition and that partition is identified by applying hash function over the partitioning column.

In the case of non-partitioned index, for example, values from 1000 to 1010 will be stored in the right most leaf block of the index. In the case of partitioned index with 32 partitions, value 1000 will be stored in partition 24, value 1001 will be stored in partition 19, meaning, values are hashed and spread around 32 partitions leading to improved concurrency. This will completely eliminate Right hand Growth index concurrency issue.

Non-partitioned index

Hash partitioned index with 2 partitions

So, What is the drawback of hash partitioning indexes?

If your query needs to do range scan with a predicate similar to ‘id between 1000 and 1005’, then the index range scan will need to scan 32 index trees, instead of one index tree as in the case of non-partitioned tables. Unless, your application executes these sort of queries millions of times, you probably wouldn’t notice the performance difference. For equality predicate, such as ‘id=:B1’, this is not an issue as the database will need to scan just one index tree.

Let’s discuss about reverse key indexes too. As Michael Hallas and Greg Rahn of real-world performance group said (and I happily concur), reverse key indexes are evil. If you are in Oracle Database version 10g, you can create a partitioned index on a non-partitioned table. So, If your application is suffering from a right hand growth index contention issue, you can convert the non-partitioned index to a hash-partitioned index with minimal risk. So, there aren’t many reasons to use reverse key indexes in 10g [ Remember that range scan is not allowed in the reverse key indexes either]. But, if you are unfortunate enough to support Oracle 9i database, you can NOT create a partitioned index on a non-partitioned table. If your application suffers from right hand growth index concurrency issues in 9i, then your options may be limited to reverse key indexes (or playing with sequences and code change or better yet – upgrade to 10g).

2. What if the statement is a SELECT statement?

It is possible for the SELECT statement to suffer from gc buffer busy waits too. If you encounter a scenario in which the object is an index and the statement is a SELECT statement, then this paragraph applies to you. This issue typically happens if there is higher concurrency on few blocks. For example, excessive index full scan on an index concurrently from many instances can cause ‘gc buffer busy’ waits. In this case, right approach would be tune the SQL statement to avoid excessive access to those index blocks.

In my experience, gc buffer busy waits on SELECT statement generally happens if you have problems with statistics and execution plans. So, verify that execution plan or concurrency didn’t change recently.

3. Freelists, Freelist groups

What if the object ID we queried in Active Session History belongs to a table block and the statement is an INSERT statement? We need to check to see if that block is a segment header block. If there are many concurrent inserts, and if you don’t use ASSM tablespace, then the inserts need to find free blocks. Segment header of an object stores the free list [ if you don’t use freelist groups]. So, concurrent inserts in to a table will induce excessive activity on the Segment header block of that table leading to ‘gc buffer busy’ waits.

Concurrent inserts in to a table with 1 freelist/1 freelist groups will also have contention in a non-segment header block too. When the session(s) searches for a free block in a freelist, all those sessions can get one or two free blocks to insert. This can lead to contention on that block.

Right approach in this case is to increase freelists, free list groups and initrans on those objects (and might need reorg for these parameters to take effect ). Better yet, use ASSM tablespaces to avoid these issues.

4. Other Scenarios

We discussed just few common issues. But, ‘gc buffer busy’ waits can happen for many reasons. Few of them are: CPU starvation issues, Swapping issues, interconnect issues etc. For example, if the process that opened the request for a block did not get enough CPU, then it might not drain the network buffers to copy the buffer to buffer cache. Other sessions accessing that buffer will wait on ‘gc buffer busy’ waits. Or If there is a network issue and the Global cache messages are slower, then it might induce higher gc buffer busy waits too. Statistics ‘gc lost packets’ is a good indicator for network issues, but not necessarily a complete indicator.

As a special case, if the sequences have been kept with lower cache value, then blocks belonging to seq$ table can be an issue too. Increasing cache for highly used sequence should be considered as a mandatory task in RAC.


In summary, next time you encounter this issue, drill down to see which object types and statements are involved. Debug to understand the root cause as ‘gc buffer busy’ waits are usually symptoms. You can read this blog in the traditional format as gc buffer busy waits_orainternals.pdf


Posted in Oracle database internals, Performance tuning, RAC | Tagged: , , , , | 25 Comments »

RAC object remastering ( Dynamic remastering )

Posted by Riyaj Shamsudeen on March 25, 2010

In RAC, every data block is mastered by an instance. Mastering a block simply means that master instance keeps track of the state of the block until the next reconfiguration event (due to instance restart or otherwise).

Hash to the master

These data blocks are mastered in block ranges. For example, range of blocks starting from file 10, block 1 through block 128 may be mastered by instance 1, blocks from file 10, block 129 through 256 are mastered by instance 2 etc. Of course, there are differences between various versions 10g, 11g etc, but Idea here is that block ranges are uniformly mastered between various instances so that Global cache grants are evenly distributed among the instances. Interestingly, length of the block range is 128 from 10g onwards (Julian Dyke mentioned that is 1089 in 9i, but I have not personally tested it). Of course, Support recommends you to unset db_file_multiblock_read_count which will be auto adjusted to 128 which means that Full block range can be read with fewer GC messages, I suppose. I digress.

Further, Michael Möller pointed out that this hash-algorithm is further optimized: The hash-algorithm used when initially computing the master node from the DBA, results in a “virtual master”, which is then translated to a real (online&open) master by a lookup table (the length of which is the maximum number of possible nodes (128 ?). This means that when one node goes off/on-line, RAC does NOT have to recalculate the hash for all blocks, but only distribute the new Hash-to-node table. (One can later visualize dynamic remastering as an additional lookup table between the hash value and node. This table also needs redistributing on node changes.)

Following SQL is helpful in showing masters and owners of the block. This SQL joins, x$kjbl with x$le to retrieve resource name. If you are familiar with Oracle locking strategy, you would probably recognize the format of these cache fusion (aka old PCM) locks. Lock type in this case is BL, id1 is block# and id2 is file_id in this case. Column kjblname2 provides a decimal format lock resource.

Please observe the output below:

  1. Block range: File 1, block 6360-6376 is mastered by node 3 and also owned by node 3.
  2. Block range: File 1, blocks upto 10709 is mastered by instance 0 and owned by instance 3.
  3. Next block range from 10770 is mastered by instance 2 and owned by 3.

Also, note that this output is generated from a database with no remastering done yet.

REM In kjblname2 first entry before ',' is block and seond entry file_id*65536 for BL locks

select kj.*, le.le_Addr from (
select kjblname, kjblname2, kjblowner, kjblmaster, kjbllockp,
substr ( kjblname2,  instr(kjblname2,',')+1,   instr(kjblname2,',',1,2)-instr(kjblname2,',',1,1)-1)/65536 fl,
substr ( kjblname2, 1, instr(kjblname2,',')-1) blk
 from x$kjbl
) kj, x$le le
where le.le_kjbl = kj.kjbllockp
 order by le.le_addr

---------------------- --------------- ---------- ----------- --- ----- ----------------
[0x18d8][0x10000],[BL] 6360,65536,BL           3          3     1  6360 000000038DF9AB08
[0x18e7][0x10000],[BL] 6375,65536,BL           3          3     1  6375 000000038DFBF818 => case #2
[0x18e8][0x10000],[BL] 6376,65536,BL           3          3     1  6376 000000038DFD3BA0
[0x29d1][0x10000],[BL] 10705,65536,BL          3          0     1 10705 00000005D6FE9230
[0x29d5][0x10000],[BL] 10709,65536,BL          3          0     1 10709 000000038EFBB990
[0x2a12][0x10000],[BL] 10770,65536,BL          3          2     1 10770 000000075FFE3C18
[0x2a18][0x10000],[BL] 10776,65536,BL          2          2     1 10776 000000038EFA8488 => case #1 
[0x2a19][0x10000],[BL] 10777,65536,BL          3          2     1 10777 000000038EFB7F90 
[0x2a1a][0x10000],[BL] 10778,65536,BL          1          2     1 10778 000000038EFCC318 

Let’s consider three simple cases of SELECT statement running instance 3:

  1. A session is trying to access the block file 1, block 10776, but that block is mastered by instance 2 and also that block is owned by instance 2 (meaning, it is in instance 2 cache). So, instance 3 will sent a PR (Protected Read) mode BL lock request on that block to instance 2. Ignoring additional complexities, instance 2 will grant PR mode lock to instance 3 and transfer the block to instance 3. Obviously, this involves multiple GC messages, grants and block transfer. Statistics ‘gc remote grants’ gets incremented too.
  2. Let’s consider that session is trying to access another block: file 1, block 6375. That block is mastered by instance 3 and also owned by instance 3. At this point, there is no additional GCS/GES processing is needed and the session pin that buffer and continue the work.
  3. Let’s consider a third case. Session is trying to access file 1 block 6374. That block is not in any buffer cache, but instance 3 is master of the block, so local affinity locks are acquired with minimal GC messages and waits. That block is read from the disk in to the buffer cache

In the case #2 and #3 above, requesting instance also is the master node of a block or block range. In these cases, statistics ‘gc local grants’ is incremented and cheaper local affinity locks on those block ranges are acquired avoiding many Global cache messages.

So far so good, but what if, say instance 1, is reading one object (table, index etc) aggressively, but other instances are not reading that object at all? [ through some sort of application node partitioning or just plain workload]. Does it make sense for the instance accessing that object aggressively request a grant to the remote instance(s) for each OPEN on that object’s blocks? Especially, if the blocks are read in to the buffer cache, but disappears soon from the buffer cache? Wouldn’t that be better if the instance reading that object aggressively is also the master of that object, as in the cases #2 and #3 above?

In addition to that, if the block is supposed to be thrown away from buffer cache (close of BL lock) or if the block needs to be written, then that will involve additional overhead/messaging between the master instance and owner instance since the ownership needs to be communicated back to the master of the block.

Enter Object remastering.

Object Remastering

There are many new features in 10g/11g RAC stack. One of them is Object remastering feature. This feature was implemented in 10gR1 and improved in 10gR2 and further enhanced in 11g. I realize there are parameters in 9i also, but I don’t think it worked as intended.

With object remastering feature, if an object is accessed by an instance aggressively, then that instance will become the master of the object reducing gc remote grants improving performance of the application. In the prior sentence, I used the word “accessed”, but it is a loose term, and the correct term is if the instance is requesting much BL locks on an object, then that object can be remastered. In an ideal world, even if the application is not partitioned, remastering of the objects that were accessed aggressively from one instance will acquire cheaper local instance affinity locks and effective RAC Tax will be minimal.

Well, I said, in an ideal world 🙂 There are few issues here:

  1. Instance do not remember prior mastership across restarts. This means that instance needs to re-learn the object mastership map after every restart. I can see the complexities of remembering the mastership, but it is possible to implement that.
  2. Remastering is not exactly cheap. Instance GRD is frozen during reconfiguration and in a very busy instances, this can take many seconds leading to instance freeze for several seconds. While 10gR2 introduce parallel reconfiguration (_rcfg_parallel_Replay parameter controls this behavior) using all LMS processes to complete the reconfiguration, still, several seconds of freeze is not exactly acceptable in many environments.
  3. I advice my clients to keep LMS processes to a lower value (3 to 5), at the most, but instance reconfiguration effective parallelism is reduced if we reduce number of LMS processes.
  4. Last, but not the least important point is that, default values of few parameters that trigger remastering events are quite low for busy environments causing frequent remastering of objects. In an E-Business World, minor mismanagement in the manager configuration can lead to a massive reconfiguration issues.

Parameters, views and internals

Few parameters are controlling this behavior, not well documented, my test case results are not very accurate either. But, these parameters are giving us a picture of what is going on internally. These parameters are applicable to 10gR2 and below. For 11g, whole set of different parameters comes in to play and I will blog about the differences in another blog entry.

X$object_affinity_statistics maintains the statistics about objects and OPENs on those objects. It is important to understand the difference between OPEN and Buffer access. If the block is in the cache already in a suitable mode, there is no need for BL opens on that block. So, if the sessions are accessing the same block repeatedly without requesting any additional BL locks, then the count is not incremented. So, OPEN is simply a number of BL request initiated in an ephimeral time frame.

LCK0 process maintains these object affinity statistics. If an instance opens 50 more opens on an object then the other instance (controlled by _gc_affinity_limit parameter), then that object is a candidate for remastering. That object is queued and LMD0 reads the queue and initiates GRD freeze. LMON performs reconfiguration of buffer cache locks working with LMS processes. All these are visible in LMD0/LMON trace files. Parameter _gc_affinity_time controls how often the queue is checked to see if the remastering must be triggered or not with a default value of 10 minutes.

Now, you don’t want just any object as a candidate for remastering, meaning, if instance 1 opened 101 BL locks on that object and instance 2 opened 50 BL locks on that object, you don’t want to trigger object remastering. Only objects with higher amount of BL lock requests must be queued for remastering. Well, that threshold seems to be controlled by another parameter _gc_affinity_minimum: This parameter is defined as “minimum amount of dynamic affinity activity per minute” to be a candidate for remastering. Defaults to 2500 and I think, it is lower in a busy environment.

Few lines from LMD0 trace files showing that LMD0 is reading a request queue:

* kjdrchkdrm: found an RM request in the request queue 
  Transfer pkey 6589904 to node 3
*** 2009-10-12 11:41:20.257

How bad can it get?

Performance can suffer if there are remastering issues. Following AWR report shows that few instances froze due to DRM reconfiguration issue. Same type of freeze is visible in all other nodes too. gc buffer busy is a side effect of DRM freeze (not always, but in this case).

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
gc buffer busy                    1,826,073     152,415     83   62.0    Cluster
CPU time                                         30,192          12.3
enq: TX - index contention           34,332      15,535    453    6.3 Concurrenc
gcs drm freeze in enter server       22,789      11,279    495    4.6      Other
enq: TX - row lock contention        46,926       4,493     96    1.8 Applicatio

At the same time, a storm of DRM were started. This lead to repetitive DRM freeze and instance configuration leading to massive performance issues.

* received DRM start msg from 2 (cnt 5, last 1, rmno 404) 
Rcvd DRM(404) Transfer pkey 1598477 from 3 to 2 oscan 1.1
Rcvd DRM(404) Dissolve pkey 6100030 from 2 oscan 0.1
Rcvd DRM(404) Dissolve pkey 5679943 from 2 oscan 0.1
Rcvd DRM(404) Transfer pkey 6561036 from 0 to 2 oscan 1.1
Rcvd DRM(404) Transfer pkey 5095243 to 2 oscan 0.1

A small test case

Let’s walk through a test case that shows DRM in play. Query used index access path to read nearly all blocks from a big index.

Session #1: 
select data_object_id from dba_objects where object_name='WMS_N1';

REM No affinity statistics yet.
select * from x$object_affinity_statistics where object=6099472;
no rows selected.
REM executing a costly select statement
select /*+ index(a WMS_N1 */ count(*) from big_table1 a;

Session #2: I was monitoring the DRM tables:
REM DRM operations completed so far is 409. We will keep track of this count to see the remastering count intereasing. There are few
REM other interesting fields in this view.
select drms from X$KJDRMAFNSTATS;
REM  I see that 23442 opens on that index already since the session #1 started running. 
select * from x$object_affinity_statistics where object=6099472;
ADDR                   INDX    INST_ID     OBJECT       NODE      OPENS
---------------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF7C05BFA8         14          1    6099472          1      23442
REM No mastering has kicked in for that object.
select * from v$gcspfmaster_info where object_id=6099472;
no rows selected
REM few seconds later, open count increased from 23442 -> 33344.
select * from x$object_affinity_statistics where object=6099472;
ADDR                   INDX    INST_ID     OBJECT       NODE      OPENS
---------------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF7C05BFA8         16          1    6099472          1      33344

REM No remastering has kicked in for that object. 
select * from v$gcspfmaster_info where object_id=6099472;
no rows selected

REM Surprisingly, while session #1 is still executing, the counter for OPENS was zeroed out even when though DRM has not been triggered yet.
REM  OPENS Increased to 1229 from 0 since the session #1 is still executing.
ADDR                   INDX    INST_ID     OBJECT       NODE      OPENS
---------------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF7C05BFA8          0          1    6099472          1       1229
REM Approximately 10 minutes or so later, Remastering kicked in..
REM # of DRMS increased from 409 to 411
select drms from X$KJDRMAFNSTATS;
REM Remastering of that index happened. Now the master is 0 which is instance 1.
select * from v$gcspfmaster_info where object_id=6099472;
---------- ---------- -------------- --------------- ------------
         0    6099472              0           32767            1
REM Opens are still increasing but remastering has already occurred.
select * from x$object_affinity_statistics where object=6099472;
ADDR                   INDX    INST_ID     OBJECT       NODE      OPENS
---------------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF7C05AF78          2          1    6099472          1      42335
FFFFFFFF7C05AF78          3          1    6099472          2          1
REM LMON trace files are also indicating transfer of that pkey. Notice that pkey is same as object_id
*** 2010-03-23 10:41:57.355
Begin DRM(411) - transfer pkey 6099472 to 0 oscan 0.0
 ftd received from node 1 (4/0.30.0)
 all ftds received
REM few seconds later, opens have been reset again.
select * from x$object_affinity_statistics where object=6099472;
ADDR                   INDX    INST_ID     OBJECT       NODE      OPENS
---------------- ---------- ---------- ---------- ---------- ----------
FFFFFFFF7C05BFA8          0          1    6099472          1       7437
REM Still the master is instance 1. 
select * from v$gcspfmaster_info where object_id=6099472;
---------- ---------- -------------- --------------- ------------
         0    6099472              0           32767            1

Essentially, an object was remastered after excessive BL locking requests ( in a loose term accesses) on that index.

undo and affinity

Mastering of Undo segments differ from non-undo segment mastering. With non-undo segments, all the blocks are mastered by a hash technique spreading mastership among instances for a segment. Only after an instance opens BL locks aggressively on a segment that segment is mastered. But, for undo segments, Instance that activates an undo segment masters the segment immediately. This makes sense, since that undo segment will be used by the instance opening the segment in most cases. Parameter _gc_undo_affinity controls whether this dynamic undo remastering is enabled or not.

Since undo segments do not have real object_ids, a dummy object_ids over a value of 4294950912 is used. For example, undo segment 1 (with usn=1) will have an object_id of 4294950913, usn=2 will have object_id of 4294950914 etc.
[ 4294950912 = power(2,32) – power (2,14) = xFFFFC000 ]

select objecT_id, object_id-4294950912 usn, current_master, previous_master, 
remaster_cnt  from v$gcspfmaster_info where object_id>4294950912

---------- ---------- -------------- --------------- ------------
4294950913          1              0           32767            1
4294950914          2              0           32767            1
4294950915          3              0           32767            1
4294950916          4              0           32767            1
4294950917          5              0           32767            1
4294950918          6              0           32767            1
4294950919          7              0           32767            1
4294950920          8              0           32767            1
4294950921          9              0           32767            1
4294950922         10              0           32767            1
4294950923         11              1           32767            1
4294950924         12              1           32767            1
4294950925         13              1           32767            1

REM Notice that usno 0 is in both instances. That is system undo segment.
REM As you can see below first 10 undo segments are mastered by node 1 is next 3 are mastered by instance 2.

select inst_id, usn, gets from gv$rollstat where usn <=13
order by inst_id, usn
   INST_ID        USN       GETS
---------- ---------- ----------
         1          0       3130
         1          1     108407
         1          2      42640
         1          3      43985
         1          4      41743
         1          5     165166
         1          6      43485
         1          7     109044
         1          8      23982
         1          9      39279
         1         10      48552
         2          0       4433
         2         11     231147
         2         12      99785
         2         13      1883

I was not successful in triggering another undo segment remastering event. I created one active transaction generating 200K undo blocks in one node, another node was reading that table and I can see enormous waits for those undo blocks. But, I didn’t see any remastering events related to that undo segment. Not sure why it did not work, may be the conditions for the undo segment remastering is different.

[ PS: I am able to manually remaster the undo segment using lkdebug command discussed below: So, code must be remastering the undo segments automatically too, but may be some other conditions must be met.

  1* select * from v$gcspfmaster_info where object_id=431+4294950912
---------- ---------- -------------- --------------- ------------
         0 4294951343              0           32767            1
Oradebug lkdebug –m pkey 4294951343

* kjdrchkdrm: found an RM request in the request queue
  Transfer pkey 4294951343 to node 1
*** 2010-03-24 12:47:29.011
Begin DRM(520) - transfer pkey 4294951343 to 1 oscan 0.1
 ftd received from node 0 (4/0.31.0)
 all ftds received

select * from v$gcspfmaster_info where object_id=431+4294950912
SQL> /
---------- ---------- -------------- --------------- ------------
         0 4294951343              1               0            2

I am not preaching that you should modify these undocumented parameters. Far from it. Understand the parameters, if you run in to wait events such as ‘gc remaster’, ‘gcs freeze for instance reconfiguration’, understand whether the default values are quite low. Work with support and see if this can be tuned.

Manual remastering

You can manually remaster an object with oradebug command :
oradebug lkdebug -m pkey

This enqueues an object remaster request. LMD0 and LMON completes this request
*** 2010-01-08 23:25:54.948
* received DRM start msg from 1 (cnt 1, last 1, rmno 191)
Rcvd DRM(191) Transfer pkey 6984154 from 0 to 1 oscan 0.0
 ftd received from node 1 (8/0.30.0)
 ftd received from node 0 (8/0.30.0)
 ftd received from node 3 (8/0.30.0)
Current_master starts from 0. 
 1* select * from v$gcspfmaster_info where object_id=6984154
SQL> /
---------- ---------- -------------- --------------- ------------
         0    6984154              1               0            2
SQL> oradebug lkdebug -m pkey 6984154
Statement processed.
SQL>  select * from v$gcspfmaster_info where object_id=6984154
---------- ---------- -------------- --------------- ------------
         0    6984154              2               1            3


In summary, remastering is a great feature. It is a pity that some times, we fall victims of the side effects. So, if you run in to issues with remastering, don’t disable it, but see if you can tune those parameter upwards so as to control the remastering events. If you stil want to disable DRM completely, I would recommend setting _gc_affinity_limit and _gc_affinity_minimum to much higher value, say 10Million. Setting the parameter _gc_affinity_time to 0 will completely disable DRM, but that also means that you can not manually remaster objects. Further, Arup mentioned that x$object_affinity_statistics is not maintained if DRM is disabled.

Again, these are undocumented parameters. Before you change these parameters make sure that Oracle Support agrees with you.

Update 1:

From 11g onwards, affinity management renamed to policy management. For example, x$object_affinity_statistics is renamed to x$object_policy_statistics. Similarly, initialization parameters are renamed too:Parameter _gc_affinity_limit is renamed to _gc_policy_limit; Parameter _gc_affinity_time is renamed to _gc_policy_time; A new view v$policy_history is introduced and all rows with policy_event = ‘initiate_affinity’ are associated with DRM events.
Other details about the blog still holds good, except that by default _gc_policy_limit is lowered to 1500 which would mean that, by theory, 11g may have more DRM events. YMMV.

 1* select * from  v$policy_history
---------- -------------------- -------------- ----------------------  --------------------
         2 glru_on                           0                      1  10/15/2010 10:58:28
         2 glru_on                           0                      1  10/15/2010 11:21:32
         2 initiate_affinity             74809                      1  10/15/2010 13:27:44

[ Many Thanks Arup Nanda and Michael Möller (aka “M2”) for reviewing this blog entry, they contributed heavily to some of my discussion. But, any mistakes in this document is solely of mine. ]
This blog is available in a pdf format from Dynamic_remastering_RAC_orainternals.

Posted in Oracle database internals, Performance tuning, Presentations, RAC | Tagged: , , , , , , , , , , , , , , , , , , | 34 Comments »