Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internals & E-Business Suite.

Performance issue: High Kernel mode CPU usage

Posted by Riyaj Shamsudeen on October 31, 2008

Recently, we resolved a performance issue for one of our esteemed clients. The problem was interesting and worth blogging about.

Problem
An application makes 300 static connections to database DB1 on a database server, say CLNTDB. The application relies on database links: over a period of time, each session executes SQL through a database link, creating a connection in the central database PROD1. So there are 300 connections in the PROD1 database coming from the DB1 database through database links. Performance is fine during normal operation.

The problem starts when the application is shut down. Shutting down the application in DB1 creates massive CPU consumption on the PROD1DB server. Unfortunately, this spike in CPU usage lasts for 5-10 seconds and causes the ASM heartbeat to fail. Since PROD1 is a central database, this has a global effect on application functionality. See the presentation below for a graphical representation of this problem.

Symptoms
Looking at the symptoms in detail, it is visible that the CPU usage is in sys mode. It is not uncommon to see high CPU usage during a storm of disconnects, but this specific CPU usage is much higher and uses all CPUs in kernel mode. If this were a problem due to process destruction, it would show up on the CLNTDB server too, since there are 300 disconnects there as well. But this problem manifests only on the PROD1DB server.

mpstat output from the PROD1DB server shows that CPU usage is in %sys mode. Notice the numbers under the sys column below: almost all CPUs are used in sys mode. We need to drill down further to understand why so much CPU is spent in sys mode. It is also visible that no other column has abnormally high values.


CPU ..intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    554  237  651  219   87  491    0  4349    9  91   0   0
  1    911    0 2412  591  353  630    0 15210   30  63   0   7
  2    313    0  613  106  190  105    0  3562    8  90   0   2
  3    255    0  492   92  161  530    0  2914    6  92   0   2
  4     86    1    2    3    1   63    0     8    0 100   0   0
  5    662    0 1269  153  326  211    0  6753   13  77   0  10
  6    349    0  589   54  170 1534    0  3002    7  88   0   5
  7    317   81  411  105   51   61    0  2484    4  93   0   3
  8   6423 6321  697   36  148  546    0  3663    6  86   0   9
  9    363    0  639   41  154 1893    0  3214    9  85   0   6
 10    456    0  964  194  193   77    0  5797   10  81   0   8
 11    104    0   42    3    9  183    0   207    0  95   0   5
 12    195    0  279  110   31   80    0  1590    3  97   0   0
 13    449    0  844  117  158  127    0  4486    7  85   0   8

Test case
Let’s analyze this issue with a single connection to simplify the problem. I created a test user in the DB1 database, created a private database link from DB1 to PROD1, and then executed a select statement over that database link.

In db1:

create user test1 identified by test1;
grant connect, resource to test1;
conn test1/test1
create database link proddb
connect to test1 identified by test1
using 'proddb';

select * from dual@proddb;

At this point, a database link connection has been created in the PROD1 database. Query v$session in the PROD1 database to find the session created for the database link. SID 4306 is the session created for our test connection from the DB1 database.

select sid, serial#, LOGON_TIME,LAST_CALL_ET from v$session where
  logon_time > sysdate-(1/24)*(1/60) and machine='machine_name_here'
/
       SID    SERIAL# LOGON_TIME           LAST_CALL_ET
---------- ---------- -------------------- ------------
      4306      51273 12-SEP-2008 20:47:29            1

Associating this session with v$process, we found the UNIX process for this session. Now we will start truss on this UNIX process (Solaris platform).
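
The lookup itself is a standard join between v$session and v$process; a minimal sketch (plugging in the SID 4306 found above) that retrieves the OS process id (spid) to hand to truss:

select p.spid
  from v$session s, v$process p
 where s.paddr = p.addr
   and s.sid   = 4306;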

truss -d -o /tmp/truss.log -p <spid>

In DB1, we executed: exit

This disconnects the session in the PROD1 database.

Truss output
The truss output captures the activity generated by the disconnect on the PROD1 server. Reading through it, we can see that the shmdt system calls consumed the CPU time. Each shmdt call takes approximately 24ms (for example, 18.5295 - 18.5053 = 0.0242). The -d flag prints a timestamp for each call, so the delta between consecutive calls approximates the time spent in each call. Of course, these are system calls, so this CPU usage shows up in kernel mode.


18.4630 close(10)                                       = 0
18.4807 shmdt(0x380000000)                              = 0
18.5053 shmdt(0x440000000)                              = 0
18.5295 shmdt(0x640000000)                              = 0
18.5541 shmdt(0x840000000)                              = 0
18.5784 shmdt(0xA40000000)                              = 0
18.6026 shmdt(0xC40000000)                              = 0
18.6273 shmdt(0xE40000000)                              = 0
18.6512 shmdt(0x1040000000)                             = 0
18.6752 shmdt(0x1240000000)                             = 0
18.6753 shmdt(0x1440000000)                             = 0

So, one disconnect executes 10 shmdt system calls and consumes approximately 0.24 CPU seconds in kernel mode. shmdt detaches the process from a shared memory segment; since there are 10 shared memory segments (as visible in ipcs -ma output), 10 shmdt calls are executed per session disconnect.

Projecting this calculation to 300 connections, total CPU consumption will be approximately 72 seconds (300 * 0.24). Spread over 16 CPUs, that works out to roughly 4.5 seconds of elapsed time in kernel mode under a purely linear projection [in practice, due to mutex calls and such, it is not linear, and total CPU time will exceed 72 seconds]. This matches our observation of 5-10 seconds of CPU consumption in kernel mode. The first thing we need to do is reduce the number of shmdt calls, and one way is to reduce the number of shared memory segments.

Shared memory segments
We initially thought the SHMMAX kernel parameter was the limiting factor, since the SGA size was bigger than SHMMAX. But even after changing the SHMMAX parameter, restarting the server and so on, many shared memory segments were still created.

That’s interesting! So we trussed the instance startup to understand why multiple shared memory segments are created.

A few lines from the truss output show calls to the system calls _lgrpsys and pset_bind.


4.5957 munmap(0xFFFFFD7FFDAE0000, 32768)               = 0
4.5958 lgrp_version(1, )                               = 1
4.5958 _lgrpsys(1, 0, )                                = 42
4.5958 _lgrpsys(3, 0x00000000, 0x00000000)             = 19108
4.5959 _lgrpsys(3, 0x00004AA4, 0x06399D60)             = 19108
4.5959 _lgrpsys(1, 0, )                                = 42
4.5960 pset_bind(PS_QUERY, P_LWPID, 4294967295, 0xFFFFFD7FFFDFB11C) = 0
4.5960 pset_info(PS_MYID, 0x00000000, 0xFFFFFD7FFFDFB0D4, 0x00000000) = 0
4.5961 pset_info(PS_MYID, 0x00000000, 0xFFFFFD7FFFDFB0D4, 0x061AA2B0) = 0

The _lgrpsys calls indicate that there is some form of NUMA activity going on here. pset_bind is used to bind a thread or process to a specific processor set.

Enter NUMA.

NUMA or Locality groups


NUMA stands for Non-Uniform Memory Access. In high-end SMP systems, the backplane bus can become a bottleneck: in non-NUMA platforms all memory access must go through the system bus, and the system bus limits scalability. NUMA technology (aka ccNUMA) was introduced to relieve this bottleneck. Every processor has its own local memory, and processors are connected by some form of hardware interconnect. NUMA-based servers therefore differentiate between local and remote memory: access to local memory has lower latency than access to remote memory from a given processor set. There is an enormous amount of literature available about NUMA; refer to the Wikipedia entry for NUMA, Kevin Closson's blog, and my presentation below.

In Solaris, NUMA is implemented as locality groups. Put in an oversimplified way, a process running in a locality group has much lower latency when accessing memory local to that group; as the “remoteness” of the memory increases, so does the latency. In future releases of Solaris, this locality concept might be applied to other resources such as I/O.

To make optimal use of NUMA, Oracle code spreads the SGA across all locality groups. A DBWR is then assigned to each locality group, and the pset_bind calls bind that DBWR to its locality group's processor set. [I think LGWR is also assigned and bound to a processor set, but I have no confirmation of that.]
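
As a quick check (a generic sketch, not output captured from this incident), the DB writer processes an instance actually started can be listed by joining v$bgprocess to v$process; only running writers have a matching process entry:

select bg.name, p.spid
  from v$bgprocess bg, v$process p
 where bg.paddr = p.addr
   and bg.name like 'DBW%'
 order by bg.name;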

Now we understand why there are 10 shared memory segments for this SGA.

I was curious: with 16 CPUs in the server, do we really have that many locality groups defined, or is there some kind of bug that we are hitting?

Locality group
To find locality group details, lgrpinfo can be used. We found a machine with identical architecture and installed the observability tools from Observability tools for NUMA.

Output of lgrpinfo shows locality groups defined in that server.

#/usr/local/bin/lgrpinfo
lgroup 0 (root):
     Children: 10 12 14 15 17 19 21 23
     CPUs: 0-15
     Memory: installed 65024 Mb, allocated 2548 Mb, free 62476 Mb
     Lgroup resources: 1-8 (CPU); 1-8 (memory)
     Latency: 146
lgroup 1 (leaf):
     Children: none, Parent: 9 <-- leaf node
     CPUs: 0 1
     Memory: installed 7680 Mb, allocated 1964 Mb, free 5716 Mb
     Lgroup resources: 1 (CPU); 1 (memory)
     Load: 0.105
     Latency: 51
...
lgroup 9 (intermediate):
     Children: 1, Parent: 10
     CPUs: 0-5
     Memory: installed 24064 Mb, allocated 2187 Mb, free 21877 Mb
     Lgroup resources: 1-3 (CPU); 1-3 (memory)
     Latency: 81
...
lgroup 10 (intermediate):
     Children: 9, Parent: 0  <-- intermediate group
     CPUs: 0-13
     Memory: installed 56832 Mb, allocated 2491 Mb, free 54341 Mb
     Lgroup resources: 1-7 (CPU); 1-7 (memory)
     Latency: 113
...

There are many locality groups defined on this Solaris server. Locality groups form a hierarchy, and 7 of them are leaf-node locality groups. In the output above, lgroup 0 is the root group, and the intermediate groups are 10, 12, 14, 15, 17, 19, 21 and 23.

For example, for lgroup 1 the hierarchy is: 1 -> 9 -> 10 -> 0, where 0 is the root group.

Locality group latencies can easily be seen with the lgrpinfo -l command. From the output below, it is visible that access to local memory has lower latency (for example, accessing memory in locality group 1 from locality group 1 costs 51), whereas access to remote memory costs more (146 for accessing locality group 24 from group 1). These numbers are not actual latencies, merely a relative representation of latency.

--------------------------------------------------------------------------------------------------------
   |   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24
--------------------------------------------------------------------------------------------------------
0  | 146 146 113 113 113 113 113 113 146 146 146 146 146 146 146 146 113 146 113 146 146 146 146 146 146
1  | 146  51  81  81 113 113 113 113 146  81 113 113 146 113 146 146 113 146 113 146 146 146 146 146 146
2  | 113  81  51 113  81 113 113  81 113 113 113  81 113 113 113 113 113 113 113 113 113 113 113 113 113
3  | 113  81 113  51 113  81  81 113 113 113 113 113 113  81 113 113 113 113 113 113 113 113 113 113 113
4  | 113 113  81 113  51  81  81 113 113 113 113 113 113 113 113 113  81 113 113 113 113 113 113 113 113
5  | 113 113 113  81  81  51 113  81 113 113 113 113 113 113 113 113 113 113  81 113 113 113 113 113 113
6  | 113 113 113  81  81 113  51 113  81 113 113 113 113 113 113 113 113 113 113 113  81 113 113 113 113
7  | 113 113  81 113 113  81 113  51  81 113 113 113 113 113 113 113 113 113 113 113 113 113  81 113 113
8  | 146 146 113 113 113 113  81  81  51 146 146 146 146 146 146 146 113 146 113 146 113 146 113 113  81
9  | 146  81 113 113 113 113 113 113 146  81 113 113 146 113 146 146 113 146 113 146 146 146 146 146 146
10 | 146 113 113 113 113 113 113 113 146 113 113 113 146 113 146 146 113 146 113 146 146 146 146 146 146
11 | 146 113  81 113 113 113 113 113 146 113 113  81 146 113 146 146 113 146 113 146 146 146 146 146 146
12 | 146 146 113 113 113 113 113 113 146 146 146 146 113 146 146 146 113 146 113 146 146 146 146 146 146
13 | 146 113 113  81 113 113 113 113 146 113 113 113 146  81 146 146 113 146 113 146 146 146 146 146 146
14 | 146 146 113 113 113 113 113 113 146 146 146 146 146 146 113 146 113 146 113 146 146 146 146 146 146
15 | 146 146 113 113 113 113 113 113 146 146 146 146 146 146 146 113 113 146 113 146 146 146 146 146 146
16 | 113 113 113 113  81 113 113 113 113 113 113 113 113 113 113 113  81 113 113 113 113 113 113 113 113
17 | 146 146 113 113 113 113 113 113 146 146 146 146 146 146 146 146 113 113 113 146 146 146 146 146 146
18 | 113 113 113 113 113  81 113 113 113 113 113 113 113 113 113 113 113 113  81 113 113 113 113 113 113
19 | 146 146 113 113 113 113 113 113 146 146 146 146 146 146 146 146 113 146 113 113 146 146 146 146 146
..

Recap
High kernel-mode CPU usage is due to an excessive number of calls to shmdt. There are 10 shmdt calls per disconnect because there are 10 shared memory segments, and there are 10 segments because the SGA is spread across NUMA locality groups.

Summary

Indeed, the 10 shared memory segments were created to exploit NUMA. NUMA is an excellent technology, and it is a pity that we are suffering from a side effect of it. We need to resolve this issue, and we have a couple of options at this point (a sketch of the first two options follows the list):

1. Disable NUMA completely by setting the parameter _enable_NUMA_optimization to false.
2. Reduce the number of NUMA nodes using _db_block_numa (interestingly, this throws an ORA-600 error during startup).
3. Or eliminate the database-link-based architecture entirely, using Streams or materialized views.
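
For illustration only, here is a rough sketch of how options 1 and 2 would be set in an spfile. These are hidden (underscore) parameters, so exact names and behaviour vary by release; do not change them without guidance from Oracle Support.

-- hypothetical sketch: disable NUMA optimization (option 1) or reduce the
-- number of NUMA nodes Oracle uses (option 2), then restart the instance
alter system set "_enable_numa_optimization" = false scope = spfile;
alter system set "_db_block_numa" = 1 scope = spfile;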

In case you were wondering, I presented part of this at SIOUG. This blog post is also available in presentation format here: High CPU usage SYS mode.

13 Responses to “Performance issue: High Kernel mode CPU usage”

  1. Jonathan Lewis said

    Very nice article.

    I wonder if this is why 10gR2 has introduced a “_logout_storm_rate” parameter to limit the number of detaches per second? I had assumed that it was related to process cleanup in the shared pool – but your problem is much more dramatic.

    Do you know if the initial ‘shm attach’ call is equally expensive on CPU, or is it just detach that is expensive ?

    Is it possible that the cost of detaching is related to the size of the memory map that has been constructed as the process used memory? (In which case large pages or ISM might be helpful.)

  2. Hi Jonathan
    Thank you very much for your comment.

    I checked the shmat calls and they are expensive too: about 0.23 seconds are spent on shmat calls per connection made.

    15341: 0.1744 shmat(13, 0x1440000000, 040000) = 0x1440000000
    15341: 0.1828 shmat(4, 0x380000000, 040000) = 0x380000000
    15341: 0.2054 shmat(5, 0x440000000, 040000) = 0x440000000
    15341: 0.2278 shmat(6, 0x640000000, 040000) = 0x640000000
    15341: 0.2504 shmat(7, 0x840000000, 040000) = 0x840000000
    15341: 0.2728 shmat(8, 0xA40000000, 040000) = 0xA40000000
    15341: 0.2953 shmat(9, 0xC40000000, 040000) = 0xC40000000
    15341: 0.3173 shmat(10, 0xE40000000, 040000) = 0xE40000000
    15341: 0.3404 shmat(11, 0x1040000000, 040000) = 0x1040000000
    15341: 0.3634 shmat(12, 0x1240000000, 040000) = 0x1240000000

    _logout_storm_rate is a promising parameter. I will try to test it. Not sure the client will agree to set this parameter, though.

    The instance is using ISM (just checked with pmap), so I would think the effect of page table size should be small.

    0000000440000000 7962624 7962624 – 7962624 rwxsR [ ism shmid=0x5 ]

    Cheers
    Riyaj

  3. […] Jonathan Lewis @ 6:40 pm UTC Nov 2,2008 Browsing through a few blogs over the weekend, I found this interesting item from Riyaj Shamsudeen about tracking down a lost ASM heartbeat and excessive CPU usage on […]

  4. tanelp said

    Cool troubleshooting!

    Another option is to use a BEFORE LOGOFF trigger to throttle the number of logouts when certain conditions are met. You can use global context variables to keep track of logoff rates instance-wide.

    Btw, the truss -d option in Solaris just gives you timestamps of when system calls completed, not the time spent in them. truss -c should give you the time spent *inside* syscalls.

  5. great article.

  6. Hi Riyaj,

    I wonder if you are running on Solaris/AMD Opteron: Oracle on Solaris considers each Opteron processor a unit on par with a SPARC domain. An eight-socket, dual-core system with 32 GB of RAM will hold eight Oracle ccNUMA nodes. Oracle’s SGA will be split into small 2-4 GB chunks depending on the SGA allocation. Small chunks relative to the allocated SGA lower the probability of a thread accessing its “local” memory. This can make the whole scheme irrelevant and possibly creates overhead and bugs (such as bug 5173642, referenced in Metalink) for non-partitioned servers.

    _enable_NUMA_optimization was available (default value: FALSE) until Oracle 10.1.0 and was made obsolete in 10.2.0.
    _db_block_numa is since Oracle 10.2.0.2 the number of Numa “nodes”. The “node” is the ccNUMA awareness unit.

    Christian

  7. Freek said

    Hi Riyaj,

    very good article.

    I just have 1 question: you say the system has 12 cpu’s, but isn’t the mpstat output showing 14 cpu’s?

    regards,

    Freek

  8. First, Thanks very much to you all for reading my blog.

    Hi Tanel
    Thanks for the idea about the LOGOFF trigger. That’s a cool idea to explore, using global context variables to keep track of the rate of disconnects. Yes, the -c flag might have been more accurate for that truss output.

    Hi Christian
    Let me check on hardware configuration. It’s a Solaris zone and I don’t have much access in client environment.

    Hi Freek
    Yes, that was a mistake; it should be 16 CPUs and I will correct it. I think they are dual-core CPUs.

    Cheers
    Riyaj

  9. Hi Christian
    Yes, these are Solaris/AMD opteron CPUs.

    Ryan Matteson, an excellent Solaris system administrator [with our client], said: ” That would make sense, since each processor has a built-in memory controller that has direct access to one or more banks of memory. CPUs are: Dual-Core AMD Opteron(tm) Processor 8220 CPU ”

    Cheers
    Riyaj

  10. Carol D said

    Riyaj,
    Could you have hit Bug 7171446 for 10.2.0.4 systems with NUMA enabled by default? Just wondering? And this was a great blog post!
    Thanks, Carol

  11. Hi Carol

    Thanks for reading my blog and for your kind words, and thank you for pointing out that bug. Our problem (early September) actually predates that bug's date (23-OCT-2008), and I didn't see the bug report earlier either. I am printing a summary of the bug below.

    Bug summary: In 10.2.0.4 NUMA features were enabled by default and this can cause issues in platforms/hardware without NUMA capability. Workaround is to disable NUMA.

    In this specific case, though, the hardware had NUMA capability, as shown in the lgrpinfo output; we were simply hitting a side effect of that capability.

    Cheers
    Riyaj

  12. lscheng said

    Hi

    Recently Oracle introduced a new “Best Practice” of turning off NUMA optimization. Maybe in future patchsets we will see these default values:

    _enable_NUMA_optimization=false
    _db_block_numa=1

    • Hello Lscheng
      Can you post the URL or document id referring to those best practices here, please? I am trying to understand why such an important feature [at least on high-end servers] would need to be completely turned off.

      Cheers
      Riyaj
