Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internal & E-business suite.

Performance tuning: HugePages and Linux

Posted by Riyaj Shamsudeen on November 13, 2008

Recently we quickly and efficiently resolved a major performance issue with one of our New York clients. In this blog, I will discuss about this performance issue and its solution.

Problem statement

The client’s central database was intermittently freezing because of high CPU usage, and their business severely affected. They had already worked with vendor support and the problem was still unresolved.

Symptoms

Intermittent High Kernel mode CPU usage was the symptom. The server hardware was 4 dual-core CPUs, hyperthreading enabled, with 20GB of RAM, running a Red Hat Linux OS with a 2.6 kernel.

During this database freeze, all CPUs were using kernel mode and the database was almost unusable. Even log-ins and simple SQL such as SELECT * from DUAL; took a few seconds to complete. A review of the AWR report did not help much, as expected, since the problem was outside the database.

Analyzing the situation, collecting system activity reporter (sar) data, we could see that at 08:32 and then at 8:40, CPU usage in kernel mode was almost at 70%. It is also interesting to note that, SADC (sar data collection) also suffered from this CPU spike, since SAR collection at 8:30 completed two minutes later at 8:32, as shown below.

A similar issue repeated at 10:50AM:

07:20:01 AM CPU   %user     %nice   %system   %iowait     %idle
07:30:01 AM all    4.85      0.00     77.40      4.18     13.58
07:40:01 AM all   16.44      0.00      2.11     22.21     59.24
07:50:01 AM all   23.15      0.00      2.00     21.53     53.32
08:00:01 AM all   30.16      0.00      2.55     15.87     51.41
08:10:01 AM all   32.86      0.00      3.08     13.77     50.29
08:20:01 AM all   27.94      0.00      2.07     12.00     58.00
08:32:50 AM all   25.97      0.00     25.42     10.73     37.88 <--
08:40:02 AM all   16.40      0.00     69.21      4.11     10.29 <--
08:50:01 AM all   35.82      0.00      2.10     12.76     49.32
09:00:01 AM all   35.46      0.00      1.86      9.46     53.22
09:10:01 AM all   31.86      0.00      2.71     14.12     51.31
09:20:01 AM all   26.97      0.00      2.19      8.14     62.70
09:30:02 AM all   29.56      0.00      3.02     16.00     51.41
09:40:01 AM all   29.32      0.00      2.62     13.43     54.62
09:50:01 AM all   21.57      0.00      2.23     10.32     65.88
10:00:01 AM all   16.93      0.00      3.59     14.55     64.92
10:10:01 AM all   11.07      0.00     71.88      8.21      8.84
10:30:01 AM all   43.66      0.00      3.34     13.80     39.20
10:41:54 AM all   38.15      0.00     17.54     11.68     32.63 <--
10:50:01 AM all   16.05      0.00     66.59      5.38     11.98 <--
11:00:01 AM all   39.81      0.00      2.99     12.36     44.85

Performance forensic analysis

The client had access to a few tools, none of which were very effective. We knew that there is excessive kernel mode CPU usage. To understand why, we need to look at various metrics at 8:40 and 10:10.

Fortunately, sar data was handy. Looking at free memory, we saw something odd. At 8:32, free memory was 86MB; at 8:40 free memory climbed up to 1.1GB. At 10:50 AM free memory went from 78MB to 4.7GB. So, within a range of ten minutes, free memory climbed up to 4.7GB.

07:40:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached
07:50:01 AM    225968  20323044     98.90    173900   7151144
08:00:01 AM    206688  20342324     98.99    127600   7084496
08:10:01 AM    214152  20334860     98.96    109728   7055032
08:20:01 AM    209920  20339092     98.98     21268   7056184
08:32:50 AM     86176  20462836     99.58      8240   7040608
08:40:02 AM   1157520  19391492     94.37     79096   7012752
08:50:01 AM   1523808  19025204     92.58    158044   7095076
09:00:01 AM    775916  19773096     96.22    187108   7116308
09:10:01 AM    430100  20118912     97.91    218716   7129248
09:20:01 AM    159700  20389312     99.22    239460   7124080
09:30:02 AM    265184  20283828     98.71    126508   7090432
10:41:54 AM     78588  20470424     99.62      4092   6962732  <--
10:50:01 AM   4787684  15761328     76.70     77400   6878012  <--
11:00:01 AM   2636892  17912120     87.17    143780   6990176
11:10:01 AM   1471236  19077776     92.84    186540   7041712

This tells us that there is a correlation between this CPU usage and the increase in free memory. If free memory goes from 78MB to 4.7GB, then the paging and swapping daemons must be working very hard. Of course, releasing 4.7GB of memory to the free pool will sharply increase paging/swapping activity, leading to massive increase in kernel
mode CPU usage. This can lead to massive kernel mode CPU usage.

Most likely, much of SGA pages also can be paged out, since SGA is not locked in memory.

Memory breakdown

The client’s question was, if paging/swapping is indeed the issue, then what is using all my memory? It’s a 20GB server, SGA size is 10GB and no other application is running. It gets a few hundred connections at a time, and PGA_aggregated_target is set to 2GB. So why would it be suffering from memory starvation? If memory is the issue, how can there be 4.7GB of free memory at 10:50AM?

Recent OS architectures are designed to use all available memory. Therefore, paging daemons doesn’t wake up until free memory falls below a certain threshold. It’s possible for the free memory to drop near zero and then climb up quickly as the paging/swapping daemon starts to work harder and harder. This explains why free memory went down to 78MB and rose to 4.7GB 10 minutes later.

What is using my memory though? /proc/meminfo is useful in understanding that, and it shows that the pagetable size is 5GB. How interesting!

Essentially, pagetable is a mapping mechanism between virtual and physical address. For a default OS Page size of 4KB and a SGA size of 10GB, there will be 2.6 Million OS pages just for SGA alone. (Read wikipedia’s entry on page table for more information about page tables.) On this server, there will be 5 million OS pages for 20GB total memory. It will be an enormous workload for the paging/swapping daemon to manage all these pages.

cat /proc/meminfo

MemTotal:     20549012 kB
MemFree:        236668 kB
Buffers:         77800 kB
Cached:        7189572 kB
...
PageTables:    5007924 kB  <--- 5GB!
...
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

HugePages

Fortunately, we can use HugePages in this version of Linux. There are couple of important benefits of HugePages:

  1. Page size is set 2MB instead of 4KB
  2. Memory used by HugePages is locked and cannot be paged out.

With a pagesize of 2MB, 10GB SGA will have only 5000 pages compared to 2.6 million pages without HugePages. This will drastically reduce the page table size. Also, HugeTable memory is locked and so SGA can’t be swapped out. The working set of buffers for the paging/swapping daemon will be smaller.

To setup HugePages, the following changes must be completed:

  1. Set the vm.nr_hugepages kernel parameter to a suitable value. In this case, we decided to use 12GB and set the parameter to 6144 (6144*2M=12GB). You can run:
    echo 6144 > /proc/sys/vm/nr_hugepages

    or

    sysctl -w vm.nr_hugepages=6144

    Of course, you must make sure this set across reboots too.

  2. The oracle userid needs to be able to lock a greater amount of memory. So, /etc/security/limits.conf must be updated to increase soft and hard memlock values for oracle userid.
    
    oracle          soft    memlock        12582912
    oracle          hard   memlock        12582912
    

After setting this up, we need to make sure that SGA is indeed using HugePages. The value, (HugePages_Total- HugePages_Free)*2MB will be the approximate size of SGA (or it will equal the shared memory segment shown in the output of ipcs -ma).

cat /proc/meminfo |grep HugePages
HugePages_Total:  6144
HugePages_Free:   1655 <-- Free pages are less than total pages.
Hugepagesize:     2048 kB

Summary

Using HugePages resolved our client’s performance issues. The PageTable size also went down to a few hundred MB. If your database is running in Linux and has HugePages capability, there is no reason not to use it.

This can be read in a presentation format at Investigations: Performance and hugepages (PDF). See also our hugepages tag.

Update 1: Corrected the typo in the path for limits.conf after a Reader's observation.

8 Responses to “Performance tuning: HugePages and Linux”

  1. Vivek said

    Hi Riyaz,

    A nice one. Is it applicable for IBM AIX as well ? I have a 2 node RAC on IBM AIX and one of the node evicts during heavy pagins and page outs.

    Regards
    Vivek

  2. Hi Vivek
    Paging/swapping is applicable to IBM AIX as well. But, hugepages may not be directly applicable to AIX.
    Nevertheless, try to understand why there is memory starvation and resolve it.

    Cheers
    Riyaj

  3. Hi Riyaj,

    cat /proc/meminfo |grep HugePages
    HugePages_Total: 6144
    HugePages_Free: 1655 <– Free pages are less than total pages.
    Hugepagesize: 2048 kB

    Keeping HugePages free ( in this example – 1655 – ~3.2GB) would result in physical memory being locked up and not available outside of the Oracle Shared memory realm. If not planning on dynamically resizing the SGA beyond what is set initially, it probably would not make sense to leave so many free huge pages.

    Since huge pages are locked in memory, they would not get swapped/paged out. And if huge pages are sized more than required, it can result in more aggresive swapping/paging since the total available physical memory for PGA and other OS needs would be reduced.

    Thanks
    Krishna Manoharan

  4. lscheng said

    I had a customer using 11gR1 in Linux x86-64 platform and automatic memory management feature went through some memory starvation problems (kswapd went crazy freeing memory) a couple of months ago and in order to implement Hugepages AMM had to be turned off. I am not sure if this would be fixed in the future because looks like AMM implementation cannot make use of hugepages.

  5. Lynn Conrad said

    What was swappiness set too, on aix 5.3 my standard is lru_file_repage=0, memory_affinity=0, v_pinshm=0, maxpin%=94 chuser oracle capabilites=CAP_BYPASS_RAC_VMM, CAP_PROPAGATE, minperm%=5 maxperm%=90 db init parameter lock_sga=true

  6. [...] from Riyaz Shamshuddin quite identical to the issue we had faced. I would just link it here – Performance tuning: HugePages and Linux GA_googleAddAttr("AdOpt", "1"); GA_googleAddAttr("Origin", "other"); [...]

  7. Jigar Doshi said

    Minor correction. I think it should be /etc/security/limits.conf instead of /etc/securities/limits.conf. Please correct

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.

Join 191 other followers

%d bloggers like this: