Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internal & E-business suite.

Data visualization, px qref waits, and a kernel bug!

Posted by Riyaj Shamsudeen on July 2, 2014

Data visualization is a useful method to identify performance patterns. In most cases, I pull custom performance metrics from AWR repository and use tableau to visualize the data. Of course, you can do the visualization using excel spreadsheet too.

Problem definition
We had huge amount of PX qref waits in a database:

                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
PX qref latch                    63,669,198 3341       0   34.2 Other
DB CPU                                      3031           31.1
direct path read temp                92,996 199.       2    2.0 User I/O
direct path write temp               12,029 109.       9    1.1 User I/O
direct path read                      5,454 61.1      11     .6 User I/O

PX qref related waits indicates that producers are generating data faster than the consumers can consume, in a parallel execution. So, producers are pausing until the consumers can catch up.

When did this issue start?

Unfortunately, a few things happened during an interval of about 2 days, such as node restarts, upgrade to, code change etc. Obviously, upgrade seems to be an obvious culprit, but, Is it? I customized my AWR repository script awr_system_events.sql for the above wait events. Zip file of the script is awr_system_events. (set column separator in the sqlplus session before running the script as per your wish).

Loading the data to tableau software and graphing the waits, you can see clearly that there is a marked increase in waits. Further, data shows that there is a strong correlation of the problem, waits increased abruptly after the restart of node. After all, the database upgrade or code change itself is not the culprit.


Why would a node restart cause this issue?

Researching further, I didn’t see any obvious issues. Node was running without any restart for an year or so, no kernel changes, no kernel parameter changes etc. Of course, there were comments from support that we have too many PX servers etc, but we have been running in this configuration for a while, why would this problem start after a node restart?. Let’s review OS statistics, vmstat output first:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free     buff  cache     si   so   bi    bo   in   cs    us sy id wa st
17  0      0 183440192 146848 1419528    0    0     0   150 17487 28314 12  2 86  0  0
49  0      0 183439728 146860 1419664    0    0     0   114 11105 14789  5  2 94  0  0 <---
16  0      0 183431424 146884 1420128    0    0     0   133 12700 20679  8  2 90  0  0
20  0      0 183361616 146900 1420208    0    0     0   153 9606 11645  3  2 95  0  0
 6  0      0 182956832 146912 1420316    0    0     0    76 10323 14467  4  2 94  0  0
 9  0      0 182938752 146920 1421012    0    0    26   177 17318 24782 12  3 85  0  0
21  0      0 182935552 146928 1421132    0    0     0   509 10556 11367  4  2 94  0  0

Do you see any obvious issues in the vmstat output? Review the line where the r column (running+runnable) is 49. Even though, there are 49 processes in the queue, notice that 94% of the processing time is idle in that one second time slice. Why would there be 49 processes in the queue, if the processors are idle? Let’s review mpstat output:

12:45:08 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:46:38 PM  all    9.54    0.00    1.61    0.32    0.00    0.11    0.00   88.43  12116.40
12:46:43 PM  all   30.31    0.00    2.20    0.00    0.00    0.44    0.00   67.06  22066.33
12:46:48 PM  all   19.12    0.00    2.01    0.00    0.00    0.26    0.00   78.62  16903.80
12:46:53 PM  all   15.03    0.00    1.92    0.00    0.00    0.19    0.00   82.86  14318.73

Unfortunately, OSWatcher is not granular enough to give us CPU level stats, however, we can confirm that indeed, about 80% of CPUs are idle. Mystery deepens!

So, what gives?

We guessed that there must be some relevance to the Linux task scheduler: since there many idle CPUs and that numerous processes in the run queue waiting to be scheduled. Fortunately, recent kernels provide a mechanism to review the state of the Linux task schedulers. cat /proc/sched_debug output is showing something interesting. Can you spot it? For brevity, output is heavily edited and shown partially.

Sched Debug Version: v0.09, 2.6.32-400.21.1.el5uek #1
now at 159204140.576949 msecs
cpu#0, 2893.155 MHz
  .nr_running                    : 18
runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
          oracle 27983  27439528.859604     12305   120  27439528.859604     23554.634224 1261008291037.725263 /
          oracle 36518  27441135.821233      1237   120  27441135.821233     22483.062837 180144048941.261644 /
R         oracle 36522  27439532.427889      3201   120  27439532.427889     21032.542453 414331241620.414187 /
          oracle 36526  27439533.339628      1846   120  27439533.339628     17964.739415 234187242151.672383 /
cpu#1, 2893.155 MHz
  .nr_running                    : 0
runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
cpu#2, 2893.155 MHz
  .nr_running                    : 0

Notice that .nr_running for CPU#0 is 18, but for all other CPUs .nr_running is either 0 or 1, meaning that, for some reason, the Linux task scheduler is scheduling all tasks in CPU#0 only. So, CPU#0 is busy, run queue is much longer in that CPU, while all other CPUs are idle. So, PX server processes are suffering from CPU starvation as they are waiting in the queue, the consumers not able to consume the messages quick enough which leads to enormous amount of waits for PX qref related waits.

In Summary, we were able to visualize the data, avoid falling for the trap of blaming the upgrade, and able to identify the root cause of the problem scientifically. The root cause boiled down to a bug in the kernel 2.6.32-400.21.1.el5uek. This bug doesn’t act all the time though. We upgraded the Exadata environment to which resolved the problem completely.

To be perfectly clear, PX qref latch waits can happen for true CPU starvation too. So, do not assume that you are encountering the bug just because there are waits for PX qref latch waits. Second, I am not implying that every one should upgrade the kernel immediately, however, if you encounter this issue, you just might have to upgrade :-)

Update 1: Uploading new scripts after Coskan’s comments.

5 Responses to “Data visualization, px qref waits, and a kernel bug!”

  1. dalinqin said

    excellent job. by the way I assume you are setting “CONFIG_SCHED_DEBUG=y” to enable the schedule debug , am I correct?

    • Yes, config_sched_debug must be set to Y. However, in Exadata platform, the default is ‘Y’. I checked few other servers and looks like the /proc/sched_debug is updated. I am inclined to think, config_sched_debug is Y, by default, in other distributions too, however, I am not sure.

  2. santosh said

    These analysis are really great. These analysis definetly help in changing the thinking feather in few more direction if someone came across the same issue.

  3. coskan said


    Great analysis, can you please share what is the purpose of table sys.WRH$_SYSTEM_EVENT_BL in awr_px_qref_latch.

    Why not just dba_hist_system_event but union of sys.WRH$_SYSTEM_EVEN and sys.WRH$_SYSTEM_EVENT_BL

    • Hello Coskan
      Thanks for reading my blog.
      As you know, _BL tables stores baselines. In a few customer databases, we use baselines aggressively to keep the AWR history for a longer period and reduce sysaux tablespace usage. So, that’s the reason for combining both history and baselines to visualize the data for a bigger time window.
      Visualization of script output will work fine if there are no baselines or if there are numerous baselines. However, as you are pointing out, visualization might be affected if there are a few baselines in the AWR tables, as the time interval will not be uniform in one of the Axis. I will upload a zip containing two scripts, one querying dba_hist_system_event and another querying the union of both history and baselines. Hopefully, that will be useful for someone in the future.
      Of course, let me test the scripts first :)


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Get every new post delivered to your Inbox.

Join 193 other followers

%d bloggers like this: