In-memory pre-population speed

Posted by Riyaj Shamsudeen on October 6, 2014

While presenting at Oaktable World 2014 in San Fransisco, I discussed the in-memory pre-population speed and indicated that it takes about 30 minutes to 1 hour to load ~300GB of tables. Someone asked me “Why?” and that was a fair question. So, I profiled the in-memory pre-population at startup.

Profiling methods

I profiled all in-memory worker sessions using Tanel’s snapper script and also profiled the processes in OS using Linux perf tool with 99Hz sample rate. As there is no other activity in the database server, it is okay to sample everything in the server. Snapper output will indicate where the time is spent; if the time is spent executing in CPU, then the perf report output will tell us the function call stack executing at that CPU cycle. Data from these two profiling methods will help us to understand the root cause of slowness.

@snapper.sql out,gather=stw 600 10 “select sid from v$session where program like ‘%W00%'”
Perf tool : perf record -F 99 -u oracle -g sleep 3600

Discussion of the output

In the snapper output printed below, session with sid=17 spent about 89% of the time on executing in the CPU. Essentially, most of the time spent on executing CPU itself, not on any other wait event. So, the output of command ‘perf report’ will be useful to understand the function call consuming CPU cycles.

snapim2.txt:
   SID, USERNAME  , TYPE, STATISTIC                               ,         DELTA, HDELTA/SEC,    %TIME, GRAPH
     15, (W003)    , TIME, background IM prepopulation cpu time    ,     511933187,   850.39ms,    85.0%, |@@@@@@@@@ |
     17, (W009)    , TIME, background IM prepopulation cpu time    ,     537411293,   892.71ms,    89.3%, |@@@@@@@@@ |
...
--  End of snap 1, end=2014-10-01 08:48:29, seconds=602

In the output below, 17% of the time spent using the CPU and 85.4% of time spent on single block reads( there is 2% over counting here). This split makes sense since if the process is waiting for I/O, then its CPU usage will be minimal.

    184, (W006)    , TIME, background IM prepopulation elapsed time,     614868987,      1.02s,   102.1%, |@@@@@@@@@@|
    184, (W006)    , TIME, background cpu time                     ,     105862907,   175.85ms,    17.6%, |@@        |
    184, (W006)    , TIME, background IM prepopulation cpu time    ,     105878904,   175.88ms,    17.6%, |@@        |
    184, (W006)    , TIME, background elapsed time                 ,     614853823,      1.02s,   102.1%, |@@@@@@@@@@|
    184, (W006)    , WAIT, db file sequential read                 ,     514181321,   854.12ms,    85.4%, |@@@@@@@@@ |

Visualization

After an hour after the database startup we have sufficient data to analyze. Using R language script and a bit of UNIX awk scripts, we can format and visualize the data. I cheated and extracted just relevant section from the snapper output using egrep. BTW, snapim.txt is the spooled output of snapper script. First egrep command captures ‘IM prepopulation cpu time’ in a file named snapim2.txt and the second egrep command captures the percent of waits for ‘db file sequential reads’ wait event.

# Egrep to grep only specific lines.
egrep   'STATISTIC|background IM prepopulation cpu time|End of snap' snapim.txt >snapim2.txt
egrep   'STATISTIC|db file sequential |End of snap' snapim.txt >snapim3.txt

Now, we have just bare minimal data in these two files: snapim2.txt and snapim3.txt. It is easy to plot them as a scattered plot using R script. As we are only interested in the Y axis, we will use sequence values for the X axis. Y axis is the percent of time spent and X axis is sequence values.

# R script to read a csv file and plot them in x and y axis as a scattered plot.
w1<-read.csv(file='D:/Riyaj/my tech docs/presentations/OTW2014/snapim2.txt')
plot (seq(1, length(w1$"X.TIME")), w1$"X.TIME", xlab="Sample_N_Time", ylab="IM Prepopulation CPU time")
w2<-read.csv('D:/Riyaj/my tech docs/presentations/OTW2014/snapim3.txt')
plot (seq(1,length(w2$"X.TIME")), w2$"X.TIME", xlab="Sample_N_TIME", ylab="db file sequential read")

First picture is plotting the ‘background IM pre-population CPU time’ as a percentage of total time spent, with a snap interval of 600 seconds, 6 snaps in total. You can clearly see two groups: Initially, the processes were spending time executing in CPU. AS time goes by, processes were spending less CPU time. As we just grep’ed only the relevant lines from snapper output, the timeline goes from left to right too.

Following picture is plotting the ‘db file sequential reads’ as percentage of total time spent. Again, initially, waits for single block reads are minimal and as time goes by, more and more sessions are waiting for single block reads. These two statistics are reverse correlating strongly. Reviewing the output manually, if there is huge ‘db file sequential reads’ waits, I see that much of the time spent on ‘fetch continued row’ statistics indicating huge amount of row chaining.

Analysis of perf output

Since, we know that CPU usage was higher in the initial 30 minutes of pre-population, we need to identify the functions consuming CPU. Again, a bit of awk and grep: Essentially, following code is equivalent of “sum(percent) group by function call”, as the perf tool captures granular data.

# perf report --stdio > perf_report_inmem.txt
# grep oracle perf_report_inmem.txt | awk '{print $5" "$1 }'|sort|sed 's/%//'|grep -v '0.0' |awk '{sum[$1]+=$2} END {for (j in sum) print sum[j], j}' |sort -k1nr 
8.57 kdzu_basic_insert
6.92 kdzu_csb_compare_fast
6.22 kdzu_csb_node_bsearch
4.59 kdzu_dict_insert
4.45 BZ2_decompress
4.1 __intel_ssse3_rep_memcpy
2.39 kdzu_csb_search
2.38 kdzu_basic_minmax
2.14 kdzcbuffer_basic
2.02 kdzu_get_next_entry_from_basic
1.38 unRLE_obuf_to_output_FAST
1.24 skswrEqual
0.59 kdzu_dict_create_from_basic
0.55 _intel_fast_memcmp

Functions consuming most CPU cycles is associated with kudzu oracle module. kdzu code module is to populate the in-memory column store using query high compression. You can also see 4.5% of time spent on BZ2_decompress, a function associated with HCC archive high decompression. While I am positive that there may be some opportunities to tune these Oracle functions (yes, you guessed it right, Oracle developers have to do that), there isn’t a smoking gun function. Time is spread around many function calls. Annotating few functions further, I see that, arithmetic instructions such as addl and sub are consuming time, I guess, offloading math processing will help, not sure.

Summary

In summary, we looked deep in to the performance of in-memory pre-population using various tools at our disposal. We can probably derive the following facts:

In-memory follows the chained rows. So, you should check for chained rows before converting to in-memory column store. Chained rows slow down the in-memory population dramatically.
CPU usage will be higher during pre-population, but no function calls are standing out as an issue. Hopefully, Oracle development will profile their code to tune their functions further. May be, some assembly language expert might need to review the instructions.
If the tables are compressed heavily, you should expect CPU usage to decompress from HCC and compress to memcompress.

This entry was posted on October 6, 2014 at 6:38 pm and is filed under 12c, in-memory, inmemory, Oracle database internals, Performance tuning, Presentations. Tagged: 12c in-memory, in-memory, kdzu, perf record, perf report, pre-population cpu time, pre-population speed, snapper. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

One Response to “In-memory pre-population speed”

ITS said

December 8, 2018 at 4:05 am
There’s definately a great deal to find out about this issue.

I love all the points you’ve made.

Reply

Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internal & E-business suite.

Categories

Archives

Click image for details:

Links for training, scripts etc

Blogroll oracle

Blogroll