After reading my earlier blog entry about tuning the log file sync wait event, a reader sent me an interesting issue that is worth blogging about.
Problem
Excessive wait time for the ‘log file sync’ event, while wait time for the ‘log file parallel write’ event is minimal. See the Statspack top events below:
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                   % Total
Event                                         Waits     Time (s)   Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time                                                    18,435     31.06
PL/SQL lock timer                                4,370      12,137     20.45
log file sync                                   57,939       8,565     14.43  <-----
db file sequential read                      1,303,477       7,030     11.85
db file scattered read                         505,147       3,992      6.73

                                                                   Avg
                                                     Total Wait   wait    Waits
Event                               Waits   Timeouts   Time (s)   (ms)     /txn
---------------------------- ------------ ---------- ---------- ------ --------
...
log file sync                      57,939      6,571      8,565    148      0.5
...
log file parallel write           215,112          0      1,138      5      1.9
...
The average wait time for the ‘log file sync’ event is 148ms, while it is only 5ms for the ‘log file parallel write’ event. There is an oddity here: the wait time for ‘log file sync’ seems excessive compared to ‘log file parallel write’. The ratio between the number of ‘log file parallel write’ waits and ‘log file sync’ waits is approximately 4, i.e. roughly four log writes per commit. In an average case scenario, if each commit resulted in 4 log writes, I would expect an average ‘log file sync’ wait time of about 20ms (4 x 5ms). A wait time of 148ms? That needs further analysis.
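A quick back-of-the-envelope check of that reasoning, using the numbers from the Statspack report above (any calculator will do; bc is shown here purely for illustration):

$ echo 'scale=2; 215112 / 57939' | bc   # log file parallel write waits / log file sync waits: ~3.71 redo writes per commit
3.71
$ echo '4 * 5' | bc                     # expected avg 'log file sync' wait (ms) if redo write I/O were the whole cost
20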
pstack, pmap and more
The database version is 9.2.0.8, running on a Solaris 10 server. Since this is a Solaris operating environment, we can use the proc utilities (pstack, pmap, etc.) to debug this further. We used a script to dump the pstack and pmap of LGWR in a loop. This wasn’t much help by itself, but we could guess that LGWR was waiting for I/O, since there were many lines in the stack with kaio and aio wait calls. A few lines from the pstack output are printed below.
6924:   ora_lgwr_ERPPRD2
-----------------  lwp# 1 / thread# 1  --------------------
 ffffffff7ddd4838 semsys   (4, 3, ffffffff7fffd5cc, 1, ffffffff7fffd5b8)
 00000001005b2694 ksliwat (0, 3c0, 77e8a7790, 1037c0b30, 3, 0) + 274
..
-----------------  lwp# 2 / thread# 2  --------------------
 ffffffff7ddd3a34 kaio     (6, 0, ffffffff7def7f80, 10, ffffffff7def9c18, ffffffff7f700a00)
-----------------  lwp# 3 / thread# 3  --------------------
 ffffffff7ddd2f1c lwp_park (0, 0, 0)
 ffffffff7ddcc748 cond_wait_queue (1038e5a70, 1038e5a80, 0, 0, ffffffff7def9d08, 0) + 28
 ffffffff7ddccca8 cond_wait (1038e5a70, 1038e5a80, 0, 0, 1c00, 0) + 10
 ffffffff7db03a60 _aio_idle (1038e5a00, 1, 1038e5a80, 0, ffffffff7def9c18, 0) + 28
 ffffffff7db034fc _aio_send_sigev (1038e5a00, 0, 104b60, ffffffff7ddd2cc8, ffffffff7db03498, ffffffff7dc08000) + 64
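For reference, here is a minimal sketch of the kind of loop we used; the PID, sleep interval and output file below are illustrative, not the exact script we ran:

#!/bin/ksh
# Sketch: periodically dump the stack and address map of LGWR.
# Replace 6924 with the PID of the ora_lgwr_<SID> process on your system.
LGWR_PID=6924
while true
do
        date              >> /tmp/lgwr_watch.out
        pstack $LGWR_PID  >> /tmp/lgwr_watch.out 2>&1    # user stack of each LWP
        pmap -x $LGWR_PID >> /tmp/lgwr_watch.out 2>&1    # address space map
        sleep 5
done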
dtruss
I/O statistics did not provide much clarity either. We needed to find out whether LGWR itself was suffering from any OS-related issues, and for that we needed to trace its system calls; the truss utility provides such a facility. But, suffering from truss-related paranoia, we did not want to run truss against LGWR, since that can affect performance and, more importantly, database stability.
Fortunately, dtruss came in handy. dtruss is based on the dtrace utility and, by design, dtrace is safe on the Solaris SPARC platform.
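As an illustration only (the exact flags depend on the dtruss version installed, for example the one shipped with the DTraceToolkit), attaching dtruss to the LGWR process looks roughly like this:

# run as root; -p attaches to an existing PID, -e prints elapsed time per system call
./dtruss -e -p 6924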