Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internals & E-Business Suite.

Tuning ‘log file sync’ wait events

Posted by Riyaj Shamsudeen on July 7, 2008

In this blog entry, we will discuss strategies and techniques to resolve 'log file sync' waits. This entry is intended to show an approach based upon scientific principles, not necessarily a step-by-step guide. First, let's understand how LGWR is involved in implementing the commit mechanism.

Commit mechanism and LGWR internals

At commit time, the user process creates a redo record [containing commit opcodes] and copies that redo record into the log buffer. Then that process signals LGWR to write the contents of the log buffer. LGWR writes from the log buffer to the log file and signals the user process back, completing the commit. The commit is considered successful only after the LGWR write completes.

Of course, there are minor deviations from this general concept, such as latching, commits from PL/SQL blocks, or IMU-based commit generation, but the general philosophy remains the same.
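
The commit-side statistics can be observed from the committing session itself. Below is a minimal sketch (not from the original post) using v$mystat and v$statname; run it before and after a commit and compare the deltas. The statistic names are standard, but exact availability can vary by version.

-- Deltas between two runs should show one 'user commits' and at least
-- one 'redo synch writes' for each commit issued by this session.
select n.name, s.value
from   v$mystat s, v$statname n
where  s.statistic# = n.statistic#
and    n.name in ('user commits', 'redo synch writes', 'redo entries', 'redo size')
order  by n.name;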

Signals, semaphore and LGWR

The following section introduces the internal workings of the commit/LGWR interaction on UNIX platforms. There are minor implementation differences between UNIX flavors and on platforms such as NT/XP (which use post-wait drivers instead of semaphores, for example). This section is meant to introduce the internals, not necessarily to dive deep into them. truss was used to trace LGWR and the user process for the explanation below.

truss command used: truss -rall -wall -fall -vall -d -o /tmp/truss.log -p 22459

[A word of caution: don't truss LGWR or any background process unless it is absolutely necessary. You can accidentally cause performance issues or, worse yet, shut down the database.]

1. Initially, LGWR is sleeping on a semaphore, using a semtimedop or semop call.

22459/1: semtimedop(9, 0xFFFFFD7FFFDFE648, 1, 0xFFFFFD7FFFDFE488) EAGAIN
22459/1:         semnum=15    semop=-1    semflg=0
22459/1:         timeout: 2.060000000 sec
    In the above call,
     9 is the semaphore set id, visible through the ipcs command, and semnum=15 is the semaphore for the LGWR process in that set.
     The next argument is a sembuf structure:
	{ unsigned short sem_num;  /* semaphore number */
	  short          sem_op;   /* semaphore operation */
	  short          sem_flg;  /* operation flags */
	 }
	The third argument is the number of semaphores.

2. When a session commits, a redo record is created and copied into the log buffer. Then that process posts the LGWR semaphore using a semctl call, if LGWR is not already active. The process then goes to sleep on its own semaphore with a semtimedop call. The semaphore set id is still 9, but semnum is 118, which is the semaphore for the user process I was tracing.

The first semctl call posts LGWR. Then the process sleeps on a semtimedop call.

27396/1: semctl(9, 15, SETVAL, 1)        = 0
..
27396/1:semtimedop(9, 0xFFFFFD7FFFDFC128, 1, 0xFFFFFD7FFFDFBF68)=0
27396/1:        semnum=118   semop=-1    semflg=0
27396/1:        timeout: 1.000000000 sec

3. The waiting log writer gets a return code of 0 from semtimedop and writes the redo records to the current redo log file. The kaio calls are kernelized asynchronous I/O calls on the Solaris platform.

22459/7: pwrite(262, "01 "9E0E i ?".., 1024, 1915904) = 1024
22459/9: pwrite(263, "01 "9E0E i ?".., 1024, 1915904) = 1024
22459/1: kaio(AIOWAIT, 0xFFFFFD7FFFDFE310)               = 1
22459/1:         timeout: 600.000000 sec
22459/9: kaio(AIONOTIFY, 0)                         = 0
22459/7: kaio(AIONOTIFY, 0)                         = 0

4. After successful completion of the write(s), LGWR posts the semaphore of the waiting process using a semctl call.

22459/1: semctl(9, 118, SETVAL, 1)                   = 0

5. The user process/session continues after receiving a return code from its semtimedop call, reprinted below.

27396/1:semtimedop(9,0xFFFFFD7FFFDFC128,1,0xFFFFFD7FFFDFBF68)=0

So, what exactly is a 'log file sync' wait?

A commit is not complete until LGWR writes the log buffer contents, including the commit redo record, to the log files. In a nutshell, after posting LGWR to write, the user or background process waits for LGWR to signal back, with a 1-second timeout. The user process charges this wait time as the 'log file sync' event.

In the prior section, the 'log file sync' wait starts at step 2, right after the semctl call, and completes after step 5.
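
Before digging into root causes, a quick system-wide comparison of the commit-side wait ('log file sync') and the write-side wait ('log file parallel write') is often a useful first check. Below is a minimal sketch (not from the original post), assuming access to v$system_event; these counters are cumulative since instance startup, so deltas between two samples are more meaningful than the absolute values.

select event, total_waits, time_waited_micro,
       round(time_waited_micro / nullif(total_waits, 0)) avg_wait_us
from   v$system_event
where  event in ('log file sync', 'log file parallel write');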

Root causes of ‘log file sync’ waits

Root causes of 'log file sync' waits essentially boil down to a few scenarios, and the following is not an exhaustive list by any means!

1. LGWR is unable to complete writes fast enough for one of the following reasons:

(a) Disk I/O performance to the log files is not good enough. Even though LGWR can use asynchronous I/O, redo log files are opened with the DSYNC flag, and buffers must be flushed to disk (or, in the case of a SAN, at least written to the disk array cache) before LGWR can mark the commit as complete.

(b) LGWR is starving for CPU. If the server is very busy, LGWR can starve for CPU too. This leads to slower response from LGWR and increases 'log file sync' waits. After all, these system calls and I/O calls must use CPU. In this case, 'log file sync' is a secondary symptom, and resolving the root cause of the high CPU usage will reduce 'log file sync' waits.

(c) Due to memory starvation, LGWR can be paged out. This too can lead to slower response from LGWR.

(d) LGWR is unable to complete writes fast enough due to file system or UNIX buffer cache limitations.

2. LGWR is unable to post the processes fast enough, due to excessive commits. It is quite possible that there is no starvation for CPU or memory and that I/O performance is decent. Still, if there are excessive commits, LGWR has to perform many write and semctl calls, and this can increase 'log file sync' waits. It can also result in a sharp increase in the 'redo wastage' statistic. (A quick way to estimate the commit rate is shown in the sketch after this list.)

3. IMU undo/redo threads. With private strands, a process can generate a few megabytes of redo before committing. LGWR must write all of the redo generated so far, and processes must wait on 'log file sync', even if the redo generated by other processes is small.

4. LGWR is suffering from other database contention, such as enqueue waits or latch contention. For example, we have seen LGWR freeze due to CF enqueue contention. This is a possible, although unlikely, scenario.

5. Various bugs. Oh yes, there are bugs that introduce unnecessary 'log file sync' waits.
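
For root cause 2 above, the commit rate can be estimated directly from v$sysstat. Below is a minimal sketch (not from the original post): take two samples a known number of seconds apart and difference them manually, since the values are cumulative from instance startup.

-- (second sample - first sample) / elapsed seconds gives the per-second rate.
select name, value
from   v$sysstat
where  name in ('user commits', 'user rollbacks', 'redo size',
                'redo synch writes', 'redo wastage');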

Root cause analysis

It is worthwhile to identify and understand the root cause, and then resolve it.

1. First, make sure the 'log file sync' event is indeed a major wait event. For example, in the 60-minute statspack report below, 'log file sync' is indeed an issue. Why? The report covers 3,600 seconds and the server has 8 CPUs, so approximately 28,800 CPU seconds were available. There is just one database on this server, so approximate CPU usage is 7,034/28,800, about 25%.

But 27,021 seconds were spent waiting. On average, 27,021/3,600 = 7.5 processes were waiting on the 'log file sync' event. So, this is a major bottleneck for application scalability.

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                       % Total
Event                        Waits          Time (s) Ela Time
--------------------------- ------------    -------- --------
log file sync                1,350,499      27,021    50.04
db file sequential read      1,299,154      13,633    25.25
CPU time                                     7,034    13.03
io done                      3,487,217       3,225     5.97
latch free                     115,471       1,325     2.45

2. Identify and break down LGWR wait events by querying the wait events for the LGWR session. In this instance the LGWR SID is 3 (a query that locates LGWR's SID without hard-coding it is shown a little further below).

select sid, event, time_waited, time_waited_micro
from v$session_event where sid=3 order by 3
SQL> /

   SID EVENT                          TIME_WAITED TIME_WAITED_MICRO
------ ------------------------------ ----------- -----------------
..
     3 control file sequential read        237848  2378480750
     3 enqueue                             417032  4170323279
     3 control file parallel write         706539  7065393146
     3 log file parallel write             768628  7686282956
     3 io done                           40822748  4.0823E+11
     3 rdbms ipc message                208478598  2.0848E+12

When LGWR is waiting (using the semtimedop call) for posts from user sessions, that wait time is accounted to the 'rdbms ipc message' event. This event can normally be ignored. The next-highest wait is the 'io done' event. After submitting asynchronous I/O requests, LGWR waits until those I/O calls complete, since LGWR writes are effectively synchronous. [Asynchronous and synchronous are not contradictory terms when it comes to I/O; there is plenty of information about this available already.]

It is worth noting that v$session_event is a cumulative counter from instance startup and hence can be misleading. The difference between two snapshots from this view, for the same session, is far more useful.
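
Rather than hard-coding the SID, LGWR's session can be located by joining to v$session. Below is a minimal sketch (not from the original post), assuming the background process is identifiable by its program name, as it normally is.

select e.sid, e.event, e.time_waited, e.time_waited_micro
from   v$session_event e, v$session s
where  e.sid = s.sid
and    s.program like '%LGWR%'
order  by e.time_waited;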

Tanel Poder has written an excellent tool for this. Using that tool, we can print 1-second snapshots for LGWR session 3. Part of the output has been removed to improve clarity. From the table below, 813 milliseconds were spent waiting on the 'io done' event within a 1-second interval. That's 81%.

------------------------------------------------------------------
SID, SNAPSHOT START  ,SEC, TYPE, STATISTIC            ,    DELTA,
------------------------------------------------------------------
3, 20080513 11:44:32,   1, STAT, messages sent        ,        9,
3, 20080513 11:44:32,   1, STAT, messages received    ,      153,
3, 20080513 11:44:32,   1, STAT, redo wastage         ,    39648,
3, 20080513 11:44:32,   1, STAT, redo writes          ,      152,
3, 20080513 11:44:32,   1, STAT, redo blocks written  ,     1892,
3, 20080513 11:44:32,   1, STAT, redo write time      ,       82,
3, 20080513 11:44:32,   1, WAIT, rdbms ipc message    ,   169504,
3, 20080513 11:44:32,   1, WAIT, io done              ,   813238,
3, 20080513 11:44:32,   1, WAIT, log file parallel write ,  5421,
3, 20080513 11:44:32,   1, WAIT, LGWR wait for redo copy ,     1,

3. Confirm that LGWR is waiting on that event using SQL trace as well. Tracing LGWR can degrade performance further, so careful consideration must be given before turning SQL trace on for LGWR. Packages such as dbms_system/dbms_support, or oradebug, can be used to turn on event 10046 at level 12.
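
For reference, here is a hedged sketch (not from the original post) of enabling event 10046 at level 12 for LGWR from SQL*Plus using oradebug, connected as sysdba. The OS pid shown is the LGWR pid from the truss example above and must be replaced with the actual pid; again, do this only when strictly necessary and turn it off promptly.

-- Attach to LGWR by its OS pid (22459 is the pid from the truss example; substitute yours).
oradebug setospid 22459
oradebug event 10046 trace name context forever, level 12
-- ... capture a short interval while the problem reproduces ...
oradebug event 10046 trace name context off
-- Print the location of the trace file.
oradebug tracefile_name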

A few trace lines are shown below. In this specific case, LGWR is waiting for I/O events. The output shows that LGWR submitted two write calls covering 16 redo blocks. Waits for I/O completion are counted under the 'io done' event; between the two calls, LGWR waited about 1,600 microseconds, or 1.6 ms. The performance of the write itself is not entirely bad.

  WAIT #0: nam='rdbms ipc message' ela= 7604 p1=223 p2=0 p3=0
  WAIT #0: nam='log file parallel write' ela= 35 p1=2 p2=16 p3=2
  WAIT #0: nam='io done' ela= 0 p1=0 p2=0 p3=0
  WAIT #0: nam='io done' ela= 639 p1=0 p2=0 p3=0
  WAIT #0: nam='io done' ela= 0 p1=0 p2=0 p3=0
  WAIT #0: nam='io done' ela= 605 p1=0 p2=0 p3=0
  WAIT #0: nam='io done' ela= 1 p1=0 p2=0 p3=0
  WAIT #0: nam='io done' ela= 366 p1=0 p2=0 p3=0

4. Let's look at a few other statistics as well, taken from the statspack report (one column has been removed to improve readability).

Statistic                             Total     per Second
------------------------------- ----------- --------------
redo blocks written                 230,881        2,998.5
redo buffer allocation retries            0            0.0
redo entries                        285,803        3,711.7
redo log space requests                   0            0.0
redo log space wait time                  0            0.0
redo ordering marks                       0            0.0
redo size                       109,737,304    1,425,159.8
redo synch time                      40,744          529.1
redo synch writes                    38,141          495.3
redo wastage                      5,159,124       67,001.6
redo write time                       6,226           80.9
redo writer latching time                 4            0.1
user calls                          433,717        5,632.7
user commits                         38,135          495.3
user rollbacks                            1            0.0
workarea executions - multipass           0            0.0

a. From the above statistics, we can see that about 5,632 user calls were made and 495 commits were executed per second, on average.

b. Approximately 3,000 redo blocks (of size 512 bytes on Solaris) are written by LGWR per second, roughly 1.5 MB/sec.

c. About 1.4 MB/sec of redo was generated, which is approximately 11 Mbps.

d. For the 230,881 redo blocks written, there were 38,135 commits, i.e., roughly 6 redo blocks per commit.

Essentially, redo size is high, but not abnormal. However, 500 commits per second is on the high side.

5. Knowledge of the application will be useful here. If commit frequency is high, trace a few sessions to understand why there are so many commits (a query to pick candidate sessions follows the trace excerpt below). For example, in the following trace file there is a commit after *every* SQL statement, and that can be a root cause for these waits.

An XCTEND call with rlbk=0 and rd_only=0, as shown below, is a commit.

WAIT #120: nam='SQL*Net message from client' ela= 291 p1=1952673792 p2=1 p3=0
XCTEND rlbk=0, rd_only=0
WAIT #0: nam='log file sync' ela= 1331 p1=31455 p2=0 p3=0
WAIT #0: nam='SQL*Net message to client' ela= 1 p1=1952673792 p2=1 p3=0
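
To decide which sessions to trace, heavy committers can be identified from v$sesstat. Below is a minimal sketch (not from the original post); the values are cumulative since logon, so deltas between two samples are more reliable.

select *
from ( select s.sid, s.program, t.value user_commits
       from   v$sesstat t, v$statname n, v$session s
       where  t.statistic# = n.statistic#
       and    n.name = 'user commits'
       and    s.sid = t.sid
       order  by t.value desc )
where rownum <= 10;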

6. If the commit rate is high, then LGWR could essentially be spending its time signalling user processes. Just like any other process, the UNIX scheduler can kick LGWR off the CPU, and this can cause foreground processes to wait on the 'log file sync' event. On Solaris, prstat microstate accounting (-mL) is quite useful in breaking down how much time LGWR spends waiting for CPU.

7. Although uncommon, there are a few bugs that cause unnecessary 'log file sync' waits, for example by signalling incorrect processes.

8. In a few scenarios, 'log file sync' waits are intermittent and hard to catch, and statistics might need to be captured with more granularity to understand the issue. For example, in the case below, the instance froze intermittently, so we wrote a small script to gather statistics from v$sysstat every 20 seconds and spool them to a file (a sketch of such a script appears after the interpretation below). We were also collecting iostat output using the Oracle-supplied tool LTOM.

23-MAR-2007-04:29:43:Redo blocks written:1564176614:Delta: 8253
23-MAR-2007-04:30:14:Redo blocks written:1564176614:Delta:    0
23-MAR-2007-04:30:44:Redo blocks written:1564205771:Delta:29157

Between 4:29:43 and 4:30:14 the 'redo blocks written' statistic did not change, since the delta is 0; that is, LGWR did not write any redo blocks. But the 'redo size' statistic grew by roughly 7 MB during the same period, and the number of processes waiting on the 'log file sync' event increased to 100. So we know that the freeze occurred because LGWR wasn't able to write any blocks during this time frame.
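
For reference, a minimal sketch of such a collection script (not the original), assuming SQL*Plus access and execute privilege on dbms_lock; the output format mirrors the lines shown above. dbms_output buffers the lines and prints them when the block finishes, so spool the session to a file.

set serveroutput on size 1000000
declare
  v_prev number := 0;
  v_curr number;
begin
  for i in 1 .. 10 loop                    -- 10 samples, 20 seconds apart
    select value into v_curr
    from   v$sysstat
    where  name = 'redo blocks written';
    dbms_output.put_line(to_char(sysdate, 'DD-MON-YYYY-HH24:MI:SS') ||
                         ':Redo blocks written:' || v_curr ||
                         ':Delta:' || (v_curr - v_prev));
    v_prev := v_curr;                      -- the very first delta is just the cumulative value
    dbms_lock.sleep(20);
  end loop;
end;
/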

In a RAC cluster this problem is magnified, since both CR- and current-mode buffers need a log flush to complete before those blocks can be transferred to another instance's cache.

Looking closely at the I/O statistics, we could see that the average service time was quite high for a few devices during that period (we later mapped these devices back to the log file systems). This led us to a more granular time frame, and finally it turned out to be a hardware switch issue that was intermittently freezing I/O.

extended device statistics

r/s  w/s kr/s kw/s wait actv wsvc_t asvc_t %w  %b device
0.0  0.0  0.0  0.0  0.0  1.0    0.0    0.0  0 100 d6
0.0  0.0  0.0  0.0  0.0  1.0    0.0    0.0  0 100 d2
0.0  0.0  0.0  0.0  0.0  1.0    0.0    0.0  0 100 d1
0.0  0.0  0.0  0.0  0.0  9.0    0.0    0.0  0 100 d0
0.0  0.0  0.0  0.0  0.0  3.0    0.0    0.0  0 100 d13
0.0  2.0  0.0 24.0  0.0  2.0    0.0 1000.7  0 100 d12
0.0  0.0  0.0  0.0  0.0  1.0    0.0    0.0  0 100 d11

9. UNIX tools such as truss, tusc, and strace can be used to debug these scenarios if the techniques above are not sufficient. But tools such as truss and strace should be used only as a last resort. On Solaris 10, dtrace can be used effectively to debug I/O- or processor-specific issues, and dtrace is safer by design. Brendan Gregg's DTrace toolkit and a wealth of related information are available here

Guidelines for resolving a few root causes

Finding and understanding the root cause is essential to resolving a performance issue.

1. If I/O bandwidth is the issue, then doing anything other than improving I/O bandwidth is not useful. Switching to a file system that provides better write throughput is one option. Raw devices are another option. Reducing the number of log file members in a group also reduces the number of write calls, but this option comes with a cost.

2. If CPU starvation is the issue, then reducing CPU starvation is the correct step to resolve it. Increasing the priority of LGWR is a workaround.

3. If the commit rate is high, then decreasing commits is the correct step; but in the few cases where that is not possible, increasing the priority of LGWR (using nice) or moving LGWR to the RT priority class might provide some relief.

4. Solid-state disk devices can also be used if the redo rate is extreme. Even in that case, it is preferable to decrease the redo size first.

5. If excessive redo size is the root cause, it can be reduced using various techniques. More information can be found here
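
As a starting point, the redo cost of an individual statement can be measured from SQL*Plus with autotrace. Below is a minimal sketch (not from the original post); it requires the PLUSTRACE role (or select privileges on the underlying v$ views), and the names t and source_t are hypothetical placeholders for the statement being measured.

-- t and source_t are hypothetical; substitute the statement whose redo you want to measure.
set autotrace traceonly statistics
insert into t select * from source_t;
rollback;
set autotrace off
-- The 'redo size' line in the statistics section shows the redo generated by the statement.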

In summary, finding and resolving the root cause of the 'log file sync' event will improve application response time. This blog entry is also available as a PDF file here

redo_internals_scripts

62 Responses to “Tuning ‘log file sync’ wait events”

  1. Hans-Peter Sloot said

    Hi Riyaj,

    I am currently investigating performance problem of an APPS database.
    The average log file sync wait time is about 10 to 20 times the log file parallel write time.

    I thought that I was hitting bug 2669566 (its 9.2.0.8) but snapper shows that the log file parallel write time must be correct.

    I always thought that a commit always resulted in 1 log file parallel write. Statspack however shows that the number of log file parallel writes is more than 2 times the number of log file syncs.

    Do you have any clue about why log file sync is so much higher than the log file parallel write?
    If 1 log file sync results in about 2 log file parallel writes, I would expect the ratio to be a bit higher than 2, but not about 10.

    Regards Hans-Peter

  2. Hello Hans-Peter
    Thanks for visiting my blog.
    LGWR processing is much more than just writes. It is quite possible that another phase of LGWR processing has a bottleneck.
    Since this version is 9.2.0.8, ASH can’t be used. Do you have any other issues in OS? High CPU Usage ?Paging, swapping etc? Is LGWR getting enough CPU?
    What OS are you using? Depending on that, you could potentially use a couple of methods:

    1. SQLTRACE lgwr to see if there are any other issues. [ Remember SQLTRACE can slow down LGWR ]
    2. Truss or strace lgwr to see if there are any bottlenecks [ Very dangerous since truss or strace can cause LGWR crash and instance crash ]
    3. If you are in Solaris 10, then dtrace will be much better option to debug LGWR.

    Hope this helps,

    Cheers
    Riyaj

  3. Hans-Peter Sloot said

    Hi Riyaj,

    In fact, Jonathan Lewis linked me through to your article. I didn't even notice that I was on your blog.
    (You can take that as a compliment)
    Well it is on Solaris10 but I do not have root access.

    I did some arithmetic taking a random snapshot from perfstat
    nr log file sync/s : 17.17
    nr commits/s : 29.27
    nr log file parallel writes/s : 42.99
    nr redo writes/s : 11.40
    log file parallel write time/s : 343.76 ms
    redo write time/s : 352 ms
    log file parallel write time/write : 7.99 ms
    log file sync time /sync : 153.07 ms

    Observations and questions:
    1. The number of commits/s does not match the number of log file syncs/s??????
    2. The number of log file parallel writes/s is more than 2 times the number of log file sync/s
    Almost everyone assumes that because of the piggy backing there is 1 parallel write for 1 or more log file syncs.

    Since there are 2 lfpw per sync, I would expect the log file sync time to be about 2 times the lfpw time.
    But it is way, way higher: 153 ms instead of 2 * 8 ms.
    I will see if I can have root privileges.

    Do you have an explanation for the difference between the commit/s and log file sync/s.

    Regards Hans-Peter

  4. […] dtruss, log file sync, oracle, performance, pfiles, pstack After reading my earlier blog about log file sync tuning a reader sent an interesting issue, worth blogging about […]

  5. Dan Mamur said

    Hi,


    <>

    When you say a 60-minute statspack report with 8 CPUs, isn't the statspack report for 28,800 CPU seconds and not 14,400? Also, CPU usage drops to 25%. Please correct me if I am wrong.

  6. Kam said

    I had a bit of trouble finding the sid for my LGWR process. Here is the sql I came up with instead of hard coding to 3.

    select
    e.sid
    ,e.event
    ,e.time_waited
    ,e.time_waited_micro
    from
    v$session_event e
    ,v$session s
    where
    e.sid=s.sid
    and s.program like '%LGWR%'
    order by
    3
    ;

    Thanks for the posting.
    -Kam

  7. Kay Guo said

    Best article I've read so far on log file sync. Thanks Riyaj. I have one follow-up question: I have a Data Guard setup with high waits on log file sync. When querying
    select sid, event, time_waited, time_waited_micro
    from v$session_event
    where sid=4 -- sid of LGWR
    order by 3

    I got these results –
    EVENT TIME_WAITED TIME_WAITED_MICRO
    direct path read 0 1284
    direct path write 1 6146
    enqueue 12 123655
    control file sequential read 64 635605
    latch free 85 853908
    log file sequential read 129 1286032
    log file single write 271 2710066
    process startup 393 3932119
    control file parallel write 507 5069756
    LGWR wait for redo copy 5354 53544478
    LGWR wait on full LNS buffer 245762 2457615221
    log file parallel write 713357 7133574722
    LGWR wait on LNS 6582128 65821282499
    LGWR-LNS wait on channel 6584458 65844584519
    rdbms ipc message 23038515 230385146069

    My question is: do waits on 'LGWR-LNS wait on channel' or 'LGWR wait on LNS' contribute to 'log file sync'?
    Thanks.

    • Hi Kay
      Thank you for reading my blog and sorry for the late reply.
      I don't really know your Data Guard setup, so I have more questions than answers 😦
      Are you using SYNC or ASYNC mode? Looking at your output, it looks like you may be able to improve DG performance by increasing the network buffer size.
      Can you also please post a few lines from statspack? And more importantly, how good are your network throughput and latency?

      Cheers
      Riyaj

  8. Kay Guo said

    sorry for the late response. I signed up for email notification if/when you replied to the thread, but never seemed to get the email.

    My main goal is to find out whether ‘LGWR wait on LNS’ is part of ‘log file sync’ wait. Based on the following observations, I think the answer is yes. Please let me know if it’s incorrect or inaccurate.

    Our Data Guard setup is LGWR ASYNC. My interpretation of this issue is that, no matter whether it's LGWR SYNC (where LGWR waits for the acknowledgement of the RFS process on the remote server before posting back to the foreground/user session that the write has completed) or LGWR ASYNC (where LGWR waits for the acknowledgement of the LNS process on the local server before posting back to the user session), network latency does contribute to log file sync. In the case of ASYNC, the impact is indirect, as LGWR will only get delayed if the LNS buffer is full.

    When we had network latency issue
    /usr/sbin/ping -s cmc9asff6
    PING cmc9asff6: 56 data bytes
    64 bytes from cmc9asff6.ny.jpmorgan.com (10.3.61.13): icmp_seq=0. time=916. ms
    64 bytes from cmc9asff6.ny.jpmorgan.com (10.3.61.13): icmp_seq=1. time=111. ms
    64 bytes from cmc9asff6.ny.jpmorgan.com (10.3.61.13): icmp_seq=2. time=127. ms

    the top events in statspack are
    Top 5 Timed Events
    ~~~~~~~~~~~~~~~~~~                                      % Total
    Event                            Waits     Time (s)    Ela Time
    ---------------------------- ------------ ----------- --------
    log file sync                       7,105       6,864    70.46
    CPU time                                          711     7.30
    LGWR wait on LNS                  211,478         603     6.19
    LGWR-LNS wait on channel          105,687         603     6.19
    LNS wait on SENDREQ                   104         600     6.16
    -------------------------------------------------------------

    When network resumed to normal
    64 bytes from mcc13asff3.mcc.del.jpmorganchase.com (10.31.73.17): icmp_seq=486. time=3. ms
    64 bytes from mcc13asff3.mcc.del.jpmorganchase.com (10.31.73.17): icmp_seq=487. time=4. ms
    64 bytes from mcc13asff3.mcc.del.jpmorganchase.com (10.31.73.17): icmp_seq=488. time=4. ms
    Statspack showed this:
    Top 5 Timed Events
    ~~~~~~~~~~~~~~~~~~                                      % Total
    Event                            Waits     Time (s)    Ela Time
    ---------------------------- ------------ ----------- --------
    CPU time                                        7,709    64.01
    db file scattered read            383,007       2,624    21.79
    db file sequential read           169,998       1,116     9.27
    log file parallel write           323,416         261     2.16
    LNS wait on SENDREQ                 1,019         139     1.15
    -------------------------------------------------------------

    Metalink note 34592.1 seems to confirm this interpretation –
    step 3 (time for the log write I/O to complete) also includes the time for the network write and the RFS/redo write to the standby redo logs.

    Thanks for your time and answers.

    Kay

    • Hello Kay

      From your statspack, I think you are correct. Even though LGWR ASYNC mode prevents the primary from encountering 'log file sync' waits due to network latency, it doesn't necessarily prevent them when LGWR-LNS communication and messaging has latencies. In your case, the statspack output indicates that you may be running into issues with LNS buffer flushing or LNS communication not completing quickly enough. I guess truss or dtrace of LGWR will confirm this.

      I guess, LNS process tuning is in order to resolve this intermittent issue. Of course, network latency is another area to look at.

      Cheers
      Riyaj

  9. Kay Guo said

    Thanks.. LGWR-LNS communication and messaging latency is a more accurate way of putting it. In our case, this latency happened to be caused by network. Noted your point that there could be other reasons that slow down LNS too. Thanks for your help.

  10. A_R_T_Y_E_M said

    Thanks for this post, and sorry for my bad English.
    I have these top waits:
    control file sequential read;
    log file sync;
    enq: CF - contention.
    Do you have an idea where the bottleneck is, and what to do?
    Every hour, RMAN runs 'backup archivelog all delete input all' (to another server over NFS). RMAN works without a catalog.
    Also, the DB is monitored by Grid Control (user DBSNMP).
    Thanks.

    • Hi
      I am afraid that we need more information to debug this issue. Can you please print top 5 wait events from statspack or AWR report here and also let us know duration of statspack or AWR report? That should be a good starting point to debug this.

      My English language skills are bad too. I am sure native English language speakers will identify grammatical errors in my blog entries within seconds 🙂

      Cheers
      Riyaj

  11. A_R_T_Y_E_M said

    Hi, thanks for answer 🙂

    Begin Snap: 7961 25-Feb-09 00:00:46
    End Snap: 7976 25-Feb-09 15:00:12

    Top 5 Timed Events                                           Avg %Total
    ~~~~~~~~~~~~~~~~~~                                          wait   Call
    Event                             Waits    Time (s)   (ms)  Time  Wait Class
    ------------------------------ ----------- ---------- ----- ----- ----------
    CPU time                                       16,498        62.7
    control file sequential read    10,561,558      5,342     1  20.3 System I/O
    db file sequential read            351,668      1,877     5   7.1 User I/O
    log file sync                       80,092      1,598    20   6.1 Commit
    db file scattered read             314,699      1,126     4   4.3 User I/O
    ....
                                               %Time  Total Wait  Avg wait  Waits
    Event                             Waits    -outs    Time (s)      (ms)   /txn
    ----------------------------- ----------- ------  ----------  --------  -----
    enq: CF - contention                1,180   15.3         911       772    0.0

    ….

    Enqueue Activity DB/Inst:XXXXXXXXX Snaps: 7961-7976
    -> only enqueues with waits are shown
    -> Enqueue stats gathered prior to 10g should not be compared with 10g data
    -> ordered by Wait Time desc, Waits desc

    Enqueue Type (Request Reason)
    ——————————————————————————
    Requests Succ Gets Failed Gets Waits Wt Time (s) Av Wt Time(ms)
    ———— ———— ———– ———– ———— ————–
    CF-Controlfile Transaction
    93,938 93,352 586 881 932 1,058.24
    TX-Transaction (row lock contention)
    15 15 0 4 85 21,160.00
    TC-Tablespace Checkpoint
    3 3 0 1 4 4,030.00
    TX-Transaction (index contention)
    1,261 1,261 0 2 0 145.00
    ————————————————————-

    End value
    Parameter Name Begin value (if different)
    —————————– ——————————— ————–
    _log_io_size 256
    _pga_max_size 905932800
    _smm_max_size 442350
    archive_lag_target 900
    audit_file_dest /opt/oracle/admin/XXXX/adump
    audit_sys_operations TRUE
    audit_trail DB
    background_dump_dest /opt/oracle/admin/XXXX/bdump
    compatible 10.2.0.3.0
    control_file_record_keep_time 15
    control_files +DGROUP1/xxxx/controlfile/current
    core_dump_dest /opt/oracle/admin/XXXX/cdump
    db_block_size 8192
    db_create_file_dest +DGROUP1
    db_create_online_log_dest_1 +DGROUP1
    db_create_online_log_dest_2 /vol/u00/oradata
    db_domain ???????????????
    db_file_multiblock_read_count 16
    db_flashback_retention_target 4320
    db_name XXXX
    db_recovery_file_dest /vol/u00/oradata
    db_recovery_file_dest_size 214748364800
    db_unique_name XXXX
    fal_client XXXX__PRD1
    fal_server XXXX_PRD2
    fast_start_mttr_target 90
    filesystemio_options SETALL
    job_queue_processes 50
    log_archive_dest_1 LOCATION=USE_DB_RECOVERY_FILE_DES
    log_archive_dest_2 LOCATION=+DGROUP1
    log_archive_dest_3 SERVICE=XXXXX_PRD2 LGWR ASYNC=2048
    log_archive_dest_state_3 ENABLE
    log_archive_format CCDB_%T_%S_%R.arc
    log_archive_max_processes 8
    log_buffer 30519296
    log_checkpoints_to_alert FALSE
    open_cursors 300
    pga_aggregate_target 1509949440
    plsql_native_library_dir /opt/oracle/product/10.2.0/db/lib
    processes 400
    query_rewrite_enabled FALSE
    recyclebin OFF
    remote_login_passwordfile EXCLUSIVE
    service_names ????????????????????
    sessions 445
    sga_max_size 4294967296
    sga_target 4294967296
    standby_file_management AUTO
    streams_pool_size 0
    undo_management AUTO
    undo_retention 10800
    undo_tablespace UNDOTBS1
    user_dump_dest /opt/oracle/admin/CCDB/udump
    ————————————————————-

    Thanks

    • Hello
      First of all, an AWR duration of 15 hours is too long. An AWR report is aggregated, and so it is very hard to decipher details based upon aggregated data.

      What is that you are trying to tune? A specific business process?

      I will try to help.
      1. It looks like write performance to the log files is not great; a 12 ms average for 'log file parallel write' is not good enough. Here are a couple of things you can do to improve throughput to the log files:
      a) Try using direct and asynchronous I/O. I am not sure whether the underlying file system supports direct I/O or not; you might want to research that. I do see that filesystemio_options is set to setall, which means that Oracle will try to use asynchronous and direct I/O if the file system supports it.
                                            %Time  Total Wait   Avg wait   Waits
      Event                          Waits   -outs    Time (s)       (ms)    /txn
      ---------------------------- -------- ------- ----------- ---------- ------
      ..
      log file parallel write        90,505      .0       1,114         12    1.2
      log file sequential read       82,154      .0       1,057         13    1.1
      ..
      control file parallel write    48,544      .0         326          7    0.6
      control file single write       8,255      .0         154         19    0.1
      ..
      LGWR-LNS wait on channel        4,380   100.0          47         11    0.1
      ..

      2. The same applies to the control files: control file single writes are taking an average of 19 ms. Much of the control file writing is performed under the protection of the CF enqueue, and so the CF enqueue waits may be related to this issue.
      3. It looks like LGWR ASYNC is used for log transport to the standby. You might want to consider tuning the LNS process also; even simply increasing the buffer size might be a good start.
      4. _log_io_size is set to 256, and I am not sure why. On most platforms that will trigger LGWR after 128 KB of redo, which might make LGWR hyperactive. I don't know your application well enough to recommend something here; check the ratio between redo size and user commits + user rollbacks to see whether that parameter value is warranted.
      5. Last but not least, there are bugs in 10.2.0.3 with LGWR and 'log file sync' events. In some cases, LGWR is not efficient enough when the workload is not very high. Read note 34592.1 for further information. You might want to think about a 10.2.0.4 upgrade; of course, test your upgrade thoroughly 🙂

      In summary, this goes to basic things. Improve write throughput to redo log files and control files. Eliminate bugs with better software etc.

      It is also critical to review this again with an AWR report of 30 minutes or so.

  12. Chan said

    Hi Riyaj,

    We had slow performance on the database, and the requests from the app server were queued up. I looked at the AWR and ASH reports. I saw 'log file sync' as one of the top 5 wait events in AWR, and it is listed under Blocking Sessions in the ASH report. The blocking session with 'log file sync' by LGWR was there during the same time period as the problem. I do not see high CPU usage or user commits during that time.

    
    Top 5 Timed Events                                         Avg %Total
    ~~~~~~~~~~~~~~~~~~                                        wait   Call
    Event                                 Waits    Time (s)   (ms)   Time Wait Class
    ------------------------------ ------------ ----------- ------ ------ ----------
    CPU time                                          6,458          90.2
    db file sequential read             318,605         644      2    9.0   User I/O
    log file sync                        22,952          96      4    1.3     Commit
    log file parallel write              23,407          80      3    1.1 System I/O
    read by other session                15,400          53      3    0.7   User I/O
    
    Statistic                                       Total
    -------------------------------- --------------------
    AVG_BUSY_TIME                                 236,562
    AVG_IDLE_TIME                                 125,022
    AVG_IOWAIT_TIME                                 7,748
    AVG_SYS_TIME                                   44,124
    AVG_USER_TIME                                 192,347
    BUSY_TIME                                     946,610
    IDLE_TIME                                     500,474
    IOWAIT_TIME                                    31,357
    SYS_TIME                                      176,865
    USER_TIME                                     769,745
    LOAD                                                3
    OS_CPU_WAIT_TIME                              133,200
    RSRC_MGR_CPU_WAIT_TIME                              0
    VM_IN_BYTES                                   999,424
    VM_OUT_BYTES                                        0
    PHYSICAL_MEMORY_BYTES                  33,569,898,496
    NUM_CPUS                                            4
              -------------------------------------------------------------
    
    Statistic                                     Total     per Second     per Trans
    -------------------------------- ------------------ -------------- -------------
    total number of times SMON poste                 20            0.0           0.0
    transaction rollbacks                            77            0.0           0.0
    transaction tables consistent re                  0            0.0           0.0
    transaction tables consistent re                  0            0.0           0.0
    undo change vector size                 110,010,444       30,452.7       6,225.5
    user I/O wait time                           69,807           19.3           4.0
    user calls                               12,892,591        3,568.9         729.6
    user commits                                 17,437            4.8           1.0
    user rollbacks                                  234            0.1           0.0
    
    
    But I see log file sync under Blocking session in ASH
    
    Top Blocking Sessions        DB/Inst: COMMON5/common5  (May 10 15:15 to 15:30)
    -> Blocking session activity percentages are calculated with respect to
          waits on enqueues, latches and "buffer busy" only
    -> '% Activity' represents the load on the database caused by
          a particular blocking session
    -> '# Samples Active' shows the number of ASH samples in which the
          blocking session was found active.
    -> 'XIDs' shows the number of distinct transaction IDs sampled in ASH
          when the blocking session was found active.
    
       Blocking Sid % Activity Event Caused                      % Event
    --------------- ---------- ------------------------------ ----------
    User                 Program                          # Samples Active     XIDs
    -------------------- ------------------------------ ------------------ --------
          771,    1       4.12 log file sync                        4.12
                                                               0/90 [  0%]        0
    

    Did it cause the performance issue on the database?

    • Hello Chan

      What is the duration of your AWR report? It may be for 1 hour or 30 minutes, and so aggregation might be hiding the details. Even an ASH report of 15 minutes is not granular enough.

      This might need a little bit more research using v$active_session_history.

      (i) Try to find what sessions were waiting for using the script: http://www.orainternals.com/scripts_perf1.php – ash_top20.sql (or) use this query:

      select * from (
      select event,  inst_id, 
         sum(decode(ash.session_state,'WAITING',1,0)) cnt_waiting
      from  gv$active_session_history ash
      where sample_time between sysdate - (&&start_min /( 60*24) ) and sysdate-(&&end_min/(60*24))
      group by event,inst_id
      order by 3 desc
      ) where rownum <=30
      /
      

      (ii) If data in v$active_session_history is still available, review LGWR activity using ash_lgwr.sql script.

      (iii) Create ash report for very specific duration:

      select output from table(dbms_workload_repository.ash_report_text( (select dbid from v$database),
                                   1,
                                   sysdate- &&start_min/(24*60),
                                   sysdate- &&end_min/(24*60),
                                   0))
      /
      
      • I forgot to point out that the AWR report has OS_CPU_WAIT_TIME of 133,200, while USER_TIME is 769,745. So, at some point during this AWR report duration, there were waits for CPUs. It is possible that LGWR was also affected by this high CPU usage on the server. That, in turn, can lead to massive waits for the 'log file sync' event.

  13. Daniel said

    “Then, process goes to sleep with semtimedop call, in its own semaphore.
    Semaphore set id is 9, but semnum is 118 which is for the user process I was tracing.”

    How do we know that semnum 118 belongs to the user process being traced? Is there any method to map a process to a semaphore set id and semnum? I am trying to map the semnum to the Oracle pid from v$process, but it doesn't map correctly.

    • Hello Daniel
      I need to think about this little bit. It is possible to do that by hooking a simple dtrace script with sem* calls and print them. I am trying to see if there is an easier way.

      Cheers
      Riyaj

  14. Muthu said

    Riyaj,

    Wonderful article and happy to read you. We’ve an issue with the same wait event. Here’s our scenario :

    We're running Oracle EBS 11i on AIX 5.3 (16 CPUs and 60 GB RAM). The SGA is 3 GB and log_buffer is 10 MB. The database is running in noarchivelog mode.

    This is our breakfix environment and only one concurrent request is running for hours, waiting on this and ‘db file sequential read’ on an Insert statement.

    I collected the stats as follows :

      SID  STAT# NAME                                      VALUE
    ----- ------ ------------------------------ ----------------
       95    115 redo size                          -708,973,184
       95      5 user rollbacks                                0
       95    117 redo wastage                                  0
       95    118 redo writer latching time                     0
       95    121 redo write time                               0
       95    125 redo ordering marks                           0
       95    124 redo log switch interrupts                    0
       95    120 redo blocks written                           0
       95    119 redo writes                                   0
       95      0 logons cumulative                             1
       95      1 logons current                                1
       95      4 user commits                                  3
       95    116 redo buffer allocation retries               30
       95    122 redo log space requests                      38
       95      6 user calls                                   75
       95    123 redo log space wait time                    425
       95     73 redo synch time                       2,265,549
       95     72 redo synch writes                    18,909,918
       95    114 redo entries                         60,812,763
    
    Output from Tanel's tool : (edited for clarity, of course)
    
    ----------------------------------------------------------------
     SID, TYPE, STATISTIC                  DELTA, HDELTA/SEC,  %TIME
    ----------------------------------------------------------------
      95, WAIT, log file sync            2076179,   692.06ms,  69.2%
      95, WAIT, db file sequential read    58498,     19.5ms,   1.9%
    --  End of snap 1, end=2009-08-07 18:06:06, seconds=3
    

    User commits is low; so is user calls and redo blocks written is 0. Only redo synch time, redo synch writes and redo entries are huge.

    (And the redo size is a negative number. Metalink says this is a bug in pre-10g releases (ours is 9.2.0.7) and advises referring to 'redo blocks written' instead.)

    I had checked the CPU usage. In the morning, there was a request set that launched 5 child threads; they all consumed CPU to the tune of more than 50%, and I waited for their completion. But even after they completed and the CPU usage came down to under 10%, the issue remained the same.

    I had discussed with our OS sysadmin and he had involved the SAN guys and confirmed that, though the filesystem is active, it’s quite normal.

    Can you help me what to look for, further ?

    Thanks
    Muthu

  15. Hello Muthu
    I am really not sure what problem you are trying to address here. Are you trying to tune that concurrent request, or are you trying to identify why there is high CPU usage?

    If you are trying to tune the request, I guess redo size needs to be reduced, or LGWR and its components need to be tuned. Since there are 5 requests running, all of them may be generating a lot of redo. Is this a custom program or a standard seeded program? Is there a possibility of reducing redo size by dropping a few indices on that table and rebuilding them later? Look for opportunities to tune them.

    Also, are all these inserts using bind variables, or are they 'insert into table select * from table1' style SQL?

    How many log members do you have? What type of file system? How big are the log files?

    Sorry to ask you more questions, but I would like to understand the root cause before giving you some suggestions.

    Cheers
    Riyaj

  16. Muthu said

    Hello Riyaj,

    Thanks for the quick update. I'm trying to address the concurrent request that took forever due to the 'log file sync' wait event. I mentioned the other request set and its CPU usage (which is not what people are worrying about) because you had asked to check the CPU usage (guideline #2).

    Here’re the answers for your questions :

    This is the standard Oracle report (Journal Entries Report). It’s a good point to drop/rebuild the indices to reduce the redo generation. I’ve never thought about that. Since this is not a real production, I think, we can think about it.

    It’s a straight Insert statement with bind variables.

    We have 4 log groups with one member each, 1.5 GB each. Ours is a JFS2 file system.

    The issue here is that, this is the break fix environment that’s being refreshed every night from the copy from production.

    Thanks once again and cheers
    Muthu

  17. Hello Muthu
    Sorry, it took a while to respond.
    Do you have a tkprof for this report run? Is the parent job slow, or is the report slow? What is the short_name?
    There have been many performance issues reported for the Journal Entries report. One prominent bug I see is that the parent job deletes from ar_journal_interim; the fix was to truncate that table instead of deleting from it. Bug 2830651.
    I guess we will have to look at the tkprof output files to see where the slowness is. If you are trying to tune the instance for 'log file sync' issues, converting the delete statement to a truncate might help.

    Cheers
    Riyaj

  18. […] Tuning ‘log file sync’ wait events By Riyaj Shamsudeen […]

  19. […] a couple of pretty good ones: Manly-Men Only Use Solid State Disk for Redo Logging – Kevin Closson Tuning Log File Sync Wait Events – Riyaj Shamsudeen Hotsos 2010 – Day 4 – Doug Burns (see the comments […]

  20. […] which in turn is spending most of its time on log file sync wait to a slow NFS disk, and autocommit is not supported for AQ so there is additional […]

  21. […] reading my earlier blog about log file sync tuning a reader sent an interesting issue, worth blogging about […]

  22. hi
    This is a very nice blog. Can I get a description like this of all the wait events? That would make it easier for me to analyze my AWR reports.
    Thanks if you can help me.

    Parvez Makhani

  23. Hi Riyaj

    Great article, thanks. I’ve been playing around with strace and lsof +fG on my linux server tracing the lgwr process but cannot find any evidence that redo data is opened/written in DSYNC mode as you mention above:

    “…redo log files are opened with DSYNC flag”.

    How can we confirm DSYNC is indeed being used?

    • Hello
      On the Solaris platform, pfiles will show that the files are opened with the DSYNC flag. Here is a small example from a Solaris VM.

      Truss of LGWR shows that LGWR is submitting Asynch I/O :
      kaio(AIOWRITE, 261, 0x602A4600, 512, 0xFC73DDD00D684A00) = 0

      File descriptor 261 of the LGWR process, as shown by the pfiles command:

      261: S_IFCHR mode:0755 dev:291,0 ino:15729286 uid:601 gid:503 rdev:30,321
      O_RDWR|O_NONBLOCK|O_DSYNC|O_LARGEFILE FD_CLOEXEC
      /devices/iscsi/disk@0000iqm.demo.volumes-san0001,4:b,raw

      As you can see, LGWR has opened the file with the DSYNC flag.

      I need to probe Linux to see how to do this, but I would imagine lsof should do the trick.

      Cheers
      Riyaj

  24. […] Excessive commits can lead to performance problems via log file sync waits . […]

  25. […] Riyaj Shamsudee: “Tuning ‘log file sync’ wait events” […]

  26. Selvam said

    Thanks Riyaj for a very good post.
    Thanks,
    Selvam

  27. Deem said

    Hello
    I really need your help
    Se the below addm report

    ADDM Report for Task ‘TASK_51215’
    ———————————

    Analysis Period
    —————
    AWR snapshot range from 41947 to 41959.
    Time period starts at 12-MAR-13 06.00.45 AM
    Time period ends at 12-MAR-13 06.00.02 PM

    Analysis Target
    —————
    Database ‘PR1′ with DB ID 759962060.
    Database version 11.2.0.2.0.
    ADDM performed an analysis of instance PR1, numbered 1 and hosted at cnuspp01.

    Activity During the Analysis Period
    ———————————–
    Total database time was 1931342 seconds.
    The average number of active sessions was 44.75.

    Summary of Findings
    -------------------
       Description                                 Active Sessions       Recommendations
                                                   Percent of Activity
       ------------------------------------------  -------------------   ---------------
       1  Commits and Rollbacks                     29.38 | 65.66         1
       2  I/O Throughput                            12.78 | 28.55         2
       3  Undersized Buffer Cache                   3.23  | 7.22          1
       4  Top Segments by "User I/O" and "Cluster"  .99   | 2.2           1

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Findings and Recommendations
    —————————-

    Finding 1: Commits and Rollbacks
    Impact is 29.38 active sessions, 65.66% of total activity.
    ———————————————————-
    Waits on event “log file sync” while performing COMMIT and ROLLBACK operations
    were consuming significant database time.

    Recommendation 1: Host Configuration
    Estimated benefit is 29.38 active sessions, 65.66% of total activity.
    ———————————————————————
    Action
    Investigate the possibility of improving the performance of I/O to the
    online redo log files.
    Rationale
    The average size of writes to the online redo log files was 176 K and
    the average time per write was 255 milliseconds.
    Rationale
    The total I/O throughput on redo log files was 636 K per second for
    reads and 1.2 M per second for writes.
    Rationale
    The redo log I/O throughput was divided as follows: 0% by RMAN and
    recovery, 66% by Log Writer, 0% by Archiver, 0% by Streams AQ and 33% by
    all other activity.

    Symptoms That Led to the Finding:
    ———————————
    Wait class “Commit” was consuming significant database time.
    Impact is 29.38 active sessions, 65.66% of total activity.

    Finding 2: I/O Throughput
    Impact is 12.78 active sessions, 28.55% of total activity.
    ———————————————————-
    The throughput of the I/O subsystem was significantly lower than expected.

    Recommendation 1: Host Configuration
    Estimated benefit is 12.78 active sessions, 28.55% of total activity.
    ———————————————————————
    Action
    Consider increasing the throughput of the I/O subsystem. Oracle’s
    recommended solution is to stripe all data files using the SAME
    methodology. You might also need to increase the number of disks for
    better performance.
    Rationale
    During the analysis period, the average data files’ I/O throughput was
    1.3 M per second for reads and 945 K per second for writes. The average
    response time for single block reads was 89 milliseconds.

    Recommendation 2: Host Configuration
    Estimated benefit is 2.4 active sessions, 5.36% of total activity.
    ——————————————————————
    Action
    The performance of some data and temp files was significantly worse than
    others. If striping all files using the SAME methodology is not
    possible, consider striping these file over multiple disks.
    Rationale
    For file /oracle/PR1/sapdata2/dat_6/dat.data6, the average response time
    for single block reads was 265 milliseconds, and the total excess I/O
    wait was 22709 seconds.
    Related Object
    Database file
    “/oracle/PR1/sapdata2/dat_6/dat.data6”
    Rationale
    For file /oracle/PR1/sapdata2/dat_10/dat.data10, the average response
    time for single block reads was 279 milliseconds, and the total excess
    I/O wait was 20281 seconds.
    Related Object
    Database file
    “/oracle/PR1/sapdata2/dat_10/dat.data10”
    Rationale
    For file /oracle/PR1/sapdata1/dat_5/dat.data5, the average response time
    for single block reads was 249 milliseconds, and the total excess I/O
    wait was 20268 seconds.
    Related Object
    Database file
    “/oracle/PR1/sapdata1/dat_5/dat.data5”
    Rationale
    For file /oracle/PR1/sapdata4/dat_21/dat.data21, the average response
    time for single block reads was 124 milliseconds, and the total excess
    I/O wait was 20230 seconds.
    Related Object
    Database file
    “/oracle/PR1/sapdata4/dat_21/dat.data21”
    Rationale
    For file /oracle/PR1/sapdata3/dat_12/dat.data12, the average response
    time for single block reads was 251 milliseconds, and the total excess
    I/O wait was 20001 seconds.
    Related Object
    Database file
    “/oracle/PR1/sapdata3/dat_12/dat.data12”

    Symptoms That Led to the Finding:
    ———————————
    Wait class “User I/O” was consuming significant database time.
    Impact is 13.61 active sessions, 30.41% of total activity.

    Finding 3: Undersized Buffer Cache
    Impact is 3.23 active sessions, 7.22% of total activity.
    ——————————————————–
    The buffer cache was undersized causing significant additional read I/O.
    The value of parameter “db_cache_size” was “13568 M” during the analysis
    period.

    Recommendation 1: Database Configuration
    Estimated benefit is .63 active sessions, 1.41% of total activity.
    ——————————————————————
    Action
    Increase the buffer cache size by setting the value of parameter
    “db_cache_size” to 14784 M.

    Symptoms That Led to the Finding:
    ———————————
    Wait class “User I/O” was consuming significant database time.
    Impact is 13.61 active sessions, 30.41% of total activity.

    Finding 4: Top Segments by “User I/O” and “Cluster”
    Impact is .99 active sessions, 2.2% of total activity.
    ——————————————————
    Individual database segments responsible for significant “User I/O” and
    “Cluster” waits were found.

    Recommendation 1: Segment Tuning
    Estimated benefit is .99 active sessions, 2.2% of total activity.
    —————————————————————–
    Action
    Investigate application logic involving I/O on TABLE “SAPDAT.INDX” with
    object ID 355222.
    Related Object
    Database object with ID 355222.
    Rationale
    The I/O usage statistics for the object are: 0 full object scans,
    2687152 physical reads, 2462181 physical writes and 0 direct reads.

    Symptoms That Led to the Finding:
    ———————————
    Wait class “User I/O” was consuming significant database time.
    Impact is 13.61 active sessions, 30.41% of total activity.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Additional Information
    ———————-

    Miscellaneous Information
    ————————-
    Wait class “Application” was not consuming significant database time.
    Wait class “Concurrency” was not consuming significant database time.
    Wait class “Configuration” was not consuming significant database time.
    CPU was not a bottleneck for the instance.
    Wait class “Network” was not consuming significant database time.
    Session connect and disconnect calls were not consuming significant database
    time.
    Hard parsing of SQL statements was not consuming significant database time.

  28. yousuf said

    Thanks for nice post..

    Really you are amazing…..

  29. yousuf said

    Just to get clarity on question..

    Does the LGWR process write to both members in parallel at the same time? (Example: say Group 1 has two members, i.e., multiplexing.)

    And is it only after writing to both members of Group 1 that LGWR signals/reports success back to the user process?

    Please suggest..

    • Hello Yousuf
      LGWR can submit concurrent asynchronous I/O requests to multiple redo log members, if asynchronous I/O is enabled. If not, LGWR will execute write system calls in succession. And, Yes, writes to both members must complete before the commit is declared successful.

      Thanks
      Riyaj

  30. yousuf said

    Please validate my question..

  31. […] Here is the article: https://orainternals.wordpress.com/2008/07/07/tuning-log-file-sync-wait-events/ […]

  32. Umesh said

    Very nice information. Here are few more steps to solve log file sync.

    http://www.dbas-oracle.com/2013/07/Solve-log-file-sync-wait-event-and-log-file-parallel-wait-event.html

  33. Thanks for a great article Riyaj (and to Mr. Poder for writing the snapper tool).

    I’ve a brief question regarding an output from the snapper utility (sql> @snapper4 all 1 60 lgwr ):

    --  End of ASH snap 47, end=2014-03-12 14:00:17, seconds=1, samples_taken=1
    --------------------------------------------------------------------------------------------------------------------------
    TYPE, STATISTIC                    ,         DELTA, HDELTA/SEC,    %TIME, GRAPH       , NUM_WAITS,  WAITS/SEC,   AVERAGES                                    
    --------------------------------------------------------------------------------------------------------------------------
    STAT, messages received            ,            79,      86.74,         ,             ,          ,           ,
    STAT, background timeouts          ,             1,        1.1,         ,             ,          ,           ,
    STAT, physical write total IO ...  ,           158,     173.49,         ,             ,          ,           ,
    STAT, physical write total mul...  ,           158,     173.49,         ,             ,          ,           ,
    STAT, physical write total bytes   ,        411648,       452k,         ,             ,          ,           ,
    STAT, redo wastage                 ,         29556,     32.45k,         ,             ,          ,           ,
    STAT, redo writes                  ,            80,      87.84,         ,             ,          ,           ,
    STAT, redo blocks written          ,           404,     443.61,         ,             ,          ,           ,
    STAT, redo write time              ,            21,      23.06,         ,             ,          ,           ,
    TIME, background cpu time          ,        290000,   318.43ms,    31.8%, [####      ],          ,           , 
    TIME, background elapsed time      ,       1749291,      1.92s,   192.1%, [##########],          ,           , 
    WAIT, rdbms ipc message            ,        693308,   761.28ms,    76.1%, [WWWWWWWW  ],        77,      84.55,        9ms 
    WAIT, log file parallel write      ,        159587,   175.23ms,    17.5%, [WW        ],        83,      91.14,     1.92ms 
    

    The application is several perl processes processing incoming messages on several message queues. There are 4 perl processes (per queue) polling every 10 ms (an average of every 2.5 ms). The response time requirement is 100 ms, but we see log file sync waits in the trace above 300 ms:

    WAIT #4: nam='log file sync' ela= 345443 buffer#=7674 p2=0 p3=0 obj#=200024 tim=64401202455671
    

    Do you know what the “TIME, background cpu time” & “TIME, background elapsed time” could come from?

    The Oracle documentation writes:
    background elapsed time: Amount of elapsed time (in microseconds) consumed by database background processes.
    background CPU time: Amount of CPU time (in microseconds) consumed by database background processes.

    But this does not really explain what’s going on during this time, or why it shows up as “background elapsed/cpu time”.
    Or why it accounts for more than 100% of the time (there is only 1 lgwr process)?

    Thanks again Riyaj!

    • Hi Lasse
      Background elapsed time and background cpu time come from the v$sess_time_model view. Background processes charge time to the background% statistics instead of the DB time and DB CPU statistics.
      Background elapsed time ~= background cpu time + background wait time.
      In your example, 31% of the time was spent on CPU and 76% was spent waiting on the idle wait event (rdbms ipc message).
      Is your log file sync wait time consistently that high? Can you print the histogram using the script: https://orainternals.files.wordpress.com/2008/07/event_histogram_delta-sql.doc (a minimal point-in-time sketch also follows below)?
      Also, check the LGWR trace file to see if the adaptive log file sync feature is in play. Of course, it would be great if you can post the RDBMS and OS versions.
      I have seen numbers not adding up due to various rounding issues with the statistics. In your case, the 1-second sample window seems to be causing the rounding issues. Can you increase the time window to 10 seconds to see if you can eliminate the rounding errors?

      Cheers
      Riyaj
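
      A minimal point-in-time version of that check (the linked delta script is the better tool; this is only a quick sketch against the standard views) could look like this:

          -- point-in-time wait histogram for 'log file sync'
          select event, wait_time_milli, wait_count
            from v$event_histogram
           where event = 'log file sync'
           order by wait_time_milli;

          -- time model statistics charged by the LGWR session
          select stm.stat_name, stm.value
            from v$sess_time_model stm, v$session s, v$bgprocess b
           where b.name  = 'LGWR'
             and s.paddr = b.paddr
             and stm.sid = s.sid
             and stm.stat_name like 'background%'
           order by stm.value desc;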

  34. […] https://orainternals.wordpress.com/2008/07/07/tuning-log-file-sync-wait-events/ […]

  35. google said

    We’re a group of volunteers and starting a new scheme
    in our community. Your site offered us with valuable info to work
    on. You’ve done an impressive job and our entire community will be thankful
    to you.

  36. Hi superb website! Does running a blog like this require a
    lot of work? I’ve very little understanding of computer programming
    however I had been hoping to start my own blog in the near future.
    Anyways, if you have any recommendations or tips for new blog owners please share.

    I know this is off subject however I just
    had to ask. Many thanks!

  37. Juergen said

    Hi Riyaj

    Great article/blog, thank you very much.

    I have the following performance problem – log file sync increases while switching a single tablespace into backup mode (the impact is not visible for “alter database begin backup”).
    In the application (SAP) all work processes are in status “commit” 😉 The biggest impact comes from switching the UNDO tablespace into backup mode. I think this is also related to our db_cache_size of 155GB.

    Under normal operation the db file sequential read time is 3 ms and log file sync is 7 ms – and I also see no “private strand flush”/”cannot allocate new log”/”checkpoint not complete” messages in the alert log.

    Any idea?

    Kind regards Juergen

    • Hello Juergen
      Thank you for visiting my blog. ‘alter database begin backup’ and ‘alter tablespace begin backup’ commands work differently.
      Which database/OS versions are you on?
      Begin backup commands trigger a checkpoint so that all blocks of the tablespace’s files are written to disk, which makes DBWR work harder; that DBWR activity can in turn trigger LGWR writes and inflate log file sync.
      To debug this, I would start by understanding what DBWR and LGWR are doing at that time. Since the application is suffering from commit issues, focus on LGWR: review the LGWR session statistics (a query sketch follows below), and strace LGWR to see whether it is stuck on I/O, etc.
      If you don’t mind, gmail a text AWR report to rshamsud; of course, remove all sensitive information first.
      Thanks,
      Riyaj
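
      As a starting point for looking at the LGWR session during the backup-mode window, a simple sketch against the session statistics views (the 'redo%' filter is only an example; widen it as needed) could be:

          -- redo-related statistics for the LGWR session
          select n.name, st.value
            from v$sesstat st, v$statname n, v$session s, v$bgprocess b
           where b.name = 'LGWR'
             and s.paddr = b.paddr
             and st.sid = s.sid
             and st.statistic# = n.statistic#
             and n.name like 'redo%'
           order by n.name;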

  38. […] Excessive commits can lead to performance problems via log file sync waits . […]

  39. Massimiliano Lambertini said

    Hello,

    I’ve read your pdf “riyaj_redo_internals_and_tuning_by_redo_reduction_doc.pdf” and found it very interesting, since we are experiencing a similar problem.

    We have tuned almost all of our application’s SQL and we are now working on log switching.

    However, your pdf references a number of sql scripts that can’t be found anymore.

    Are these still available? If yes, could you please provide a link?

    Thanks in advance, Massimiliano

    • Hi
      Thanks for reading. I uploaded a zip file containing all scripts to the blog entry.
      I have not tested them on recent versions, but I do expect them to work. If you find issues, please let me know.
      Thanks..

      • Massimiliano Lambertini said

        Thank you for uploading the scripts; I’ll let you know if I run into problems.

  40. The Singapore Zoo has an open concept, displaying the animals in naturalistic environments.

  41. Ganesh said

    Hi Riyaj

    Great article. We have seen log file parallel write and log file sync waits simultaneously, and most of the sessions were waiting/blocked on committing their records. The database was almost hung and application users were not able to process; everyone started complaining about performance issues. At the same time I could see enq: CF contention waits on an insert query. I know excessive commits cause the log file sync waits and the high CPU utilization.

    Could you suggest a temporary fix to keep the application stable? Is restarting the application server the only option to fix the issue, or is there another way to relieve this kind of hang temporarily without interrupting application users?

    Why am I asking this? Some DBAs like me tried to find the root cause of the hang and finally wanted to kill the blocking sessions or restart the application server. I don’t think that is the right approach to debugging the issue. If it is a production database, clients usually don’t want to change anything in the database, and getting approval takes time. So in this case, what should we do to keep the application running?

    Your suggestions will really help us.

    Thanks
    Ganesh

  42. Deepak said

    Hi,
    We are having a performance issue where we are not able to scale up.
    log file sync is the top event in AWR.

    gv$session_event data are as below:

    EVENT                            TIME_WAITED   TIME_WAITED_MICRO
    direct path write                          0                 358
    Disk file operations I/O                   0                2424
    control file single write                  0                 331
    log file sequential read                   1               10345
    control file sequential read               2               22441
    log file single write                      2               20028
    control file parallel write                4               37532
    events in waitclass Other               5964            59639725
    log file parallel write                42187           421869914
    rdbms ipc message                     756400          7564002913

    The AWR report also confirms this…
    Any help in guiding where to start from would be of great help.

  43. damirvadas said

    >Tanel Poder has written an excellent
    Regarding your report: what parameters were used for the run…I just cannot get output like yours.

  44. Kumar Ramalingam said

    Hi Riyaj, thanks for the article. I have been experiencing log file sync waits on one node in my OLTP environment for the past few weeks. Upon deep investigation, and with an Oracle SR, it pointed to the LGWR behaviour change since 11.2.0.2 where LGWR switches between polling and post/wait. LGWR waited a whopping 1 minute before returning to the user, and this stacked up connections (running up to 150 from the app servers) and completely froze DB write activity for a minute. The parameter _use_adaptive_log_file_sync defaults to TRUE from 11.2.0.3, I think, which caused this.
    I set it to FALSE but did not bounce the DB, because it is a high-transaction environment for us; will do so soon. Still, I wanted to register my experience so that readers of this blog will benefit (a query sketch to check this parameter appears at the end of this comment).

    I have attended your sessions at Hotsos, met you personally, and figured out that your native place is 60 km from mine!! Proud to be associated with you.

    Regards
    Kumar Ramalingam
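
    For anyone who wants to check where this parameter stands on their own system, a commonly used sketch is shown below (underscore parameters require SYS access, and changing them should only be done with Oracle Support’s guidance):

        -- current value and default flag of the hidden parameter (run as SYS)
        select i.ksppinm  parameter,
               v.ksppstvl value,
               v.ksppstdf is_default
          from x$ksppi i, x$ksppcv v
         where i.indx = v.indx
           and i.ksppinm = '_use_adaptive_log_file_sync';

        -- to disable it (dynamic on releases where it is system-modifiable):
        -- alter system set "_use_adaptive_log_file_sync" = false scope=both;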

  45. […] Thank you Riyaj Shamsudeen for your good post on resolving log_file_sync wait events.Here is the article: https://orainternals.wordpress.com/2008/07/07/tuning-log-file-sync-wait-events/ […]

  46. marco.lopez said

    Hi, I found this note. I’m wondering if this applies to 12.2.0.
    Any update on this?
    Thanks in advance.

  47. […] » Visit Now Jul 07, 2008 · In a nutshell, after posting LGWR to write, user or background processes waits for LGWR to signal back with 1 sec timeout. User process charges this wait time as ‘log file sync’ event. In the prior section, ‘log file sync’ waits starts at step 2 after semctl call and completes after step 5 above. Root causes of ‘log file sync’ waits […]
