Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internals & E-Business Suite.

All about RAC and MTU with a video

Posted by Riyaj Shamsudeen on May 22, 2012

Let’s first discuss how RAC traffic works before continuing. The environment for this discussion: a 2-node cluster with an 8K database block size, using the UDP protocol for cache fusion. (BTW, the UDP and RDS protocols are supported on UNIX platforms, whereas Windows uses the TCP protocol.)

UDP protocol, fragmentation, and assembly

The UDP protocol is a higher-level protocol implemented over the IP protocol (UDP/IP). Cache fusion uses UDP to send packets over the wire (Exadata uses the RDS protocol, though).

MTU defines the Maximum Transmission Unit of an IP packet. Consider an example with MTU set to 1500 on a network interface. One 8K block transfer cannot be performed with just one IP packet, as the IP packet size (1500 bytes) is less than 8K. So, one 8K UDP send is fragmented into 6 IP packets and sent over the wire. On the receiving side, those 6 packets are reassembled to create one UDP buffer of size 8K. After assembly, that UDP buffer is delivered to a UDP port of a UNIX process. Usually, a foreground process listens on that port to receive the UDP buffer.

Consider what happens if MTU is set to 9000 on the network interface: an 8K buffer can then be transmitted over the wire with just one IP packet. There is no need for fragmentation or reassembly with MTU=9000 as long as the block (plus protocol headers) fits within the 9000-byte MTU. MTU=9000 is also known as a jumbo frame configuration. (But if the database block size is greater than the jumbo frame size, fragmentation and reassembly are still required. For example, for a 32KB block with MTU=9000, there will be three ~9K IP packets and one ~5K IP packet to transmit.)
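The fragment arithmetic above can be sketched in a few lines (a standalone illustration, not Oracle code; the only assumptions are the standard IPv4 and UDP header sizes):

```python
# Sketch of IPv4 fragmentation arithmetic for one UDP send.
# Header sizes: 20-byte IPv4 header, 8-byte UDP header. Every fragment
# except the last must carry a payload that is a multiple of 8 bytes.

IP_HDR = 20
UDP_HDR = 8

def ip_fragments(udp_payload: int, mtu: int) -> int:
    """Number of IP packets needed to carry one UDP datagram."""
    total = udp_payload + UDP_HDR           # bytes of IP payload to move
    per_frag = ((mtu - IP_HDR) // 8) * 8    # usable bytes per fragment, 8-aligned
    return -(-total // per_frag)            # ceiling division

print(ip_fragments(8192, 1500))    # 8K block over MTU 1500  -> 6 fragments
print(ip_fragments(8192, 9000))    # 8K block over jumbo frames -> 1 packet
print(ip_fragments(32768, 9000))   # 32K block over jumbo frames -> 4 packets
```

This reproduces the counts discussed above: 6 fragments for an 8K block at MTU 1500, one packet with jumbo frames, and 4 packets for a 32K block even with jumbo frames.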

Fragmentation and reassembly are performed at the OS kernel layer, and hence it is the responsibility of the kernel and the stack below it to complete them. Oracle code simply calls the send and receive system calls and passes in the buffers to populate.

A few LMS system calls on the Solaris platform:

0.6178  0.0001 sendmsg(30, 0xFFFFFFFF7FFF7060, 32768)          = 8328
0.6183  0.0004 sendmsg(30, 0xFFFFFFFF7FFFABE0, 32768)          = 8328
0.6187  0.0001 sendmsg(36, 0xFFFFFFFF7FFFBA10, 32768)          = 144
...
0.7241  0.0001 recvmsg(27, 0xFFFFFFFF7FFF9A10, 32768)          = 192
0.7243  0.0001 recvmsg(27, 0xFFFFFFFF7FFF9A10, 32768)          = 192
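The same one-send/one-receive pattern can be sketched with plain sockets (Python here instead of the traced Solaris calls; loopback address and buffer sizes are my own choices). With MTU=1500 on the path, the kernel would fragment and reassemble the 8K datagram transparently; the application still sees a single send and a single receive:

```python
# Minimal UDP send/receive sketch: one 8K datagram, one system call each side.
import socket

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))                  # kernel picks a free port
port = rx.getsockname()[1]

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sent = tx.sendto(b"\xde\xad\xbe\xef" * 2048, ("127.0.0.1", port))  # 8192 bytes

buf, addr = rx.recvfrom(32768)             # one fully reassembled 8K buffer
print(sent, len(buf))                      # 8192 8192
```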

UDP vs TCP

If you talk to a network admin about the use of UDP for cache fusion, usually a few eyebrows will be raised. From a RAC point of view, though, UDP is the right choice over TCP for cache fusion traffic. With TCP/IP, every packet transfer has overhead: a connection needs to be set up, the packet sent, and the process must wait for a TCP acknowledgement before considering the send complete. On a busy RAC system, we are talking about 2-3 milliseconds per packet transfer, and with TCP/IP we probably could not achieve that level of performance. With UDP, a packet transfer is considered complete as soon as the packet is sent, and error handling is done by Oracle code itself. As you know, a reliable network is key to RAC stability; if nearly all packets (close to 100%) are delivered without drops, UDP is a good choice over TCP/IP for performance reasons.

If there are reassembly failures, that is a function of an unreliable network, the kernel, or something else, but it has nothing to do with the choice of the UDP protocol itself. Of course, RDS is better than UDP, as error handling is offloaded to the fabric, but it usually requires an InfiniBand fabric for a proper RDS setup. For that matter, VPN connections use the UDP protocol too.

IP identification?

In a busy system, there will be thousands of IP packets traveling through the interface in a given second, so the interface will receive IP packets belonging to many different UDP buffers. Also, since these Ethernet frames can be delivered in any order, how does the kernel know how to assemble them properly? More precisely, how does the kernel know that 6 IP packets belong to the same UDP buffer, and in what order they go?

Each of these IP packets has an IP identification and a fragment offset. Review the wireshark files uploaded in this blog entry, and you will see that all 6 IP packets have the same IP identification. That ID and the fragment offset are used by the kernel to reassemble the IP packets into the UDP buffer.

Identification: 0x533e (21310)
..
Fragment offset: 0
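As a rough sketch of that bookkeeping, a toy model (not actual kernel code; real reassembly also keys on source/destination address and protocol, and checks offset contiguity) groups fragments by IP ID, sorts them by offset, and completes a datagram only when the fragment with more_fragments=False has arrived:

```python
# Toy fragment reassembly keyed on (IP identification, fragment offset).
def reassemble(fragments):
    """fragments: list of (ip_id, offset, more_fragments, payload) tuples,
    possibly out of order and from interleaved datagrams."""
    by_id = {}
    for ip_id, offset, more, payload in fragments:
        by_id.setdefault(ip_id, []).append((offset, more, payload))
    done = {}
    for ip_id, frags in by_id.items():
        frags.sort()                    # order by fragment offset
        if frags[-1][1]:                # highest-offset fragment says "more"
            continue                    # datagram incomplete; keep waiting
        done[ip_id] = b"".join(p for _, _, p in frags)
    return done

frags = [(0x533e, 1480, True,  b"B" * 1480),   # arrives out of order
         (0x533e, 0,    True,  b"A" * 1480),
         (0x533e, 2960, False, b"C" * 100),
         (0x9999, 0,    True,  b"X" * 1480)]   # incomplete datagram, held back
out = reassemble(frags)
print(sorted(out))            # only ID 0x533e (21310) completed
print(len(out[0x533e]))       # 3060 bytes spliced back together
```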

Reassembly failures

What happens if an IP packet is lost, assuming MTU=1500 bytes?

From the wireshark files with MTU=1500, you will see that each packet has a fragment offset. That fragment offset and the IP identification are used to reassemble the IP packets into the 8K UDP buffer. Think of 6 puzzle pieces, each piece with markings; the kernel uses those markings (offset and IP ID) to reassemble the packets. Now consider the case where one of the 6 packets never arrives: the kernel will keep those 5 IP packets in memory for the duration of the Linux kernel parameter ipfrag_time (30 seconds by default) before declaring a reassembly failure. Without the missing IP packet, the kernel cannot reassemble the UDP buffer, and so a reassembly failure is declared.

The Oracle foreground process will wait for 30 seconds (it used to be 300 seconds or so in older versions of RAC), and if the packet has not arrived within that timeout period, the FG process will declare a ‘gc lost packet’ and re-request the block. Of course, the kernel memory allocated for IP fragmentation and reassembly is constrained by the kernel parameters ipfrag_high_thresh and ipfrag_low_thresh, and values that are too low for these parameters can lead to reassembly failures too (and that’s why it is important to follow all best practices from the RAC installation guides).
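On Linux, these tunables can be inspected with sysctl (parameter names per the kernel’s ip-sysctl documentation; your defaults may differ):

```shell
# Reassembly tunables: how long fragments are held, and the memory thresholds.
sysctl net.ipv4.ipfrag_time          # seconds a fragment waits for its siblings
sysctl net.ipv4.ipfrag_high_thresh   # bytes of fragment memory before drops start
sysctl net.ipv4.ipfrag_low_thresh

# Cumulative reassembly counters, including failures:
netstat -s | grep -i reassembl
```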

BTW, there are a few other reasons for ‘gc lost packets’ too. High CPU usage can also lead to ‘gc lost packets’, as the process may not get enough CPU time to drain its buffers; the network buffers allocated for that process become full, and so the kernel drops incoming packets.
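That drain problem is easy to reproduce outside Oracle. A minimal sketch (my own illustration, with a deliberately tiny receive buffer; Linux socket behavior assumed): if the receiver never reads, its socket receive buffer fills and the kernel silently drops the rest of the burst.

```python
# Demonstrate UDP drops when a "busy" receiver fails to drain its buffer.
import socket

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)  # tiny receive buffer
rx.bind(("127.0.0.1", 0))
rx.setblocking(False)
port = rx.getsockname()[1]

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for _ in range(500):                    # burst while the receiver is not reading
    tx.sendto(b"x" * 1400, ("127.0.0.1", port))

received = 0
try:
    while True:                         # now drain whatever actually got queued
        rx.recvfrom(2048)
        received += 1
except BlockingIOError:
    pass
print(received)   # far fewer than 500: the overflow was dropped, not queued
```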

It is probably better to explain these concepts visually, so I created a video. When you watch it, notice the HD button at the top of the video; play it in HD mode for a better learning experience.

You can get the presentation file from the video here: MTU

Wireshark files explained in the video can be reviewed here:
wireshark_1500mtu
wireshark_9000mtu

BTW, when you review the video, you will see that I had a little trouble identifying the packets in the wireshark output initially. I later understood the reason for not seeing the packets filled with DEADBEEF characters. Why do you think I didn’t see the packets initially?

Also, it looks like the video quality is not that great when embedded. If you want the actual mp4 files, email me; I may be able to upload them to a Dropbox and let you download them.

21 Responses to “All about RAC and MTU with a video”

  1. Rohit N said

    Hi Riyaj,

    The content of the video is amazing … you really have dived deep.

    Although, as you mention, the video quality is not the best. Could you upload the actual HD mp4 files somewhere and kindly share the links??

    Rohit N.

    • Hello Rohit
      Problem is that wordpress doesn’t allow external links to be posted. I am trying to find a workaround.
      If not, I will email you the drop box later..

      Cheers
      Riyaj

  2. Mohammed Yousuf said

    Hi,

    As requested you have posted the details about MTU,

    Great article, and now things are clear

    Thank you so much for the details

    Thanks

    Yousuf,

  3. Mohammed said

    Thanks for the details,

    Just to ask one question, If we go with MTU=9000 then

    Does the MTU need to be modified for all RAC interfaces (private, public, virtual), or only for the private interconnect interface?

    Please suggest.

    Thanks for your help.

    Mohammed.

  4. anonymous said

    Excellent discussion! One small correction is that the TTL field in the IP header isn’t related to the fragment reassembly at all.
    The parameter which controls the duration a specific fragment remains in memory for reassembly is ipfrag_time (net.ipv4.ipfrag_time) and is not exposed, nor manipulatable externally.
    Thanks again for the great video!

  5. […] Riyaj Shamsudeen has a great post about RAC and MTU studded with a video. […]

  6. Tirumal Rao said

    Excellent article and very clear explanation in the video, can you please share the video link if it deployed somewhere.

    Very well explained.

    Cheers
    Tirumal

  7. Mohammad Rasheeduddin said

    Hi Riyaz Shamsuddin,

    Thanks for the excellent discussion in very simple words.

    But I’m not able to see this video …

    Could you please email me the drop box….

    Thanks again…

  8. Hi Riyaj Shamsudeen, thanks for share this information, it helps us a lot!

    Warm regards.
    Alex Escalante

  9. xiaojian Liu said

    Hi,Riyaj:

    Your article gave me a lot of inspiration, and I still have a question about UDP.
    If a UDP packet is dropped on the receiving end due to buffer overflow, can the receiving end request that the sender re-send the UDP packet in RAC?

    I enthusiastically look forward to your reply!

    Best regards,
    Xiaojian Liu

  10. Ganesh said

    Hi Riyaj

    Very good explanation. Thanks

  11. jcnars said

    Hi,
    Doesn’t the TTL=64 mean 64 hops before the packet is discarded? (each router reducing the ttl count by 1?)

    Thanks

  12. jcnars said

    [root@racnroll2 ~]# lsof -i UDP:43737
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    ora_lms0_ 8286 oracle 18u IPv4 244996 0t0 UDP racnroll2-priv.localdomain:43737
    [root@racnroll2 ~]#

    Could help when process name (such as lms) is not known and UDP port is known.
    Thanks for the excellent blog.


  14. Eduardo Capistrán Del Angel said

    Great article.

    Just one question. If I have a 8k block size database but 4k max MTU in my internal switch network…

    Should I gain some performance by moving the database block size from 8k to 4k ?

    • No, most likely it’s not worth the effort. 8K is the most widely used block size, and there is no reason to introduce unknown variables to gain a fraction of a percent of CPU.

      • Raja said

        Hi Riyaj,
        Thank you for your time and response. Your article has cleared up a lot of our doubts. We still have one question: our InfiniBand switch can support a max 4K MTU. In that case, is it advisable to have 9K MTU on the interface end and 4K on the switch end? Please suggest..

      • Hi,
        Keep the MTU the same along that network path, 4k in your case. If you set the interface MTU to 9k and the switch MTU to 4k, the switch and interface will negotiate the MTU down to 4k (path MTU of 4k).

        Ideally, switches have higher mtu. I am surprised that the switch max mtu is 4k. I am guessing it’s not a technical limitation.
        Also, remember that most packets are not full. Depending upon the application profile, 4k is probably just fine.

