Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internal & E-business suite.

What’s in a voting disk?

Posted by Riyaj Shamsudeen on October 29, 2010

Introduction

In RAC, CSSD processes (Cluster Services Synchronization Daemon) monitor the health of RAC nodes employing two distinct heart beats: Network heart beat and Disk heart beat. Healthy nodes will have continuous network and disk heartbeats exchanged between the nodes. Break in heart beat indicates a possible error scenario. There are few different scenarios possible with missing heart beats:

  1. Network heart beat is successful, but disk heart beat is missed.
  2. Disk heart beat is successful, but network heart beat is missed.
  3. Both heart beats failed.

In addition, with numerous nodes, there are other possible scenarios too. Few possible scenarios:

  1. Nodes have split in to N sets of nodes, communicating within the set, but not with members in other set.
  2. Just one node is unhealthy.

Nodes with quorum will maintain active membership of the cluster and other node(s) will be fenced/rebooted. I can’t discuss all possible scenarios in a blog entry, so we will discuss a simplistic 2-node single voting disk alone here.
Voting disks are used to monitor the disk heart beats. It is preferable to have at least 3 voting disks or odd number of voting disks greater than or equal to 3.

CSSD is a multi-threaded process

Voting disks are shared between the nodes and should be visible from all nodes, stating the obvious. CSSD process is a multi-threaded process and a thread of the CSSD process monitors the disk heart beat. The disk HB (Heart Beat) thread is scheduled approximately every second and that thread verifies the disk heart beat from all active nodes in the cluster. Also, another thread of CSSD monitors the network heart beat. Pstack (Solaris) of CSSD process would show the threads of CSSD process.

Details: write calls

CSSD process in each RAC node maintains it heart beat in a block of size 1 OS block, in the voting disk. In Solaris VM that I was testing, OS block size is 512 bytes (We will discuss just Solaris alone in this post). In addition to maintaining its own disk block, CSSD processes also monitors the disk blocks maintained by the CSSD processes running in other cluster nodes.

CSSD process writes a 512 block to the voting disk in a specific offset. The written block has a header area with the node name. For example, in the pwrite call below, node name of solrac1 is in the first few bytes of the block. Third line printed below is keeping track of heart beat counter. This heart beat counter looks similar to the SCN mechanism, “0F 9D02″ is the sequence number for the first write.

Also, Notice that the offset for the pwrite call 0×04002400. Node solrac1 writes a 512 byte block starting at the offset 0×04002400.

"/14:	pwrite(256, 0x019F4A00, 512, 0x04002400)	= 512"
"/14:	   e t o V02\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ n 0F9D02 \ 003\ 0\ 0\ 0"  <-- Counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  F0 xBE L e01EC\ 0 e ;\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 { =BE L1C87A8 L\ 0\ 001\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

Next write call from the local node CSSD process shows an increase in the counter. Value of that counter in line 3 went up from “0F 9D02″ -> “10 9D02″. BTW, I am removing few lines printed with to improve readability. So, the counter is incremented for every heart beat.

"/14:	pwrite(256, 0x019F4A00, 512, 0x04002400)	= 512"
"/14:	   e t o V02\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ n109D02\ 003\ 0\ 0\ 0" <-- counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
...
"/14:	  F1 xBE L N05EC\ 0 f ;\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 { =BE L1C87A8 L\ 0\ 001\ 0"
...
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

Details: read calls

After the successful write, CSSD process also reads the blocks maintained by CSSD processes from other nodes. For example, in the pread output below, CSSD process in the solrac1 node is reading the block maintained by CSSD@solrac2 node using a pread call. Third line in the listing below has a sequence value of “FB 9702″ for the node solrac2. A different sequence number is used by each node.

Also, Notice the offset for the pread call is 0×04002200 and that offset is different between pread and pwrite calls. Essentially, Node solrac2 is writing its heart beat starting at offset 0×04002200 and solrac1 is writing its heart beat at an offset of 0×04002400. The difference between these two offset values are exactly 0×200, which is 512 bytes.

In a nutshell, node Solrac2 maintains the heart beat disk block at an offset of 0×04002200 and the node solrac1 maintains the heart beat in the next 512 byte disk block.

"/14:	pread(256, 0x019F5000, 512, 0x04002200)		= 512"
"/14:	   e t o V01\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 2\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ nFB9702\ 003\ 0\ 0\ 0" <-- counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  F0 xBE L KD6E9\ 0 /\ t\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 m oBE L1C87A8 L\ 0\ 001\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

Next read from the offset 0×04002200 shows that solrac2 also increased the counter from “FB 9702″ to “FC 9702″.

"/14:	pread(256, 0x019F5000, 512, 0x04002200)		= 512"
"/14:	   e t o V01\ 0\ 0\ 00104\ v02\ 0\ 0\ 0\ 0 s o l r a c 2\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0 c JD2\ nFC9702\ 003\ 0\ 0\ 0" <-- counter
"/14:	  \ 00303\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"
...
"/14:	  F1 xBE L 5DAE9\ 0 0\ t\ 0\ 0\ 0\ 0\ 0\ 003\ 0\ 0\ 0 m oBE L1C87A8 L\ 0\ 001\ 0"
...
"/14:	  \ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0"

Summary

In essence, disk heart beat is maintained in the voting disk by the CSSD processes. If the disk block is not updated in a short timeout period, that node is considered unhealthy and may be rebooted depending upon quorum of that node(or Shot in the head) to avoid split brain situation.

As evidenced in this blog, there isn’t really any useful data kept in the voting disk. So, if you lose voting disks, you can simply add them back without losing any data. But, of course, losing voting disks can lead to node reboots. If you lose all voting disks, then you will have to keep the CRS daemons down, then only you can add the voting disks.

This blog entry also begs the question about performance. How many I/O calls are performed against these voting disks? As the number of nodes increases, I/O also increases. For 2 node RAC, there are 2 reads (CSSD also reads another block, not sure why though) and 2 writes per second. With 6 nodes in the cluster, it will be 35 reads and 6 writes per second. From 11g onwards, you could keep voting disks in ASM.

If you wish to read this blog entry in more conventional format read orainternals_voting_disks_internals.pdf

11 Responses to “What’s in a voting disk?”

  1. Greg said

    Hi,
    very informative as always .
    Thank You.
    GregG

  2. bernard polarski said

    Many thanks for sharing this. The hazards of assignments put me away from RAC so articles like this are important to keep in touch with a feature I will certainly meet again.

  3. […] It’s always very informative and exciting to know about internals of RAC Stuff like what actually is maintained in the Voting Disk . Riyaj has it here. […]

  4. Anand said

    Hi Sir,

    I had read this article the day you had posted it, but forgot to ask who to check the pread/pwrite contents of the voting disk?

    Regards,
    Anand

  5. Aman.... said

    Riaz,

    Can you shed some more light on the part where you mentioned that teh nodes are divided into groups?

    Aman….

  6. Yuri Pudovchenko said

    Hi, Ryaj,

    Excellent post !
    but what occurs if

    – Network heart beat is successful, but disk heart beat is missed.
    – Disk heart beat is successful, but network heart beat is missed.

    I long time seek answer for these questions.

    Thanks !

    • Hello Yuri
      Note 294430.1 discusses in detail. I will summarize to answer your question.
      css misscount parameter determines network heartbeat, defaults to 30 seconds. disk timeout is 200 seconds.
      If network heartbeat is missed after a timeout of 30 seconds, reboot is initiated (approximately, it is 34 seconds). It doesn’t matter what happens with disk heartbeat.
      For the disk heartbeat (voting disk): If the heartbeat did not complete in 200 seconds then the node will be rebooted.
      If the disk heartbeat completes under 200 seconds,then the reboot will not happen as long as network heartbeat is successful.
      This is little bit different at cluster reconfiguration time. Also, if there is a vendor clusterware in play, then misscount is set to 600 seconds.

      Cheers
      Riyaj

  7. PC said

    Hi Riyaj,

    You mention that “It is preferable to have at least 3 voting disks or odd number of voting disks greater than or equal to 3.” So is it correct to say that having odd number of VD is NOT mandatory?

    Also, pls help understand as to what would happen if 1 of the 3 VD is lost (but there is no other failure).

    Thanks

  8. […] https://orainternals.wordpress.com/2010/10/29/whats-in-a-voting-disk/ […]

  9. […] is Split Brain in Oracle Clusterware and Real Application Cluster (文档 ID 1425586.1) https://orainternals.wordpress.com/2010/10/29/whats-in-a-voting-disk/ […]

Leave a reply to Log Buffer #205, A Carnival of the Vanities for DBAs | The Pythian Blog Cancel reply