VOTING DISK



Voting Disk is a file that resides on shared storage and must be accessible by all nodes in the cluster. It is used to:

  • Store cluster membership information, such as which RAC instances are members of the cluster.
  • Avoid split-brain syndrome.
  • Decide cluster ownership among the nodes in case of split-brain syndrome.

The voting disk contains and manages node membership information: all nodes in the cluster register their heartbeat information in the voting disk to confirm that they are operational. The CSS (Cluster Synchronization Service) daemon in the clusterware maintains the heartbeat of every node to the voting disk.
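To quickly confirm that CSS is healthy and that the ocssd daemon is alive on a node, you can, for example, run the checks below (assuming a standard Grid Infrastructure installation with crsctl in the path):

$ crsctl check css
$ ps -ef | grep ocssd.bin | grep -v grep

The first command reports the status of the Cluster Synchronization Services; the second simply confirms that the ocssd.bin process is running at the OS level.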

In Oracle Clusterware:

The CSSD process (Cluster Synchronization Services Daemon) monitors the health of RAC nodes using two distinct heartbeats: the network heartbeat and the disk heartbeat. If a node does not send a network heartbeat within <miscount> (time in seconds), the node is evicted from the cluster.

If the disk heartbeat (voting disk) is not updated within <I/O timeout>, the node is evicted from the cluster.

The node eviction is reported as Oracle error ORA-29740 in the alert log and LMON trace files. A node is evicted in the situations above to avoid split-brain syndrome.

Miscount (MC) in Oracle RAC:-

The Cluster Synchronization Service (CSS) in RAC has a miscount parameter. This value represents the maximum time, in seconds, that a network heartbeat can be missed before the cluster enters a reconfiguration to evict the node. The default value is 30 seconds (on Linux, 60 seconds in 10g and 30 seconds in 11g).
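You can check the current values of these timeouts on a running cluster; for example, as the root or grid user:

# crsctl get css misscount
# crsctl get css disktimeout

Changing these parameters (with crsctl set css ...) is normally done only under guidance from Oracle Support.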

Healthy nodes continuously exchange network and disk heartbeats. A break in a heartbeat indicates a possible error scenario:

  1. Network heartbeat is successful, but disk heartbeat is missed.
  2. Disk heartbeat is successful, but network heartbeat is missed.
  3. Both heartbeats fail.

 

Split Brain Syndrome:-

 

In an Oracle RAC environment, all the instances/servers communicate with each other using high-speed interconnects on the private network. This private network interface, or interconnect, is redundant and is used only for inter-instance Oracle data block transfers. With respect to Oracle RAC, split brain occurs when the instance members in a RAC fail to ping/connect to each other via this private interconnect, while the servers are all physically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. Because of the lack of communication, each instance thinks that the other instance it cannot reach is down and that it needs to do something about the situation. The problem is that if we leave these instances running, the same block might be read and updated in each individual instance, causing data integrity issues: blocks changed in one instance will not be locked and could be overwritten by another instance. Oracle has efficiently implemented checks for the split-brain syndrome.
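To see which interface the cluster is actually using as the private interconnect, you can query the clusterware interface configuration, for example:

$ oifcfg getif

The output lists each registered interface with its subnet and role (public or cluster_interconnect).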

I/O Fencing:-

There are situations where leftover write operations from failed database instances (the cluster function failed on the nodes, but the nodes are still running at the OS level) reach the storage system after the recovery process starts. Since these write operations are no longer in the proper serial order, they can damage the consistency of the stored data.

Therefore when a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or disk groups. This methodology is called I/O fencing or failure fencing.

The two main functions of I/O fencing are to prevent updates by failed instances, and to detect failures and prevent split-brain syndrome in the cluster. The cluster volume manager (in association with the shared storage unit) and the cluster file system play a significant role in preventing failed nodes from accessing shared devices.

Oracle uses algorithms common to STONITH (Shoot The Other Node In The Head) implementations to determine which nodes need to be fenced. When a node is alerted that it is being “fenced”, it commits suicide to carry out the order.

 

In the example below, node1 can no longer use the interconnect, although it can still access the voting disk. Nodes 2 and 3 still see each other’s heartbeats but no longer see node1 (in the original diagram, healthy heartbeats are shown as green Vs and missed ones as red Fs). The node with the network problem gets evicted by placing a poison pill into the voting file for node1. CSSD on node1 will then commit suicide and leave the cluster.
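You can watch an eviction like this from one of the surviving nodes by checking cluster membership before and after; for example (the -s status flag is available in recent releases):

$ olsnodes -n -s

This lists each node with its node number and its current status (Active or Inactive).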

There isn’t really any useful data kept in the voting disk, so if you lose voting disks you can simply add them back without losing any data. Of course, losing voting disks can lead to node reboots. If you lose all voting disks, you will have to keep the CRS daemons down; only then can you add the voting disks back.

The voting disk is a means of knowing which nodes are still part of the cluster. If the nodes are not able to satisfy the majority rule, they will keep getting rebooted. If this happens, you have no choice but to attempt recovery of the voting disks.

Example 1:-

Let’s assume that instance A and instance B are unable to communicate with each other. This might be because the other instance is actually down, or because there is some problem with the interconnect communication between the instances.

A split-brain occurs when the instances cannot speak to each other due to an interconnect communication problem, and each instance happens to think that it’s the only surviving instance and starts updating the database. This obviously is problematic, isn’t it?

To avoid split brain, both instances are required to update the voting disk periodically.

Now consider the earlier case from the perspective of one of the nodes:

Case 1) Instance B is actually down:- Looking at it from instance A’s perspective, there is no communication over the interconnect with instance B, and instance B is also not updating the voting disk (because it is down). Instance A can safely assume that instance B is down, and it can go on providing services and updating the database.

Case 2) The problem is with interconnect communication:- Looking at it from instance A’s perspective, there is no communication over the interconnect with instance B. However, it can see that instance B is updating the voting disk, which means instance B is not actually down. The same is true from instance B’s perspective: it can see that instance A is updating the voting disk, but it cannot speak to instance A over the interconnect. At this point, both instances rush to lock the voting disk, and whoever gets the lock ejects the other one, thus avoiding split-brain syndrome.

That’s the theory behind using a voting disk.

Example 2:

The split-brain concept can become more complicated in large RAC setups. For example, say there are 10 RAC nodes in a cluster and 4 of them are not able to communicate with the other 6. Two groups are formed in this 10-node RAC cluster: one group of 4 nodes and another of 6. The nodes then quickly try to affirm their membership by locking the control file; the node that locks the control file checks the votes of the other nodes. The group with the largest number of active nodes gets preference, and the others are evicted.

When a node is evicted, Oracle RAC will usually reboot that node and perform a cluster reconfiguration to bring the evicted node back in.

1.) Obtaining voting disk information –

$ crsctl query css votedisk

2.) Adding Voting Disks

First shut down Oracle Clusterware on all nodes, then use the following commands as the root user.

# crsctl add css votedisk [path of voting disk]

3.) Removing a voting disk:

First shut down Oracle Clusterware on all nodes, then use the following commands as the root user.

# crsctl delete css votedisk [path of voting disk]

Do not use the -force option to add or remove a voting disk while the Oracle Clusterware stack is active, as it can corrupt the cluster configuration. You can use it when the cluster is down to modify the voting disk configuration with either of these commands without interacting with active Oracle Clusterware daemons.
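Note that in 11g Release 2, when the voting files are stored in ASM, they are not added or deleted individually; the whole set can be relocated with a single command, for example (assuming a disk group named +DATA):

# crsctl replace votedisk +DATA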

4.) Backing up Voting Disks

Perform a backup whenever there is a change in the configuration, such as adding/deleting nodes or adding/deleting voting disks.

$ dd if=current_voting_disk of=backup_file_name

If your voting disk is stored on a raw device, specify the device name –

$ dd if=/dev/sdd1 of=/tmp/vd1_.dmp

5.) Recovering Voting Disks

A bad voting disk can be recovered using a backup copy.

$ dd if=backup_file_name of=current_voting_disk
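Continuing the raw-device example above, restoring the earlier backup would look like this (using the same device and backup file names as in the backup step):

$ dd if=/tmp/vd1_.dmp of=/dev/sdd1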

#########################################################

 

An odd number of voting disks is required for proper clusterware configuration.

A node must be able to access strictly more than half of the voting disks at any time.

So, in order to tolerate a failure of n voting disks, there must be at least 2n+1 configured. (n=1 means 3 voting disks).

You can configure up to 31 voting disks, providing protection against 15 simultaneous disk failures.
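As a quick worked example of the majority rule: with 5 voting disks configured (n=2), a node must be able to access at least 3 of them at any time. The cluster can therefore tolerate the loss of any 2 voting disks, but losing 3 would force the node out of the cluster.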
