9. High Availability for PCC

This section describes the recommended high availability setups and what each requires in terms of hardware and software. Other deployment options are available, but consult an expert before selecting one of them.

Software and Network Requirements

The PCC solution manages availability in different ways. [EZ] and [DR] achieve availability by running several instances of each, while [CZ] requires a stand-by machine to fail over to. This chapter describes what is required for high availability and the possible consequences of not having full redundancy.

Network

For full redundancy, the network must be equipped with redundant switches, and every node in the solution must have two network cards. This is required to survive the failure of a switch or a network card.


Couchbase

For information on high availability for a Couchbase cluster, see CB v 6.0 or CB v 5.5.


MySQL Cluster

MySQL Cluster is a distributed shared-nothing database. The database consists of a number of nodes running on commercial off-the-shelf (COTS) hardware. Shared-nothing means that a node does not share CPU, disk, or RAM with any other node. A server failure is therefore isolated, and only the node running on that server is affected. Data can be stored in memory (for real-time access) or on disk (for large data sets). In either case, data is redo-logged and checkpointed to disk. MySQL Cluster supports transactions and indexes (ordered and hash).

Cluster Software

[CZ] requires a backup machine with exactly the same software installed as the primary machine. If the primary [CZ] node goes down, the cluster software triggers a failover to the backup machine, which then takes on the role of [CZ]. Cluster software with the ability to monitor, stop, and start processes is therefore required.
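The failover behavior described above can be sketched as a small heartbeat monitor. This is only an illustration under stated assumptions: the class name FailoverMonitor, the 5-second timeout, and the heartbeat mechanism are hypothetical and do not represent any particular cluster software or the PCC implementation.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Hypothetical heartbeat-based monitor; not part of PCC or any cluster product."""
    timeout_s: float = 5.0        # assumed: silence longer than this means "down"
    active: str = "primary"       # machine that currently holds the [CZ] role
    last_heartbeat: float = 0.0   # time of the last heartbeat that was recorded

    def heartbeat(self, now: float) -> None:
        """Record that the active machine reported in at time `now`."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Fail over to the backup if the primary has been silent too long."""
        if self.active == "primary" and now - self.last_heartbeat > self.timeout_s:
            # A real cluster manager would start the [CZ] processes on the backup here.
            self.active = "backup"
        return self.active


monitor = FailoverMonitor(timeout_s=5.0)
monitor.heartbeat(now=100.0)
print(monitor.check(now=103.0))  # primary is still within the timeout
print(monitor.check(now=106.5))  # primary silent for 6.5 s, so the backup takes over
```

Real cluster software typically also fences the failed primary so that both machines never hold the role at once; that step is left out of this sketch.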

Network Partitioning 

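Network partitioning occurs when a network failure splits a cluster into parts that can no longer communicate with each other. The critical case is a split into two equal parts, where each part has the same number of nodes.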

Typically, each data partition is mirrored for redundancy. Different databases mirror their partitions in different ways. MySQL Cluster uses synchronous replication to keep the mirrored partitions in sync, so each mirror is always guaranteed to hold the same data.

Many other databases use asynchronous replication to mirror the partitions. The inherent problem with asynchronous replication is that data can be lost: if data has been modified on partition A but not yet replicated to partition B when partition A crashes, the modification is gone. The problem with network partitioning is that, after a split, each side can continue to modify its copy of the data, and those modifications are never synchronized with the mirrored copy. The copies then drift apart and the data becomes inconsistent. MySQL Cluster prevents this problem by using an arbitrator.
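The difference between the two replication styles can be illustrated with a toy model. This is a minimal sketch, assuming each mirror is a plain in-memory key-value store; the names Mirror, write_sync, write_async, and ship are hypothetical and do not correspond to any database API.

```python
class Mirror:
    """One copy of a data partition, modeled as a plain key-value store."""

    def __init__(self):
        self.data = {}


def write_sync(primary, replica, key, value):
    """Synchronous: the write is acknowledged only after both copies hold it."""
    primary.data[key] = value
    replica.data[key] = value
    return "ack"


def write_async(primary, pending, key, value):
    """Asynchronous: acknowledge at once and queue the change for the replica."""
    primary.data[key] = value
    pending.append((key, value))
    return "ack"


def ship(pending, replica):
    """Apply queued changes to the replica (runs some time after the ack)."""
    while pending:
        key, value = pending.pop(0)
        replica.data[key] = value


# Asynchronous case: the second write is acknowledged but never shipped.
a, b, queue = Mirror(), Mirror(), []
write_async(a, queue, "balance", 100)
ship(queue, b)
write_async(a, queue, "balance", 250)
# If partition A crashed at this point, the value 250 would exist nowhere else.
stale = a.data["balance"] != b.data["balance"]
print("async replica stale:", stale)

# Synchronous case: both copies agree as soon as the write is acknowledged.
c, d = Mirror(), Mirror()
write_sync(c, d, "balance", 250)
print("sync copies equal:", c.data == d.data)
```

The synchronous variant trades write latency for the guarantee that an acknowledged write exists on both mirrors.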

When the data nodes start up, they elect an arbitrator. If the data nodes lose contact with each other so that two equally sized partitions are created, each partition must ask the arbitrator whether it may continue to operate. A negative reply from the arbitrator forces the partition to shut down. A partition that cannot reach the arbitrator also shuts down.

By default, one management server is elected as the arbitrator. For redundancy it is recommended to have two management servers: if the elected arbitrator fails, then the data nodes will elect the redundant management server to become the arbitrator.

To further minimize the risk of network partitioning, it is recommended to have redundant network switches and to equip each server with two network interface cards (NICs). The two NICs are bonded, with one network cable attached to one switch and the second cable attached to the other switch. In MySQL Cluster, if the cluster is split into two unequal parts, the data nodes make a majority-wins decision without involving the arbitrator.
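The survival rule described in this section (majority wins, and an even split is decided by the arbitrator) can be written as a small decision function. This is a simplified model of the rule as stated here, not MySQL Cluster's actual implementation, which also takes node groups into account. The function name and parameters are hypothetical.

```python
from typing import Optional


def partition_survives(visible_nodes: int, total_nodes: int,
                       arbitrator_reply: Optional[bool]) -> bool:
    """Decide whether one side of a split may keep running.

    visible_nodes:    data nodes this side can still reach, including itself
    total_nodes:      data nodes in the whole cluster
    arbitrator_reply: True or False as answered by the arbitrator,
                      or None if the arbitrator cannot be reached
    """
    if 2 * visible_nodes > total_nodes:
        return True    # clear majority: continue without asking the arbitrator
    if 2 * visible_nodes < total_nodes:
        return False   # clear minority: shut down
    # Even split: only an explicit "yes" from the arbitrator allows survival.
    return arbitrator_reply is True


# Four data nodes split 3 / 1: decided by majority, arbitrator not involved.
print(partition_survives(3, 4, None), partition_survives(1, 4, None))

# Four data nodes split 2 / 2: the arbitrator picks exactly one side.
print(partition_survives(2, 4, True), partition_survives(2, 4, False))

# Even split where the arbitrator is unreachable: this side shuts down.
print(partition_survives(2, 4, None))
```

Because the arbitrator answers yes to only one side of an even split, at most one side keeps accepting writes, which is what prevents the mirrored data from drifting apart.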

Redis

For information on high availability for Amazon ElastiCache, see https://aws.amazon.com/documentation/elasticache/.