
This section describes the MediationZone deployment options, focusing on high availability (HA) and redundancy.

In all deployment options, CZ communicates with EZ, and both CZ and EZ communicate with DZ. This communication is omitted for clarity. Arrows indicate HA-relevant communication, such as replication or active/passive relationships.

Default Deployment

This deployment does not provide HA.

In this deployment, EZ, CZ, and DZ each contain one node. They may be co-located on the same (virtual or physical) machine, or located on separate machines.

This deployment type is suitable for offline/batch applications where short periods of downtime are not mission critical, and it reduces the installation footprint to a minimum.

Backups of the file system must be taken regularly to ensure a failed system can be restored. This is the responsibility of the customer.

Oracle

For default deployments using Oracle as the MZDB, file-system level backup is not recommended, since it can result in a corrupt database after restore. Oracle-specific backup routines, e.g. RMAN or Data Guard, should be used to ensure a consistent MZDB backup in such cases.
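
As an illustration only, a consistent MZDB backup could be scripted around RMAN along the following lines; the RMAN commands, schedule, and retention policy are assumptions that must be adapted to the actual deployment:

  # Minimal sketch: run a consistent MZDB backup with RMAN instead of copying
  # database files from the file system. The RMAN script below is an assumption
  # and must be adapted to the site's backup policy and retention settings.
  import subprocess

  RMAN_SCRIPT = """
  RUN {
    BACKUP DATABASE PLUS ARCHIVELOG;
    DELETE NOPROMPT OBSOLETE;
  }
  """

  def backup_mzdb():
      # 'rman target /' connects to the local database as SYSDBA via OS authentication.
      result = subprocess.run(
          ["rman", "target", "/"],
          input=RMAN_SCRIPT,
          capture_output=True,
          text=True,
      )
      if result.returncode != 0:
          raise RuntimeError("RMAN backup failed:\n" + result.stderr)
      return result.stdout

  if __name__ == "__main__":
      print(backup_mzdb())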

Execution Zone Local High Availability

In this deployment, EZ contains N = 2 or more nodes, all of which process data in an active-active configuration. HA is achieved by horizontal scaling. If a node fails, its load is taken over by a standby instance on another node.

It protects against:

  • Single Component failure in EZ.

If N > 2, it also protects against:

  • Multiple Component failure in EZ. If multiple nodes fail, it is still possible to provide service, although at a limited capacity.

If the nodes are deployed using anti-affinity rules so they run in separate racks, it also protects against:

  • Single Rack failure in EZ.

Dimensioning should ensure that all traffic can be handled in the event of a single node failure. As an example, if 4 EZ nodes are required to handle all traffic, then there should be 5 EZ nodes. This gives a basic redundancy of N+1 for the EZ. For higher redundancy levels, this is extended accordingly to N+2, N+3, etc.
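
As a worked example of this dimensioning rule (the traffic figures below are illustrative assumptions, not measured capacity):

  # Minimal sketch of N+k dimensioning for the EZ; the numbers are illustrative
  # assumptions, not measured capacity figures.
  import math

  def ez_nodes_required(peak_tps, tps_per_node, redundancy=1):
      """Return the number of EZ nodes needed so that peak traffic can still be
      handled after 'redundancy' simultaneous node failures (N+1, N+2, ...)."""
      base_nodes = math.ceil(peak_tps / tps_per_node)   # N nodes to carry all traffic
      return base_nodes + redundancy                    # add k spare nodes

  # Example: 20,000 TPS peak, 5,000 TPS per node -> N = 4, so N+1 = 5 nodes.
  print(ez_nodes_required(peak_tps=20_000, tps_per_node=5_000, redundancy=1))  # 5
  print(ez_nodes_required(peak_tps=20_000, tps_per_node=5_000, redundancy=2))  # 6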

Data Zone Local High Availability

In this deployment, DZ storage is made highly available within a single site by keeping N = 2 or more replicas of all data.

It protects against:

  • Single Component failure in DZ.

If N > 2, it also protects against:

  • Multiple Component failure in DZ. If multiple nodes fail, it is still possible to provide service, although at a limited capacity.

If the nodes are deployed using anti-affinity rules so they run in separate racks, it also protects against:

  • Single Rack failure in DZ.

For block storage, this can be achieved in many different ways:

  • Files can be stored on a detachable block storage volume that provides redundancy, e.g. OpenStack Cinder or Amazon EBS.

  • Files can be stored on a SAN or NAS (e.g. NFS) that provides redundancy.

  • A clustered/distributed file system, e.g. Ceph, GlusterFS, or Amazon EFS, can be used for files.

While not the focus of this document, note that the I/O performance of the above storage types can vary significantly. This should be considered for each deployment.

Derby

For deployments using Derby as the MZDB, NFS storage is not supported. A SAN or a clustered file system should be used instead. Using DRBD is another option.

Oracle

For deployments using Oracle as MZDB, file-system level replication is not recommended since it can result in a corrupt database after failover. Oracle-specific HA mechanisms, such as Data Guard and/or Oracle RAC, should be used to ensure consistent MZDB replication in such cases.

Couchbase

For deployments using Couchbase persistent storage, local HA is achieved using three or more instances with a replica count of one or more (that is, one active copy plus at least one replica). The replicas should not share storage, and the instances should run on different hosts.
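
As a sketch only, the replica count could for example be set when creating a bucket through the Couchbase REST API; host names, credentials, and bucket parameters below are assumptions and must be adapted to the installed Couchbase version:

  # Minimal sketch: create a Couchbase bucket with one replica via the cluster
  # REST API. Host, credentials, bucket name, and sizing are illustrative assumptions.
  import requests

  COUCHBASE_ADMIN = "http://cb-node1.example.com:8091"   # hypothetical node address
  AUTH = ("Administrator", "password")                   # hypothetical credentials

  # With three or more nodes, the replica copies are placed on other nodes,
  # so a single node failure does not lose data.
  resp = requests.post(
      f"{COUCHBASE_ADMIN}/pools/default/buckets",
      auth=AUTH,
      data={
          "name": "mz_sessions",      # hypothetical bucket name
          "bucketType": "couchbase",
          "ramQuotaMB": 1024,
          "replicaNumber": 1,         # 1 active copy + 1 replica
      },
  )
  resp.raise_for_status()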

Kafka

For deployments using Kafka persistent storage, local HA is achieved using three or more brokers and a replication factor of two or more. The replicas should not share storage, and the brokers should run on different hosts.
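
As a sketch only, such a topic could be created with the kafka-python admin client; broker addresses, topic name, and partition count below are illustrative assumptions:

  # Minimal sketch: create a topic with replication factor 3 (three copies in total)
  # and require at least 2 in-sync replicas, so a single broker failure does not
  # stop writes. Broker list, topic name, and partition count are assumptions.
  from kafka.admin import KafkaAdminClient, NewTopic

  admin = KafkaAdminClient(
      bootstrap_servers=["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  )

  topic = NewTopic(
      name="mz_events",
      num_partitions=12,
      replication_factor=3,
      topic_configs={"min.insync.replicas": "2"},
  )

  admin.create_topics([topic])
  admin.close()

With min.insync.replicas set to 2 and producers using acks=all, writes remain possible after the loss of a single broker.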

Control Zone Local High Availability

In this deployment, CZ contains N = 2 nodes, in an active-standby configuration.

It protects against:

  • Single Component Failure in CZ. If the active node fails, a failover procedure to the standby node should automatically be initiated by external cluster management software.

If the nodes are deployed using anti-affinity rules so they run in separate racks, it also protects against:

  • Single Rack failure in CZ.

Using this deployment requires DZ to be deployed with at least local high availability.
After a failover, the standby CZ instance will become the new active instance, initializing from DZ.
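
The cluster management software itself is not part of MediationZone. The following is a purely hypothetical sketch of the watchdog logic such software implements; the health-check URL, host names, and failover command are invented for illustration:

  # Purely hypothetical sketch of an external active/standby watchdog for the CZ.
  # The health-check endpoint, host names, and failover command are invented for
  # illustration; a real deployment would use dedicated cluster management software.
  import subprocess
  import time
  import urllib.request

  ACTIVE_CZ = "http://cz-active.example.com:9000/health"   # hypothetical health URL
  FAILOVER_CMD = ["/opt/cluster/bin/promote-standby-cz"]   # hypothetical script

  def cz_is_healthy(url, timeout=2):
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              return resp.status == 200
      except OSError:
          return False

  def watchdog(max_failures=3, interval=5):
      failures = 0
      while True:
          if cz_is_healthy(ACTIVE_CZ):
              failures = 0
          else:
              failures += 1
              if failures >= max_failures:
                  # Promote the standby CZ; it initializes its state from DZ.
                  subprocess.run(FAILOVER_CMD, check=True)
                  return
          time.sleep(interval)

  if __name__ == "__main__":
      watchdog()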

Execution Zone Multiple-Site High Availability

This deployment is similar to Execution Zone Local High Availability. The difference is that anti-affinity rules are used to ensure that EZ nodes run on N = 2 or more sites (data centers, availability zones).

It protects against:

  • Single Site failure in EZ

  • Hypervisor failure in EZ. Even if an entire site’s Hypervisor fails, the other sites’ Hypervisors remain functional.

If N > 2, it also protects against:

  • Multiple Site failure in EZ. If multiple sites fail, it is still possible to provide service, although at a limited capacity.

All other considerations for Execution Zone Local High Availability apply.

For dimensioning, each site should be able to handle all normal traffic on its own in case of a Single Site failure. With 2 sites, for example, this gives N+N redundancy during normal operation.
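
As a worked example of this site-level dimensioning (the traffic figures are illustrative assumptions):

  # Minimal sketch of site-level dimensioning: every site must be able to carry
  # the full normal load alone. Figures are illustrative assumptions.
  import math

  def nodes_per_site(peak_tps, tps_per_node):
      return math.ceil(peak_tps / tps_per_node)

  sites = 2
  per_site = nodes_per_site(peak_tps=20_000, tps_per_node=5_000)  # 4 nodes per site
  total = sites * per_site                                        # 8 nodes = N+N with 2 sites
  print(per_site, total)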

Note that this deployment may introduce latency in inter-site communication. Whether this is acceptable or not depends on the specific use case. For example, batch FTP file collection is not very sensitive to latency, while synchronous JDBC lookups in a Usage Management scenario are quite sensitive to latency exceeding a few milliseconds. This must be taken into consideration when designing a solution.

For real-time scenarios with synchronous inter-site communication, the following general guidelines apply:

  • Connectivity between the sites runs over dedicated fiber.

  • Connectivity between the sites is fully redundant.

  • Connectivity between the sites is Layer 2, without any L3 routing.

  • Connectivity between the sites has a round-trip latency of no more than 3 ms.

  • Connectivity between the sites has a bandwidth of at least 1 Gbps.
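
As a rough sanity check of the round-trip latency guideline, the following sketch measures TCP connect times towards a host at the other site; the peer address is an assumption, and a real assessment should use dedicated network measurement tools:

  # Rough sketch: measure TCP connect round-trip time from one site to a host at
  # the other site and compare it to the 3 ms guideline. The peer address is an
  # illustrative assumption; this is not a substitute for proper network testing.
  import socket
  import statistics
  import time

  PEER = ("ez-node.site2.example.com", 22)   # hypothetical host/port at the other site
  SAMPLES = 20
  THRESHOLD_MS = 3.0

  def connect_rtt_ms(addr, timeout=1.0):
      start = time.perf_counter()
      with socket.create_connection(addr, timeout=timeout):
          pass
      return (time.perf_counter() - start) * 1000.0

  rtts = [connect_rtt_ms(PEER) for _ in range(SAMPLES)]
  median = statistics.median(rtts)
  print(f"median RTT: {median:.2f} ms")
  print("within guideline" if median <= THRESHOLD_MS else "exceeds 3 ms guideline")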

Data Zone Multiple-Site High Availability

This deployment is similar to Data Zone Local High Availability. The difference is that anti-affinity rules are used to ensure that DZ nodes run on N = 2 or more different sites (data centers, availability zones). Each site runs one or more DZ replicas.

It protects against:

  • Single Site failure in DZ

  • Hypervisor failure in DZ. Even if an entire site’s Hypervisor fails, the other sites’ Hypervisors remain functional.

If N > 2, it also protects against:

  • Multiple Site failure in DZ. If multiple sites fail, it is still possible to provide service, although at a limited capacity.

All other considerations for Data Zone Local High Availability apply. It is entirely feasible to combine this deployment with Data Zone Local High Availability, having N = 2 replicas per site.

For dimensioning, each site should be able to handle normal traffic in case of a Single Site failure.

The same latency considerations as for Execution Zone Multiple-Site High Availability apply. In general, it is recommended that EZ nodes from one site do not directly access DZ nodes at another site, except for data replication purposes.

Depending on the latency between the sites, replication should be carefully chosen:

  • Low inter-site latency, <= 3 ms: Use synchronous replication for “no data loss”-type data. Use asynchronous replication for all other data.

  • Higher inter-site latency, > 3 ms: Use asynchronous data replication.

When asynchronous replication is used, there is always a risk of minor data loss when a failover between sites is executed. This depends very much on the use case and must be analyzed per individual deployment.
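
Expressed as a simple decision rule (the data classification and latency values are illustrative assumptions):

  # Minimal sketch of the replication-mode decision described above. The data
  # classification and latency figures are illustrative assumptions.
  def replication_mode(inter_site_rtt_ms, no_data_loss_required):
      """Choose synchronous replication only for 'no data loss' data when the
      inter-site round-trip latency is within the 3 ms guideline."""
      if inter_site_rtt_ms <= 3.0 and no_data_loss_required:
          return "synchronous"
      return "asynchronous"

  print(replication_mode(2.1, no_data_loss_required=True))    # synchronous
  print(replication_mode(2.1, no_data_loss_required=False))   # asynchronous
  print(replication_mode(8.5, no_data_loss_required=True))    # asynchronous (data loss risk on failover)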

Couchbase

For Couchbase inter-site replication, XDCR (Cross Data Center Replication) should be used.
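
As a sketch only, an XDCR replication could be set up through the Couchbase REST API as follows; cluster addresses, credentials, and bucket names are assumptions:

  # Minimal sketch: define a remote cluster reference and start a continuous XDCR
  # replication between two sites via the Couchbase REST API. All addresses,
  # credentials, and bucket names are illustrative assumptions.
  import requests

  SITE_A = "http://cb-site-a.example.com:8091"
  SITE_B_HOST = "cb-site-b.example.com:8091"
  AUTH = ("Administrator", "password")            # hypothetical credentials

  # 1. Register site B as a remote cluster on site A.
  remote = requests.post(
      f"{SITE_A}/pools/default/remoteClusters",
      auth=AUTH,
      data={
          "name": "site-b",
          "hostname": SITE_B_HOST,
          "username": "Administrator",
          "password": "password",
      },
  )
  remote.raise_for_status()

  # 2. Start a continuous replication of the session bucket to site B.
  replication = requests.post(
      f"{SITE_A}/controller/createReplication",
      auth=AUTH,
      data={
          "fromBucket": "mz_sessions",
          "toCluster": "site-b",
          "toBucket": "mz_sessions",
          "replicationType": "continuous",
      },
  )
  replication.raise_for_status()

Note that XDCR is asynchronous, so the data loss considerations above apply to inter-site failover.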

Control Zone Multiple-Site High Availability

This deployment is similar to Control Zone Local High Availability. The difference is that anti-affinity rules are used to ensure that the CZ nodes run on N = 2 sites (data centers, availability zones). Each site runs one or more DZ replicas.

It protects against:

  • Single Site failure in CZ

All other considerations for Control Zone Local High Availability apply.

The same latency considerations as for Execution Zone Multiple-Site High Availability apply. In general, it is recommended that CZ nodes from one site do not directly access DZ nodes at another site, except for data replication purposes.

Variant: Independent Control Zones

A variant of the above scenario is to use two independent CZs that share parts of the DZ, specifically Couchbase. Couchbase replication is used to ensure HA for this shared instance. No other parts of the DZ, e.g. the MZDB and the CZ file systems, are shared.

This gives zero-time failover for Diameter use cases, while session state replication allows failover from one site to the other. A requirement for this is that sticky sessions are used when accessing DZ.

Regional High Availability

This deployment is nearly identical to Multiple-Site High Availability for all zones, except that the sites are located in different regions, i.e. geographically separate. It is often referred to as “geographic disaster recovery”. This deployment type can be seen as an addition to a local HA deployment (one live site and one disaster recovery site), or a multi-site HA deployment (two or more live sites and one disaster recovery site).


The inter-site latency in this case is assumed to be > 3 ms, making synchronous replication infeasible.

It is not recommended that EZ/DZ nodes in one region directly contact CZ in another region.

It protects against:

  • Regional failure in all zone types
