Managing Akka Cluster Failure

A fundamental issue with distributed systems is that a machine failure is difficult to distinguish from a network failure. Nodes in the system can detect that they cannot communicate with other nodes, but they have no way to determine whether it is safe to remove these unreachable nodes from the cluster. If there is a network split, the nodes on the other side are still active and, in turn, see the first nodes as unreachable. If nodes were allowed to remove any node that they observe as unreachable, the result would be multiple independent clusters, which is referred to as a split-brain scenario. For further information, see http://akka.io/docs/.

In a normal situation, where the cluster has no unreachable nodes, you can start and stop Service Context processes that are tagged with the natures given in an Akka cluster configuration. The started and stopped Service Context processes can then safely add nodes to and remove nodes from the cluster, since all the nodes are reachable and consensus can be reached. When the cluster contains unreachable nodes, however, nodes cannot be added and removed simply by starting and stopping Service Context processes. The cluster must instead be restored by removing nodes until all the remaining nodes can communicate with each other again.

If you are using Conditional Trace or System Insight, Akka cluster clients are created on the relevant ECs. In the case of a 'catastrophic communication failure' between the client and the Akka cluster, the EC client will quarantine the SC cluster. If this occurs, you must restart the entire SC cluster according to the relevant example scenario described below. For further information, see Lifecycle and Failure Recovery Model in https://doc.akka.io/docs/akka/2.4.12/scala/remoting.html.

To resolve these situations, use the command line interface provided in the system to manage Akka clusters. With the mzsh akka command, you can restore a cluster that has unreachable nodes which cannot recover from failure. See the example scenarios below.

Example Scenario - Restoring a cluster by removing a node

In this scenario, the Akka cluster is running on three nodes: sc1, sc2, and sc3, and sc3 cannot be reached by sc1 and sc2.


Example - A cluster with an unreachable node

You detect that sc3 is unreachable using the mzsh akka command:

$mzsh akka cluster-status <akka cluster> <sc in regex>

Example - Checking the status of a cluster

You check the status of the cluster, named cluster in this case, and check on all of the nodes in the cluster using regex ".*":

$mzsh akka cluster-status cluster ".*"

The output lists all of the cluster members that can be reached and their status according to each node. In this case two of the nodes indicate that one of the nodes is unreachable, and both provide the same address for the unreachable node, "akka.tcp://cluster@172.17.0.1:5601".

{
  "sc1": {
    "leader": "akka.tcp://cluster@172.17.0.1:5501",
    "members": [{
      "address": "akka.tcp://cluster@172.17.0.1:5501",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5551",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5601",
      "roles": ["default"]
    }],
    "unreachable": [{
      "address": "akka.tcp://cluster@172.17.0.1:5601",
      "roles": ["default"]
    }],
    "roles": [{
      "name": "default",
      "leader": "akka.tcp://cluster@172.17.0.1:5501"
    }]
  },
  "sc2": {
    "leader": "akka.tcp://cluster@172.17.0.1:5501",
    "members": [{
      "address": "akka.tcp://cluster@172.17.0.1:5501",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5551",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5601",
      "roles": ["default"]
    }],
    "unreachable": [{
      "address": "akka.tcp://cluster@172.17.0.1:5601",
      "roles": ["default"]
    }],
    "roles": [{
      "name": "default",
      "leader": "akka.tcp://cluster@172.17.0.1:5501"
    }]
  }
}
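Because the cluster-status output is plain JSON, it can also be consumed programmatically, for example from a monitoring script. The following is a minimal sketch in Python (illustrative only, not part of mzsh; the function name and the abbreviated sample data are our own) that maps each responding node to the addresses it reports as unreachable:

```python
import json

def unreachable_by_node(status_json):
    """Map each responding node to the set of addresses it reports as unreachable."""
    status = json.loads(status_json)
    return {
        node: {member["address"] for member in view.get("unreachable", [])}
        for node, view in status.items()
    }

# Abbreviated cluster-status output from the example above:
sample = """{
  "sc1": {"unreachable": [{"address": "akka.tcp://cluster@172.17.0.1:5601",
                           "roles": ["default"]}]},
  "sc2": {"unreachable": [{"address": "akka.tcp://cluster@172.17.0.1:5601",
                           "roles": ["default"]}]}
}"""

# Both sc1 and sc2 agree that the node at port 5601 (sc3) is unreachable.
print(unreachable_by_node(sample))
```

When all nodes agree on the same unreachable address, as here, it is a strong hint that the node itself (rather than several independent links) has failed.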

At the same time, sc3 considers sc1 and sc2 unreachable:

Example - Checking the status of an Akka cluster from an unreachable node

If you check the cluster status from the node which is unreachable:

$mzsh akka cluster-status cluster sc3

The other two nodes are shown as unreachable:

{
  "sc3": {
    "leader": "akka.tcp://cluster@172.17.0.1:5501",
    "members": [{
      "address": "akka.tcp://cluster@172.17.0.1:5501",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5551",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5601",
      "roles": ["default"]
    }],
    "unreachable": [{
      "address": "akka.tcp://cluster@172.17.0.1:5501",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5551",
      "roles": ["default"]
    }],
    "roles": [{
      "name": "default",
      "leader": "akka.tcp://cluster@172.17.0.1:5501"
    }]
  }
}


To resolve this, you restore one side of the network split so that the cluster no longer has any unreachable nodes. In this example, we remove sc3 from the cluster. To do this, you must first ensure that the Service Context process does not fail if you shut down sc3.
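A common rule of thumb when choosing which side of a split to keep is to keep the majority side. As a minimal, illustrative sketch in Python (not part of mzsh, and it only counts the views of the nodes that actually responded), this selects the addresses that a majority of responding nodes report as unreachable:

```python
from collections import Counter

def majority_unreachable(status):
    """Return the addresses that a majority of the responding nodes
    report as unreachable - candidates for 'mzsh akka down'."""
    votes = Counter()
    for view in status.values():
        for member in view.get("unreachable", []):
            votes[member["address"]] += 1
    quorum = len(status) // 2 + 1  # majority of responding views
    return {addr for addr, count in votes.items() if count >= quorum}

# Views of sc1 and sc2, abbreviated from the cluster-status output above:
status = {
    "sc1": {"unreachable": [{"address": "akka.tcp://cluster@172.17.0.1:5601"}]},
    "sc2": {"unreachable": [{"address": "akka.tcp://cluster@172.17.0.1:5601"}]},
}
print(majority_unreachable(status))  # the address of sc3
```

Here two of the three nodes (sc1 and sc2) agree that only sc3 is unreachable, so sc3 is the node to remove.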

When this is established, you can safely tell one of the other nodes in the cluster to remove sc3 using the mzsh akka command:

$mzsh akka down <akka cluster> <sc in regex> <node address>

Example - Telling a node to remove another node from the cluster

For this example, you run the following from one of the reachable nodes to remove sc3:

$mzsh akka down cluster sc1 akka.tcp://cluster@172.17.0.1:5601

The output is:

{
  "sc1": "OK"
}
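If you script this step, you may want to verify the response before proceeding. A minimal sketch in Python (illustrative only, not part of mzsh):

```python
import json

def down_succeeded(response_json):
    """True if at least one node handled the 'akka down' request and all replied OK."""
    response = json.loads(response_json)
    return bool(response) and all(reply == "OK" for reply in response.values())

print(down_succeeded('{"sc1": "OK"}'))  # True
```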

By running the mzsh akka command to check the cluster status again, you can see that sc3 is no longer part of the cluster:

$mzsh akka cluster-status <akka cluster> <sc in regex>

Example - Checking the status of an Akka cluster

By running this command:

$mzsh akka cluster-status cluster ".*"

You can see that sc3 is no longer included in the cluster, and that there are no unreachable nodes:

{
  "sc1": {
    "leader": "akka.tcp://cluster@172.17.0.1:5501",
    "members": [{
      "address": "akka.tcp://cluster@172.17.0.1:5501",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5551",
      "roles": ["default"]
    }],
    "unreachable": [],
    "roles": [{
      "name": "default",
      "leader": "akka.tcp://cluster@172.17.0.1:5501"
    }]
  },
  "sc2": {
    "leader": "akka.tcp://cluster@172.17.0.1:5501",
    "members": [{
      "address": "akka.tcp://cluster@172.17.0.1:5501",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5551",
      "roles": ["default"]
    }],
    "unreachable": [],
    "roles": [{
      "name": "default",
      "leader": "akka.tcp://cluster@172.17.0.1:5501"
    }]
  }
}


You now have a 2-node cluster in which all nodes are reachable.

Example - A 2-node cluster after removal of unreachable node

To have a 3-node cluster again, you can start up the Service Context process on healthy hardware.

Example - Restore 3-node cluster

After starting up the SC, you can use the mzsh akka command to confirm that all of the nodes are reachable.

Example - Checking the status of an Akka cluster

You can check the cluster status after adding the third node again:

$mzsh akka cluster-status cluster ".*"

The output indicates that all of the nodes are reachable:

{
  "sc1": {
    "leader": "akka.tcp://cluster@172.17.0.1:5501",
    "members": [{
      "address": "akka.tcp://cluster@172.17.0.1:5501",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5551",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5601",
      "roles": ["default"]
    }],
    "unreachable": [],
    "roles": [{
      "name": "default",
      "leader": "akka.tcp://cluster@172.17.0.1:5501"
    }]
  },
  "sc3": {
    "leader": "akka.tcp://cluster@172.17.0.1:5501",
    "members": [{
      "address": "akka.tcp://cluster@172.17.0.1:5501",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5551",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5601",
      "roles": ["default"]
    }],
    "unreachable": [],
    "roles": [{
      "name": "default",
      "leader": "akka.tcp://cluster@172.17.0.1:5501"
    }]
  },
  "sc2": {
    "leader": "akka.tcp://cluster@172.17.0.1:5501",
    "members": [{
      "address": "akka.tcp://cluster@172.17.0.1:5501",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5551",
      "roles": ["default"]
    }, {
      "address": "akka.tcp://cluster@172.17.0.1:5601",
      "roles": ["default"]
    }],
    "unreachable": [],
    "roles": [{
      "name": "default",
      "leader": "akka.tcp://cluster@172.17.0.1:5501"
    }]
  }
}

Example Scenario - Restoring a cluster with one node

In this scenario, the Akka cluster is running on a single node, sc4, which fails.

You detect that the SC is unreachable using the mzsh akka command:

$mzsh akka cluster-status <akka cluster> <sc in regex>

Example - Checking the status of a cluster

You check the status of the cluster, named cluster-service in this case, and check on the node, sc4:

$mzsh akka cluster-status cluster-service sc4

The output shows that no nodes are running:

{
}

When you have added a new node, you restart the SC and start the service again:

$mzsh startup <sc>
$mzsh service start

You can then check the status of the cluster again:

$mzsh akka cluster-status <akka cluster> <sc>

Example - Checking the status of an Akka cluster

You check the status of the cluster again:

$mzsh akka cluster-status cluster-service ".*"

{
  "sc4": {
    "leader": "akka.tcp://cluster-service@172.17.0.1:5801",
    "members": [{
      "address": "akka.tcp://cluster-service@172.17.0.1:5801",
      "roles": ["default"]
    }],
    "unreachable": [],
    "roles": [{
      "name": "default",
      "leader": "akka.tcp://cluster-service@172.17.0.1:5801"
    }]
  }
}


For further information on the mzsh akka command, see akka.