Provide fault tolerance with redundancy zones


Enterprise Only

The redundancy zone functionality demonstrated here requires HashiCorp Cloud Platform (HCP) or self-managed Consul Enterprise. If you've purchased or wish to try out Consul Enterprise, refer to how to access Consul Enterprise.

This tutorial demonstrates how you can improve your Consul datacenter's fault resiliency by using redundancy zones.

These instructions demonstrate Consul's autopilot features, which make it possible to run one voter alongside any number of non-voters in each defined redundancy zone.

During this tutorial, you will deploy two servers, one voter and one non-voter, in each of three redundancy zones, for a total of six servers. To simplify this tutorial, we refer to these groups of servers with the following names:

group 1 is the servers that are voters in the initial deployment state.
group 2 is the servers that are non-voters in the initial deployment state.

After a server in group 1 fails, autopilot promotes a non-voter from the same zone to voter status automatically. As a result, Consul servers can continue operating without an effect on server quorum. For more information about Consul server redundancy and quorum, refer to the Consul reference architecture.
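
To see why only voters matter for quorum, the following shell sketch computes the Raft majority for a given number of voters. The quorum and failure_tolerance function names are illustrative and are not part of the Consul CLI; the arithmetic assumes the standard majority rule of floor(N/2) + 1.

```shell
# quorum N: smallest majority among N voting servers (floor(N/2) + 1).
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# failure_tolerance N: how many of N voters can fail while a majority remains.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# This tutorial runs six servers, but only three of them vote.
echo "voters=3 quorum=$(quorum 3) tolerance=$(failure_tolerance 3)"
```

With three voters the quorum is two, so the datacenter tolerates the loss of one voter. The three non-voters do not change this arithmetic; they stand by to replace a failed voter.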

The following diagrams show the Consul architecture and its changes across the course of this tutorial:

The architecture diagram of the scenario. This shows the six Consul server nodes in the cluster.

Prerequisites

The tutorial assumes that you are familiar with Consul and its core functionality. If you are new to Consul, refer to the Consul Getting Started tutorials collection.

To complete this tutorial, you need the following software:

Consul Enterprise with a license
An AWS account configured for use with Terraform
git >= 2.0
aws-cli >= 2.0
terraform >= 1.4
jq >= 1.6

Clone GitHub repository

Clone the GitHub repository containing the configuration files and resources.

$ git clone https://github.com/hashicorp-education/learn-consul-redundancy-zones

Change into the directory that contains the complete configuration files for this tutorial.

$ cd learn-consul-redundancy-zones

This repository contains Terraform configurations to spin up the initial infrastructure, as well as files to automatically configure and deploy Consul.

This tutorial's repository contains the following items:

instance-scripts/ directory contains the bash scripts used to bootstrap and join the Consul servers running on EC2 instances
provisioning/ directory contains Consul agent configuration file templates
consul-instances.tf defines the EC2 instances the Consul servers run on
outputs.tf defines Terraform outputs you use to authenticate and connect to your EC2 instances
providers.tf contains provider definitions for Terraform
variables.tf defines variables you can use to customize the tutorial
vpc.tf defines the AWS VPC resources

The Terraform files provision the following billable AWS resources:

An AWS VPC
An AWS key pair
An AWS EC2 instance group running Consul server agent

Set up the Consul license

Redundancy zones are a Consul Enterprise feature, meaning that servers require an Enterprise license key. If you do not have a Consul Enterprise license, you can register for a 30-day trial license.

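To start the tutorial, place your Consul Enterprise license file in the repository directory before you deploy the infrastructure. The Terraform file consul-instances.tf is configured to upload the license on your behalf. Ensure the filename is consul.hclic. The following command creates the file if it does not exist yet; the file must contain your license key before you run Terraform.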

$ touch consul.hclic

Deploy your infrastructure

Initialize your Terraform configuration to download the necessary providers and modules.

$ terraform init

Initializing the backend...

Initializing provider plugins...
##...

Terraform has been successfully initialized!
##...

Then create the infrastructure. When prompted, enter yes to confirm the run.

Note

This tutorial targets AWS region us-east-2 as its default. If you want to deploy to another region, modify the terraform.tfvars file accordingly.

$ terraform apply

##...

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

It takes a few minutes to deploy your infrastructure. After the deploy completes, it returns a list of outputs you need to complete the tutorial.

Apply complete! Resources: 25 added, 0 changed, 0 destroyed.

Outputs:

consul_group1_ips = [
  "3.76.213.176",
  "18.153.69.68",
  "3.72.40.212",
]
consul_group2_ips = [
  "3.79.240.36",
  "18.199.93.74",
  "3.121.185.195",
]
consul_token = <sensitive>
next_steps = [
  "You can now add the TLS certificate for accessing your EC2 instances by running:",
  "ssh-add ./tls-key.pem",
]

After Terraform deploys the infrastructure for this tutorial, you need to set up SSH access to the EC2 instances.

In order to log on to the instances, configure your SSH key manager agent to use the correct SSH key identity file.

$ ssh-add tls-key.pem
Identity added: tls-key.pem (tls-key.pem)

To make it easier to run remote commands on the instances, save the IP addresses of the Consul server nodes into a set of environment variables with the following command.

$ export GROUP1_SERVER0=ubuntu@$(terraform output -json 'consul_group1_ips' | jq -r '.[0]') && \
  export GROUP1_SERVER1=ubuntu@$(terraform output -json 'consul_group1_ips' | jq -r '.[1]') && \
  export GROUP1_SERVER2=ubuntu@$(terraform output -json 'consul_group1_ips' | jq -r '.[2]') && \
  export GROUP2_SERVER0=ubuntu@$(terraform output -json 'consul_group2_ips' | jq -r '.[0]') && \
  export GROUP2_SERVER1=ubuntu@$(terraform output -json 'consul_group2_ips' | jq -r '.[1]') && \
  export GROUP2_SERVER2=ubuntu@$(terraform output -json 'consul_group2_ips' | jq -r '.[2]')

Review Terraform configuration for server instances

Open consul-instances.tf. This Terraform configuration creates the following:

a TLS key pair that you can use to log in to the server instances
a couple of AWS IAM policies for the instances so they can use Consul cloud join
two groups of EC2 instances that run Consul as servers

Each EC2 instance uses the provisioning script instance-scripts/setup.sh, which the cloud-init subsystem executes to automate the Consul agent configuration and provisioning. This script installs the Consul agent package on the instance and sets up its Consul configuration file. Terraform generates this configuration file automatically for each Consul server instance.

Inspect the consul-server-group1 resource in the consul-instances.tf file. The following output is trimmed for brevity.

resource "aws_instance" "consul-server-group1" {
  count                       = 3
  ami                         = data.aws_ami.ubuntu.id
  instance_type               = "t3.micro"
  iam_instance_profile        = aws_iam_instance_profile.profile_manage_instances.name

  ##...

    setup = base64gzip(templatefile("${path.module}/instance-scripts/setup.sh", {
      hostname = "consul-group1-server${count.index}",
      consul_license = base64encode(file("${path.module}/consul.hclic")),
      consul_ca = base64encode(tls_self_signed_cert.consul_ca_cert.cert_pem),
      consul_config = base64encode(templatefile("${path.module}/provisioning/templates/consul-server.json", {
        count = count.index,
        datacenter = var.datacenter,
        token = random_uuid.consul_bootstrap_token.result,
        retry_join = "provider=aws tag_key=learn-consul-redundancy-zones tag_value=join",
      })),
      consul_acl_token = random_uuid.consul_bootstrap_token.result,
      consul_version   = var.consul_version,
      vpc_cidr    = module.vpc.vpc_cidr_block,
    })),
  })

  tags = {
    Name = "consul-group1-server${count.index}"
    learn-consul-redundancy-zones = "join"
  }
}

On line 2, the count directive causes Terraform to deploy three instances of this resource. Each instance's hostname is dynamically generated on line 10 by appending the instance number to the end of the consul-group1-server string.

Line 13 generates the Consul configuration file. It uses the template in provisioning/templates/consul-server.json and passes a few variables to it to enable the Consul cloud autojoin feature. Here is an example of the generated configuration for the consul-group1-server0 node. You will review this configuration in the next section of this tutorial.

consul-group1-server0 configuration file

{
  "acl": {
    "enabled": true,
    "down_policy": "async-cache",
    "default_policy": "deny",
    "tokens": {
      "agent": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "default": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "initial_management": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    }
  },
  "datacenter": "dc1",
  "retry_join": [
    "provider=aws tag_key=learn-consul-redundancy-zones tag_value=join"
  ],
  "node_meta": {
    "zone": "zone0"
  },
  "autopilot": {
    "redundancy_zone_tag": "zone"
  },
  "license_path": "/etc/consul.d/consul.hclic",
  "encrypt": "",
  "encrypt_verify_incoming": false,
  "encrypt_verify_outgoing": false,
  "server": true,
  "bootstrap_expect": 3,
  "log_level": "INFO",
  "ui_config": {
    "enabled": true
  },
  "tls": {
    "defaults": {
      "ca_file": "/etc/consul.d/ca.pem",
      "verify_outgoing": false
    }
  },
  "ports": {
    "grpc": 8502
  },
  "bind_addr": "{{ GetPrivateInterfaces | include \"network\" \"10.0.0.0/16\" | attr \"address\" }}"
}

Finally, line 27 adds the learn-consul-redundancy-zones tag with the value join. The retry_join setting matches on this tag, which enables instances to identify and join the other servers in the cluster.

Review configuration for redundancy zones

Inspect the configuration file on the first Consul instance in the first server group. If you get an error that there is no such file or directory, this means that the provisioning is still working on the instance. Wait for a few minutes more before you continue this tutorial.

$ ssh $GROUP1_SERVER0 "cat /etc/consul.d/client.json"

{
  "acl": {
    "enabled": true,
    "down_policy": "async-cache",
    "default_policy": "deny",
    "tokens": {
      "agent": "xxxxxxxxxxxxxxxxxx",
      "default": "xxxxxxxxxxxxxxxxxx",
      "initial_management": "xxxxxxxxxxxxxxxxxx"
    }
  },
  "datacenter": "dc1",
  "retry_join": [
    "provider=aws tag_key=learn-consul-redundancy-zones tag_value=join"
  ],
  "node_meta": {
    "zone": "zone0"
  },
  "autopilot": {
    "redundancy_zone_tag": "zone"
  },
  "license_path": "/etc/consul.d/consul.hclic",
  "encrypt": "",
  "encrypt_verify_incoming": false,
  "encrypt_verify_outgoing": false,
  "server": true,
  "bootstrap_expect": 3,
  "log_level": "INFO",
  "ui_config": {
    "enabled": true
  },
  "tls": {
    "defaults": {
      "ca_file": "/etc/consul.d/ca.pem",
      "verify_outgoing": false
    }
  },
  "ports": {
    "grpc": 8502
  },
  "bind_addr": "{{ GetPrivateInterfaces | include \"network\" \"10.0.0.0/16\" | attr \"address\" }}"
}

When you use Consul's redundancy zones functionality, every Consul server must be assigned to a zone. Under normal operation, a zone has only one Consul server participating as a voter, but it can include multiple non-voter Consul servers. You define a server's zone with a node metadata tag whose value is the zone name.

In the provisioning template for the Consul servers, these zones are defined and configured according to the following code blocks:

    "node_meta": {
        "zone": "zone${count}"
    },
    "autopilot": {
        "redundancy_zone_tag": "zone"
    },

The tag name zone is arbitrary and could be anything, as long as the key under node_meta matches the value of redundancy_zone_tag. If you change the name, we recommend that you use the same tag name on all Consul servers.
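
As an illustration of how the template assigns zones, the sketch below reproduces the zone${count} substitution in plain shell. It assumes both server groups render the same template with count set to the Terraform count.index, which matches the zones reported later in this tutorial. The zone_map helper is a hypothetical name used only for this example.

```shell
# zone_map: print each server name with the zone that "zone${count}" yields,
# assuming count is the Terraform count.index (0, 1, 2) in both groups.
zone_map() {
  for group in 1 2; do
    for index in 0 1 2; do
      echo "consul-group${group}-server${index} zone${index}"
    done
  done
}

zone_map
```

Under these assumptions, each zone ends up with one server from group 1 and one server from group 2.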

You can inspect the configured zone tag with a direct query to the Consul server agent on the deployed instance.

$ ssh $GROUP1_SERVER0 "consul operator autopilot get-config"

CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
MinQuorum = 0
ServerStabilizationTime = 10s
RedundancyZoneTag = "zone"
DisableUpgradeMigration = false
UpgradeVersionTag = ""

You can inspect the node's tag configuration with a query to the /agent/self API endpoint of the Consul server agent on the deployed instance.

$ ssh $GROUP1_SERVER0 "curl --silent localhost:8500/v1/agent/self" | jq .Meta
{
  "consul-network-segment": "",
  "consul-version": "1.17.3",
  "zone": "zone0"
}

Tip

To change a zone tag without reloading the Consul configuration file, use the consul operator autopilot set-config -redundancy-zone-tag=<tag-name> command or the related API endpoint.

Review voting status for Consul servers

Run the consul operator raft list-peers command on the first Consul server from the first server group and review which nodes are voters and which ones are non-voters. Your results may be different based on which node was provisioned first. Refer to the Voter column in the output.

$ ssh $GROUP1_SERVER0 "consul operator raft list-peers"
Node                   ID                                    Address          State     Voter  RaftProtocol  Commit Index  Trails Leader By
consul-group1-server0  27f94c2a-9f12-1cfb-9357-a574919a7aa1  10.0.4.237:8300  leader    true   3             322           -
consul-group1-server1  ed563e54-3a26-aebd-9565-23d21609d22d  10.0.4.246:8300  follower  true   3             322           0 commits
consul-group1-server2  a1091fef-d90b-72d3-da61-d6bc60f2ed04  10.0.4.97:8300   follower  true   3             322           0 commits
consul-group2-server0  36747824-080a-a693-b419-ae3309de3389  10.0.4.242:8300  follower  false  3             322           0 commits
consul-group2-server1  cda2c288-c01b-0704-fcf8-336aba213b98  10.0.4.93:8300   follower  false  3             322           0 commits
consul-group2-server2  ca684698-68e7-e759-501d-d575c8cd41ec  10.0.4.186:8300  follower  false  3             322           0 commits

In this case, the voting servers are consul-group1-server0, consul-group1-server1 and consul-group1-server2.
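
If you prefer a quick summary over reading the table, a small awk filter can tally the Voter column. This is a sketch that assumes the column layout shown above, where Voter is the fifth whitespace-separated field. The count_voters helper is hypothetical and is not a Consul command.

```shell
# count_voters: read `consul operator raft list-peers` output on stdin and
# tally the fifth column (Voter). The header row matches neither value.
count_voters() {
  awk '$5 == "true"  { voters++ }
       $5 == "false" { nonvoters++ }
       END { printf "voters=%d non-voters=%d\n", voters, nonvoters }'
}

# Two sample rows taken from the output above (IDs shortened for readability).
count_voters <<'EOF'
Node                   ID        Address          State     Voter  RaftProtocol  Commit Index  Trails Leader By
consul-group1-server0  27f94c2a  10.0.4.237:8300  leader    true   3             322           -
consul-group2-server0  36747824  10.0.4.242:8300  follower  false  3             322           0 commits
EOF
```

Against the live cluster, you could pipe the real output into the function, for example ssh $GROUP1_SERVER0 "consul operator raft list-peers" | count_voters. In the initial deployment state it should report three voters and three non-voters.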

If all six servers are voters, make sure your Consul license includes the Redundancy Zone feature set. Run the following command and inspect your license.

$ ssh $GROUP1_SERVER0 "consul license get"
License is valid
License ID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Customer ID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Expires At: 2030-04-27 00:00:00 +0000 UTC
Terminates At: 2035-04-27 00:00:00 +0000 UTC
Non-terminating: false
Datacenter: *
Modules:
        Global Visibility, Routing and Scale
        Governance and Policy
Licensed Features:
        Automated Backups
        Automated Upgrades
        Enhanced Read Scalability
        Network Segments
        Redundancy Zone
        Advanced Network Federation
        Namespaces
        SSO
        Audit Logging
        Admin Partitions

Test fault tolerance

In the next part of this tutorial, you will test the fault tolerance of the Consul cluster by simulating the failure of one server in a redundancy zone, then all servers in a redundancy zone. Finally, you will restart these servers and observe the results.

Stop one server in a zone

To verify that redundancy zones are configured correctly, stop one of the voters and check that the non-voter in its redundancy zone becomes a voter. Because consul-group1-server0 is currently a voter, you can terminate the Consul server agent without notice to simulate a failure.

$ ssh $GROUP1_SERVER0 "sudo systemctl --signal=SIGKILL stop consul"

Select another instance and inspect the status of the cluster. The following command runs on the second server in group 1.

$ ssh $GROUP1_SERVER1 "consul operator raft list-peers"
Node                   ID                                    Address          State     Voter  RaftProtocol  Commit Index  Trails Leader By
consul-group1-server1  ed563e54-3a26-aebd-9565-23d21609d22d  10.0.4.246:8300  follower  true   3             466           0 commits
consul-group1-server2  a1091fef-d90b-72d3-da61-d6bc60f2ed04  10.0.4.97:8300   leader    true   3             466           -
consul-group2-server0  36747824-080a-a693-b419-ae3309de3389  10.0.4.242:8300  follower  true   3             466           0 commits
consul-group2-server1  cda2c288-c01b-0704-fcf8-336aba213b98  10.0.4.93:8300   follower  false  3             466           0 commits
consul-group2-server2  ca684698-68e7-e759-501d-d575c8cd41ec  10.0.4.186:8300  follower  false  3             466           0 commits

After the leader node consul-group1-server0 failed, three events took place:

consul-group2-server0, the non-voter server in zone0, was promoted to a voter.
consul-group1-server2 was elected leader.
consul-group1-server0 was removed from the list of peers because Consul autopilot executed dead server cleanup.

To check on the status of consul-group1-server0, run the consul members command.

$ ssh $GROUP1_SERVER1 "consul members"
Node                   Address          Status  Type    Build       Protocol  DC   Partition  Segment
consul-group1-server0  10.0.4.254:8301  left    server  1.17.3+ent  2         dc1  default    <all>
consul-group1-server1  10.0.4.68:8301   alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group1-server2  10.0.4.165:8301  alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group2-server0  10.0.4.184:8301  alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group2-server1  10.0.4.53:8301   alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group2-server2  10.0.4.42:8301   alive   server  1.17.3+ent  2         dc1  default    <all>

By default, all Consul server nodes in a zone have the potential to become that zone's voter. To explicitly forbid one or more Consul servers from ever becoming a voter, use enhanced read scalability. When you set the agent's read_replica flag (named non_voting_server in older Consul releases) to true, the Consul server helps offload read traffic from the voting servers but never becomes a voter, even if all of the other servers in its zone fail.

Stop all servers in a zone

After you shut down consul-group1-server0, there is only one server left in zone0. Run the following command to stop the remaining Consul server in zone0, which simulates a total zone failure.

$ ssh $GROUP2_SERVER0 "sudo systemctl --signal=SIGKILL stop consul"

Inspect the status of the cluster. Run the command against the second server from the first server group, or any other instance that still runs.

$ ssh $GROUP1_SERVER1 "consul operator raft list-peers"
Node                   ID                                    Address          State     Voter  RaftProtocol  Commit Index  Trails Leader By
consul-group1-server1  7813ce06-16dd-d06a-7343-46dd4ca51a11  10.0.4.68:8300   leader    true   3             648           -
consul-group1-server2  dd1374e1-1eca-e75c-d9f2-19fedc102824  10.0.4.165:8300  follower  true   3             648           0 commits
consul-group2-server0  b13d5391-688e-8eaf-a749-88e1ed85a104  10.0.4.184:8300  follower  false  3             580           68 commits
consul-group2-server1  d3054163-3822-0fb0-560f-bc6548ed40a9  10.0.4.53:8300   follower  true   3             648           0 commits
consul-group2-server2  9d4d2fc0-80e5-df80-1ab8-c19874b0a8c0  10.0.4.42:8300   follower  false  3             648           0 commits

After the node consul-group2-server0 failed, two events took place:

consul-group2-server1 was promoted to a voter.
consul-group2-server0 lost its voter status and began to trail the leader's commit index.

To keep three voting servers in the cluster, Consul Autopilot promotes an available server from a different zone, even if that zone already has a voter.
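
The two promotions you have observed so far follow a simple preference order, sketched below as a toy model. This is not autopilot's actual implementation: pick_promotion is a hypothetical function, the three-column input format is invented for the example, and choosing the first listed standby from another zone is an assumption.

```shell
# pick_promotion ZONE: read "name zone role" lines for servers that are still
# alive (role is "voter" or "standby") and print which standby a simplified
# model would promote after ZONE loses its voter.
pick_promotion() {
  awk -v zone="$1" '
    $3 == "standby" && $2 == zone && same == "" { same = $1 }
    $3 == "standby" && other == ""              { other = $1 }
    END { print (same != "" ? same : other) }'
}

# zone0 has lost both servers; only zone1 and zone2 servers remain alive.
pick_promotion zone0 <<'EOF'
consul-group1-server1 zone1 voter
consul-group1-server2 zone2 voter
consul-group2-server1 zone1 standby
consul-group2-server2 zone2 standby
EOF
```

With the input above, the model selects consul-group2-server1, which matches the promotion in the list-peers output. If you add a line such as consul-group2-server0 zone0 standby, the model prefers that same-zone standby instead.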

Inspect the Consul Autopilot state and verify that the extra voter in zone1 was promoted because of the failure of all nodes in zone0. The output is trimmed for brevity.

$ ssh $GROUP1_SERVER1 "consul operator autopilot state -format=json | jq -r \" (.Servers | .[] | [.Name, .RedundancyZone, .NodeType]) | @tsv \" | sort"
consul-group1-server1   zone1   zone-voter
consul-group1-server2   zone2   zone-voter
consul-group2-server0   zone0   zone-voter
consul-group2-server1   zone1   zone-extra-voter
consul-group2-server2   zone2   zone-standby

The Node Type describes the voter status.

zone-voter indicates that autopilot designates this server to be the voter for the specific zone.
zone-standby indicates that autopilot designates this server to become the voter if a voter from the zone fails.
zone-extra-voter indicates that autopilot promoted this server to an additional voter because all servers in another zone failed. When one of the servers in the failed zone is restored, this server is automatically demoted.

Explore the command's full output. It includes the Consul server node's name, its zone, and its role.

$ ssh $GROUP1_SERVER1 "consul operator autopilot state"
Healthy:                      false
Failure Tolerance:            1
Optimistic Failure Tolerance: 2
Leader:                       7813ce06-16dd-d06a-7343-46dd4ca51a11
Voters:
   7813ce06-16dd-d06a-7343-46dd4ca51a11
   dd1374e1-1eca-e75c-d9f2-19fedc102824
   d3054163-3822-0fb0-560f-bc6548ed40a9
Redundancy Zones:
   zone0:
      Failure Tolerance: 0
      Voters:
      Servers:
         b13d5391-688e-8eaf-a749-88e1ed85a104
   zone1:
      Failure Tolerance: 1
      Voters:
         7813ce06-16dd-d06a-7343-46dd4ca51a11
         d3054163-3822-0fb0-560f-bc6548ed40a9
      Servers:
         7813ce06-16dd-d06a-7343-46dd4ca51a11
         d3054163-3822-0fb0-560f-bc6548ed40a9
   zone2:
      Failure Tolerance: 1
      Voters:
         dd1374e1-1eca-e75c-d9f2-19fedc102824
      Servers:
         dd1374e1-1eca-e75c-d9f2-19fedc102824
         9d4d2fc0-80e5-df80-1ab8-c19874b0a8c0
Upgrade:
   Status:         idle
   Target Version: 1.17.3
   Target Version Voters:
      7813ce06-16dd-d06a-7343-46dd4ca51a11
      dd1374e1-1eca-e75c-d9f2-19fedc102824
      d3054163-3822-0fb0-560f-bc6548ed40a9
   Target Version Non-Voters:
      9d4d2fc0-80e5-df80-1ab8-c19874b0a8c0
      b13d5391-688e-8eaf-a749-88e1ed85a104
Servers:
   7813ce06-16dd-d06a-7343-46dd4ca51a11
      Name:            consul-group1-server1
      Address:         10.0.4.68:8300
      Version:         1.17.3
      Status:          leader
      Node Type:       zone-voter
      Node Status:     alive
      Healthy:         true
      Last Contact:    0s
      Last Term:       4
      Last Index:      1464
      Redundancy Zone: zone1
      Upgrade Version: 1.17.3
      Meta
         "consul-network-segment": ""
         "consul-version": "1.17.3"
         "zone": "zone1"
   9d4d2fc0-80e5-df80-1ab8-c19874b0a8c0
      Name:            consul-group2-server2
      Address:         10.0.4.42:8300
      Version:         1.17.3
      Status:          non-voter
      Node Type:       zone-standby
      Node Status:     alive
      Healthy:         true
      Last Contact:    75.292238ms
      Last Term:       4
      Last Index:      1464
      Redundancy Zone: zone2
      Upgrade Version: 1.17.3
      Meta
         "consul-network-segment": ""
         "consul-version": "1.17.3"
         "zone": "zone2"
   b13d5391-688e-8eaf-a749-88e1ed85a104
      Name:            consul-group2-server0
      Address:         10.0.4.184:8300
      Version:         1.17.3
      Status:          non-voter
      Node Type:       zone-voter
      Node Status:     failed
      Healthy:         false
      Last Contact:    19.749669ms
      Last Term:       4
      Last Index:      580
      Redundancy Zone: zone0
      Upgrade Version: 1.17.3
      Meta
         "consul-network-segment": ""
         "consul-version": "1.17.3"
         "zone": "zone0"
   d3054163-3822-0fb0-560f-bc6548ed40a9
      Name:            consul-group2-server1
      Address:         10.0.4.53:8300
      Version:         1.17.3
      Status:          voter
      Node Type:       zone-extra-voter
      Node Status:     alive
      Healthy:         true
      Last Contact:    31.604337ms
      Last Term:       4
      Last Index:      1464
      Redundancy Zone: zone1
      Upgrade Version: 1.17.3
      Meta
         "consul-network-segment": ""
         "consul-version": "1.17.3"
         "zone": "zone1"
   dd1374e1-1eca-e75c-d9f2-19fedc102824
      Name:            consul-group1-server2
      Address:         10.0.4.165:8300
      Version:         1.17.3
      Status:          voter
      Node Type:       zone-voter
      Node Status:     alive
      Healthy:         true
      Last Contact:    44.459057ms
      Last Term:       4
      Last Index:      1464
      Redundancy Zone: zone2
      Upgrade Version: 1.17.3
      Meta
         "consul-network-segment": ""
         "consul-version": "1.17.3"
         "zone": "zone2"

Shutting down the second server in zone0 has another effect. The output of the consul operator raft list-peers command displayed earlier shows that consul-group2-server0 is still in the Raft peers list, but as a non-voter with a trailing Raft index. This server remains in the list because no other server was available in its zone, so Consul Autopilot did not execute its dead server cleanup.

Run the following command to inspect the Consul cluster members and their status.

$ ssh $GROUP1_SERVER1 "consul members"
Node                   Address          Status  Type    Build       Protocol  DC   Partition  Segment
consul-group1-server0  10.0.4.254:8301  left    server  1.17.3+ent  2         dc1  default    <all>
consul-group1-server1  10.0.4.68:8301   alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group1-server2  10.0.4.165:8301  alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group2-server0  10.0.4.184:8301  failed  server  1.17.3+ent  2         dc1  default    <all>
consul-group2-server1  10.0.4.53:8301   alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group2-server2  10.0.4.42:8301   alive   server  1.17.3+ent  2         dc1  default    <all>

The status of consul-group2-server0 is failed. Compare it to the status of consul-group1-server0, which was also marked as failed at first. When consul-group2-server0 took over its voter role, Consul Autopilot removed consul-group1-server0 from the cluster and marked it as left.
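
To pick out the servers that are not alive without scanning the table, you can filter on the Status column. The sketch below assumes the column layout shown above, with Status as the third field. The unhealthy_members helper is a hypothetical name, not a Consul command.

```shell
# unhealthy_members: read `consul members` output on stdin and print the name
# and status of every member whose Status column is not "alive".
unhealthy_members() {
  awk 'NR > 1 && $3 != "alive" { print $1, $3 }'
}

# Sample rows taken from the output above, trimmed to the first five columns.
unhealthy_members <<'EOF'
Node                   Address          Status  Type    Build
consul-group1-server0  10.0.4.254:8301  left    server  1.17.3+ent
consul-group1-server1  10.0.4.68:8301   alive   server  1.17.3+ent
consul-group2-server0  10.0.4.184:8301  failed  server  1.17.3+ent
EOF
```

For the sample rows, the filter lists consul-group1-server0 as left and consul-group2-server0 as failed, which are the two states compared in the paragraph above.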

Recover all servers in a zone

Next, observe what happens when you recover the servers in zone0. Execute the following command to restart the Consul server agents on both server instances.

$ ssh $GROUP1_SERVER0 "sudo systemctl start consul" && \
    ssh $GROUP2_SERVER0 "sudo systemctl start consul"

Wait a few minutes for the Consul servers to start. Afterwards, inspect the state of the Consul cluster members.

$ ssh $GROUP1_SERVER1 "consul members"                  
Node                   Address          Status  Type    Build       Protocol  DC   Partition  Segment
consul-group1-server0  10.0.4.126:8301  alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group1-server1  10.0.4.68:8301   alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group1-server2  10.0.4.165:8301  alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group2-server0  10.0.4.111:8301  alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group2-server1  10.0.4.53:8301   alive   server  1.17.3+ent  2         dc1  default    <all>
consul-group2-server2  10.0.4.42:8301   alive   server  1.17.3+ent  2         dc1  default    <all>

All Consul server nodes are back in the cluster and their state is alive. The next command shows the Raft peer set of the cluster and their voting status.

$ ssh $GROUP1_SERVER1 "consul operator raft list-peers"
Node                   ID                                    Address          State     Voter  RaftProtocol  Commit Index  Trails Leader By
consul-group1-server1  7813ce06-16dd-d06a-7343-46dd4ca51a11  10.0.4.68:8300   leader    true   3             1870          -
consul-group1-server2  dd1374e1-1eca-e75c-d9f2-19fedc102824  10.0.4.165:8300  follower  true   3             1870          0 commits
consul-group2-server1  d3054163-3822-0fb0-560f-bc6548ed40a9  10.0.4.53:8300   follower  false  3             1870          0 commits
consul-group2-server2  9d4d2fc0-80e5-df80-1ab8-c19874b0a8c0  10.0.4.42:8300   follower  false  3             1870          0 commits
consul-group2-server0  4fe23f52-dc33-7ba0-ff5a-648a842a978d  10.0.4.111:8300  follower  true   3             1870          0 commits
consul-group1-server0  59b33708-1874-e62a-261d-cff58a69b3f8  10.0.4.126:8300  follower  false  3             1870          0 commits

Notice that in this case consul-group2-server0 has become the voter for zone0, consul-group1-server0 has returned to the list as a non-voter, and consul-group2-server1 has been demoted back to a non-voter. Finally, inspect the Consul Autopilot node roles for the cluster.

$ ssh $GROUP1_SERVER1 "consul operator autopilot state -format=json | jq -r \" (.Servers | .[] | [.Name, .RedundancyZone, .NodeType]) | @tsv \" | sort"
consul-group1-server0   zone0   zone-standby
consul-group1-server1   zone1   zone-voter
consul-group1-server2   zone2   zone-voter
consul-group2-server0   zone0   zone-voter
consul-group2-server1   zone1   zone-standby
consul-group2-server2   zone2   zone-standby

The cluster state was recovered. There are three voters and three non-voters in total. There is no priority for previous voters to return to their voting state. The first node to join the cluster in an empty zone becomes a voter, and any other nodes that join after it are treated as non-voters.
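
The join-order rule described above can be expressed as a short filter. This is a simplified sketch of the stated rule only, not of autopilot itself. The assign_roles function and its two-column "name zone" input are hypothetical, and the join order in the example is an assumption.

```shell
# assign_roles: read "name zone" lines in join order and label the first
# server seen in each zone as voter, every later one as non-voter.
assign_roles() {
  awk '{ role = ($2 in seen) ? "non-voter" : "voter"; seen[$2] = 1; print $1, $2, role }'
}

# Assumed order in which the two zone0 servers rejoin an empty zone.
assign_roles <<'EOF'
consul-group2-server0 zone0
consul-group1-server0 zone0
EOF
```

Under the assumed order, consul-group2-server0 is labeled voter and consul-group1-server0 is labeled non-voter, regardless of which one held the voter role before the failure.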

Clean up environment

Destroy the Terraform resources to clean up your environment. Enter yes to confirm the destroy operation.

$ terraform destroy
##...

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: yes

##...

Destroy complete! Resources: 25 destroyed.

Due to race conditions with the various cloud resources created in this tutorial, you may need to run the destroy operation twice to ensure all resources have been properly removed.

Next steps

In this tutorial you learned how to configure Consul Redundancy Zones in a pool of Consul server nodes and use the non-voting servers as hot standby instances in case one of the voters fails. You observed that when a voter fails, autopilot promotes another server from its zone to the voter role.

Consul Redundancy Zones is a part of the Autopilot functionality set. To learn more about Autopilot, go to the Day 2 Operations: Autopilot tutorial next.
