Provide fault tolerance with redundancy zones
Enterprise Only
The redundancy zone functionality demonstrated here requires HashiCorp Cloud Platform (HCP) or self-managed Consul Enterprise. If you've purchased or wish to try out Consul Enterprise, refer to how to access Consul Enterprise.
This tutorial demonstrates how you can improve your Consul datacenter's fault resiliency by using redundancy zones.
These instructions demonstrate Consul's autopilot features, which make it possible to run one voter alongside any number of non-voters in each defined redundancy zone.
During this tutorial, you will deploy two servers, one voter and one non-voter, in each of the three cloud regions, for a total of six servers. To simplify this tutorial, we refer to these groups of servers with the following names:
- group 1 is the set of servers that are voters in the initial deployment state.
- group 2 is the set of servers that are non-voters in the initial deployment state.
After a server in group 1 fails, autopilot promotes a non-voter from the same zone to voter status automatically. As a result, Consul servers can continue operating without an effect on server quorum. For more information about Consul server redundancy and quorum, refer to the Consul reference architecture.
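The relationship between the number of voters and quorum can be illustrated with a little shell arithmetic. This is a minimal sketch that assumes the usual Raft majority rule (quorum is a strict majority of the voting servers). The function names quorum_size and failure_tolerance are hypothetical helpers, not Consul commands.

```shell
# Hypothetical helpers (not part of Consul) that illustrate Raft majority math.

# quorum_size VOTERS: smallest strict majority of VOTERS voting servers.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# failure_tolerance VOTERS: voters that can fail while a majority remains.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# This tutorial runs three voters (one per zone) plus three non-voters.
echo "voters=3 quorum=$(quorum_size 3) tolerance=$(failure_tolerance 3)"
```

For three voters this prints voters=3 quorum=2 tolerance=1. Redundancy zones do not change this arithmetic. They aim to restore the third voter quickly by promoting a standby server.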
The following diagrams show the Consul architecture and its changes across the course of this tutorial:
The tutorial assumes that you are familiar with Consul and its core functionality. If you are new to Consul, refer to the Consul Getting Started tutorials collection.
To complete this tutorial, you need the following software:
- Consul Enterprise with a license
- An AWS account configured for use with Terraform
- git >= 2.0
- aws-cli >= 2.0
- terraform >= 1.4
- jq >= 1.6
Clone GitHub repository
Clone the GitHub repository containing the configuration files and resources.
$ git clone
Change into the directory that contains the complete configuration files for this tutorial.
$ cd learn-consul-redundancy-zones
This repository contains Terraform configurations to spin up the initial infrastructure, as well as files to automatically configure and deploy Consul.
This tutorial's repository contains the following items:
- instance-scripts/ directory contains the bash scripts used to bootstrap and join the Consul servers running on EC2 instances
- provisioning/ directory contains Consul agent configuration file
defines the EC2 instances the Consul servers run
defines Terraform outputs you use to authenticate and connect to your EC2
contains provider definitions for
defines variables you can use to customize the
defines the AWS VPC resources
The Terraform files provision the following billable AWS resources:
- An AWS key pair
- An AWS EC2 instance group running Consul server agent
Set up the Consul license
Redundancy zones are a Consul Enterprise feature, which means that servers require an Enterprise license key. If you do not have a Consul Enterprise license, you can register for a 30-day trial license.
To start the tutorial, place your Consul Enterprise license file in the repository directory before you deploy the infrastructure. The Terraform configuration uploads the license on your behalf. Ensure the filename is consul.hclic.
$ touch consul.hclic
Deploy your infrastructure
Initialize your Terraform configuration to download the necessary providers and modules.
$ terraform init
Initializing the backend...
Initializing provider plugins...
Terraform has been successfully initialized!
Then create the infrastructure. When prompted, enter yes to confirm the run.
This tutorial targets AWS region `us-east-2` as its default. If you want to deploy to another region, modify the `terraform.tfvars` file accordingly.
$ terraform apply
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
It takes a few minutes to deploy your infrastructure. After the deploy completes, it returns a list of outputs you need to complete the tutorial.
Apply complete! Resources: 25 added, 0 changed, 0 destroyed.
consul_group1_ips = [
consul_group2_ips = [
consul_token = <sensitive>
next_steps = [
"You can now add the TLS certificate for accessing your EC2 instances by running:",
"ssh-add ./tls-key.pem",
After Terraform deploys the infrastructure for this tutorial, you need to set up SSH access to the EC2 instances.
In order to log on to the instances, configure your SSH key manager agent to use the correct SSH key identity file.
$ ssh-add tls-key.pem
Identity added: tls-key.pem (tls-key.pem)
To make it easier to run remote commands on the instances, save the IP addresses of the Consul server nodes into a set of environment variables with the following command.
$ export GROUP1_SERVER0=ubuntu@$(terraform output -json 'consul_group1_ips' | jq -r '.[0]') && \
export GROUP1_SERVER1=ubuntu@$(terraform output -json 'consul_group1_ips' | jq -r '.[1]') && \
export GROUP1_SERVER2=ubuntu@$(terraform output -json 'consul_group1_ips' | jq -r '.[2]') && \
export GROUP2_SERVER0=ubuntu@$(terraform output -json 'consul_group2_ips' | jq -r '.[0]') && \
export GROUP2_SERVER1=ubuntu@$(terraform output -json 'consul_group2_ips' | jq -r '.[1]') && \
export GROUP2_SERVER2=ubuntu@$(terraform output -json 'consul_group2_ips' | jq -r '.[2]')
Review Terraform configuration for server instances
This Terraform configuration creates the following:
- a TLS key pair that you can use to log in to the server instances
- a couple of AWS IAM policies for the instances so they can use Consul cloud join
- two groups of EC2 instances that run Consul as servers
The EC2 instance uses a provisioning script instance-scripts/ that is executed by the cloud-init subsystem to automate the Consul client configuration and provisioning. This script installs the Consul agent package on the instance and sets up its Consul configuration file. The latter is automatically generated by Terraform for each Consul server instance.
Inspect the consul-server-group1 resource in the Terraform configuration file. The following output is trimmed for brevity.
resource "aws_instance" "consul-server-group1" {
count = 3
ami =
instance_type = "t3.micro"
iam_instance_profile =
setup = base64gzip(templatefile("${path.module}/instance-scripts/", {
hostname = "consul-group1-server${count.index}",
consul_license = base64encode(file("${path.module}/consul.hclic")),
consul_ca = base64encode(tls_self_signed_cert.consul_ca_cert.cert_pem),
consul_config = base64encode(templatefile("${path.module}/provisioning/templates/consul-server.json", {
count = count.index,
datacenter = var.datacenter,
token = random_uuid.consul_bootstrap_token.result,
retry_join = "provider=aws tag_key=learn-consul-redundancy-zones tag_value=join",
consul_acl_token = random_uuid.consul_bootstrap_token.result,
consul_version = var.consul_version,
vpc_cidr = module.vpc.vpc_cidr_block,
tags = {
Name = "consul-group1-server${count.index}"
learn-consul-redundancy-zones = "join"
On line 2, the count meta-argument causes Terraform to deploy three instances of this resource. Each instance's hostname is dynamically generated on line 10 by appending the instance number to the end of the consul-group1-server prefix.
Line 13 generates the Consul configuration file. It uses the template in provisioning/templates/consul-server.json and passes a few variables to it to enable the Consul cloud auto-join feature. Here is an example of the generated configuration for the consul-group1-server0 node. You will review this configuration in the next section of this tutorial.
{
  "acl": {
    "enabled": true,
    "down_policy": "async-cache",
    "default_policy": "deny",
    "tokens": {
      "agent": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "default": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "initial_management": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    }
  },
  "datacenter": "dc1",
  "retry_join": [
    "provider=aws tag_key=learn-consul-redundancy-zones tag_value=join"
  ],
  "node_meta": {
    "zone": "zone0"
  },
  "autopilot": {
    "redundancy_zone_tag": "zone"
  },
  "license_path": "/etc/consul.d/consul.hclic",
  "encrypt": "",
  "encrypt_verify_incoming": false,
  "encrypt_verify_outgoing": false,
  "server": true,
  "bootstrap_expect": 3,
  "log_level": "INFO",
  "ui_config": {
    "enabled": true
  },
  "tls": {
    "defaults": {
      "ca_file": "/etc/consul.d/ca.pem",
      "verify_outgoing": false
    }
  },
  "ports": {
    "grpc": 8502
  },
  "bind_addr": "{{ GetPrivateInterfaces | include \"network\" \"\" | attr \"address\" }}"
}
Finally, line 27 includes the learn-consul-redundancy-zones key and its value join. This configuration enables instances to identify and join other servers in a cluster.
Review configuration for redundancy zones
Inspect the configuration file on the first Consul instance in the first server group. If you get an error that there is no such file or directory, this means that the provisioning is still working on the instance. Wait for a few minutes more before you continue this tutorial.
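If you prefer not to repeat the check by hand while provisioning finishes, the wait can be scripted with a small retry loop. The following is a minimal sketch and not part of the tutorial repository: retry_until is a hypothetical helper name, and the attempt count and delay in the usage comment are arbitrary values.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success and 1 if every attempt fails.
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Skip the pause after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (not executed here): check every 10 seconds, up to 30 times,
# whether the generated configuration file exists on the first server.
# retry_until 30 10 ssh "$GROUP1_SERVER0" "test -f /etc/consul.d/client.json"
```

When the helper returns successfully, the file is in place and the following command prints it.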
$ ssh $GROUP1_SERVER0 "cat /etc/consul.d/client.json"
{
  "acl": {
    "enabled": true,
    "down_policy": "async-cache",
    "default_policy": "deny",
    "tokens": {
      "agent": "xxxxxxxxxxxxxxxxxx",
      "default": "xxxxxxxxxxxxxxxxxx",
      "initial_management": "xxxxxxxxxxxxxxxxxx"
    }
  },
  "datacenter": "dc1",
  "retry_join": [
    "provider=aws tag_key=learn-consul-redundancy-zones tag_value=join"
  ],
  "node_meta": {
    "zone": "zone0"
  },
  "autopilot": {
    "redundancy_zone_tag": "zone"
  },
  "license_path": "/etc/consul.d/consul.hclic",
  "encrypt": "",
  "encrypt_verify_incoming": false,
  "encrypt_verify_outgoing": false,
  "server": true,
  "bootstrap_expect": 3,
  "log_level": "INFO",
  "ui_config": {
    "enabled": true
  },
  "tls": {
    "defaults": {
      "ca_file": "/etc/consul.d/ca.pem",
      "verify_outgoing": false
    }
  },
  "ports": {
    "grpc": 8502
  },
  "bind_addr": "{{ GetPrivateInterfaces | include \"network\" \"\" | attr \"address\" }}"
}
When you use Consul's redundancy zones functionality, every Consul server must be assigned to a zone. Under normal conditions, a zone has only one Consul server that participates as a voter, but it can include multiple non-voter Consul servers. You define the zone with a node metadata tag that designates the zone name.
In the provisioning template for the Consul servers, these zones are defined and configured according to the following code blocks:
"node_meta": {
  "zone": "zone${count}"
},
"autopilot": {
  "redundancy_zone_tag": "zone"
},
The name zone is arbitrary and could be anything. If you change the name, we recommend that you use the same tag name on all Consul servers.
You can inspect the configured zone tag with a direct query to the Consul server agent on the deployed instance.
$ ssh $GROUP1_SERVER0 "consul operator autopilot get-config"
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
MinQuorum = 0
ServerStabilizationTime = 10s
RedundancyZoneTag = "zone"
DisableUpgradeMigration = false
UpgradeVersionTag = ""
You can inspect the node's tag configuration with a query to the /agent/self API endpoint of the Consul server agent on the deployed instance.
$ ssh $GROUP1_SERVER0 "curl --silent localhost:8500/v1/agent/self" | jq .Meta
{
  "consul-network-segment": "",
  "consul-version": "1.17.3",
  "zone": "zone0"
}
To change a zone tag without reloading the Consul configuration file, use the consul operator autopilot set-config -redundancy-zone-tag=<tag-name> command or the related API endpoint.
Review voting status for Consul servers
Run the consul operator command on the first Consul server from the first server group and review which nodes are voters and which ones are non-voters. Your results may be different based on which node was provisioned first. Refer to the Voter column in the output.
$ ssh $GROUP1_SERVER0 "consul operator raft list-peers"
Node ID Address State Voter RaftProtocol Commit Index Trails Leader By
consul-group1-server0 27f94c2a-9f12-1cfb-9357-a574919a7aa1 leader true 3 322 -
consul-group1-server1 ed563e54-3a26-aebd-9565-23d21609d22d follower true 3 322 0 commits
consul-group1-server2 a1091fef-d90b-72d3-da61-d6bc60f2ed04 follower true 3 322 0 commits
consul-group2-server0 36747824-080a-a693-b419-ae3309de3389 follower false 3 322 0 commits
consul-group2-server1 cda2c288-c01b-0704-fcf8-336aba213b98 follower false 3 322 0 commits
consul-group2-server2 ca684698-68e7-e759-501d-d575c8cd41ec follower false 3 322 0 commits
In this case, the voting servers are consul-group1-server0, consul-group1-server1, and consul-group1-server2.
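Reading the Voter column by eye works for six servers, but the count can also be scripted. The following is a minimal sketch: count_voters is a hypothetical helper that reads consul operator raft list-peers output on standard input. It assumes that the Voter column is the only field whose value is the literal word true, so it works whether or not the Address column is present.

```shell
# count_voters: reads `consul operator raft list-peers` output on stdin and
# prints how many peers report true in the Voter column. Hypothetical helper.
count_voters() {
  awk 'NR > 1 {
         for (i = 1; i <= NF; i++)
           if ($i == "true") { n++; break }
       }
       END { print n + 0 }'
}

# Sample rows copied from the output above (addresses omitted, as shown there).
sample_peers='Node ID Address State Voter RaftProtocol Commit Index Trails Leader By
consul-group1-server0 27f94c2a-9f12-1cfb-9357-a574919a7aa1 leader true 3 322 -
consul-group1-server1 ed563e54-3a26-aebd-9565-23d21609d22d follower true 3 322 0 commits
consul-group1-server2 a1091fef-d90b-72d3-da61-d6bc60f2ed04 follower true 3 322 0 commits
consul-group2-server0 36747824-080a-a693-b419-ae3309de3389 follower false 3 322 0 commits
consul-group2-server1 cda2c288-c01b-0704-fcf8-336aba213b98 follower false 3 322 0 commits
consul-group2-server2 ca684698-68e7-e759-501d-d575c8cd41ec follower false 3 322 0 commits'

printf '%s\n' "$sample_peers" | count_voters
```

With the sample rows the helper prints 3. Against a live cluster, you could pipe the ssh command from this section into count_voters instead of the sample text.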
If all six servers are voters, make sure your Consul license includes the Redundancy Zone feature set. Run the following command and inspect your license.
$ ssh $GROUP1_SERVER0 "consul license get"
License is valid
License ID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Customer ID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Expires At: 2030-04-27 00:00:00 +0000 UTC
Terminates At: 2035-04-27 00:00:00 +0000 UTC
Non-terminating: false
Datacenter: *
Global Visibility, Routing and Scale
Governance and Policy
Licensed Features:
Automated Backups
Automated Upgrades
Enhanced Read Scalability
Network Segments
Redundancy Zone
Advanced Network Federation
Audit Logging
Admin Partitions
Test fault tolerance
In the next part of this tutorial, you will test the fault tolerance of the Consul cluster by simulating the failure of one server in a redundancy zone, then all servers in a redundancy zone. Finally, you will restart these servers and observe the results.
Stop one server in a zone
To verify that redundancy zones are configured correctly, stop one of the voters and check that the non-voter in its redundancy zone becomes a voter. Because consul-group1-server0 is currently a voter, you can terminate the Consul server agent without notice to simulate a failure.
$ ssh $GROUP1_SERVER0 "sudo systemctl --signal=SIGKILL stop consul"
Select another instance and inspect the status of the cluster. The following command runs on the second server in group 1.
$ ssh $GROUP1_SERVER1 "consul operator raft list-peers"
Node ID Address State Voter RaftProtocol Commit Index Trails Leader By
consul-group1-server1 ed563e54-3a26-aebd-9565-23d21609d22d follower true 3 466 0 commits
consul-group1-server2 a1091fef-d90b-72d3-da61-d6bc60f2ed04 leader true 3 466 -
consul-group2-server0 36747824-080a-a693-b419-ae3309de3389 follower true 3 466 0 commits
consul-group2-server1 cda2c288-c01b-0704-fcf8-336aba213b98 follower false 3 466 0 commits
consul-group2-server2 ca684698-68e7-e759-501d-d575c8cd41ec follower false 3 466 0 commits
After the leader node consul-group1-server0 failed, three events took place:
- consul-group2-server0, the non-voter server in zone0, was promoted to a voter.
- consul-group1-server2 was elected leader.
- consul-group1-server0 was removed from the list of peers because Consul autopilot executed dead server cleanup.
To check on the status of consul-group1-server0, run the consul members command.
$ ssh $GROUP1_SERVER1 "consul members"
Node Address Status Type Build Protocol DC Partition Segment
consul-group1-server0 left server 1.17.3+ent 2 dc1 default <all>
consul-group1-server1 alive server 1.17.3+ent 2 dc1 default <all>
consul-group1-server2 alive server 1.17.3+ent 2 dc1 default <all>
consul-group2-server0 alive server 1.17.3+ent 2 dc1 default <all>
consul-group2-server1 alive server 1.17.3+ent 2 dc1 default <all>
consul-group2-server2 alive server 1.17.3+ent 2 dc1 default <all>
By default, all Consul server nodes in a zone have the potential to become that zone's voter. To explicitly forbid one or more Consul servers from ever becoming a voter, use enhanced read scalability. When you set the agent's non_voting_server flag to true, the Consul server helps ease read load from the other voting servers but does not participate in voter elections, even if all of the other voter servers in its zone fail.
Stop all servers in a zone
After you shut down consul-group1-server0, there is only one server left in zone0. Run the following command to stop the remaining Consul server in zone0, which simulates a total zone failure.
$ ssh $GROUP2_SERVER0 "sudo systemctl --signal=SIGKILL stop consul"
Inspect the status of the cluster. Run the command against the second server from the first server group, or any other instance that still runs.
$ ssh $GROUP1_SERVER1 "consul operator raft list-peers"
Node ID Address State Voter RaftProtocol Commit Index Trails Leader By
consul-group1-server1 7813ce06-16dd-d06a-7343-46dd4ca51a11 leader true 3 648 -
consul-group1-server2 dd1374e1-1eca-e75c-d9f2-19fedc102824 follower true 3 648 0 commits
consul-group2-server0 b13d5391-688e-8eaf-a749-88e1ed85a104 follower false 3 580 68 commits
consul-group2-server1 d3054163-3822-0fb0-560f-bc6548ed40a9 follower true 3 648 0 commits
consul-group2-server2 9d4d2fc0-80e5-df80-1ab8-c19874b0a8c0 follower false 3 648 0 commits
After the node consul-group2-server0 failed, two events took place:
- consul-group2-server1 was promoted to a voter.
- consul-group2-server0 began to trail the leader's index.
To keep three voting servers in the cluster, Consul Autopilot promotes an available server from a different zone, even if that zone already has a voter.
Inspect the Consul Autopilot state and verify that the extra voter in zone1 was promoted because of the failure of all nodes in zone0. The output is trimmed for brevity.
$ ssh $GROUP1_SERVER1 "consul operator autopilot state -format=json | jq -r \" (.Servers | .[] | [.Name, .RedundancyZone, .NodeType]) | @tsv \" | sort"
consul-group1-server1 zone1 zone-voter
consul-group1-server2 zone2 zone-voter
consul-group2-server0 zone0 zone-voter
consul-group2-server1 zone1 zone-extra-voter
consul-group2-server2 zone2 zone-standby
The Node Type describes the voter status.
- zone-voter indicates that autopilot designates this server to be the voter for the specific zone.
- zone-standby indicates that autopilot designates this server to become the voter if a voter from the zone fails.
- zone-extra-voter indicates that autopilot designates this server as available to become a voter due to a failure of all servers in another zone. When one of the servers in the failed zone is restored, this server is automatically demoted.
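To see at a glance which zones carry more than one voter, you can tally the node types per zone. The following is a minimal sketch that assumes the tab-separated name, zone, and node type rows produced by the jq filter shown above. voters_per_zone is a hypothetical helper name.

```shell
# voters_per_zone: reads "name<TAB>zone<TAB>node-type" rows on stdin and prints
# "<zone> <voter-count>" for every zone, sorted by zone name. Hypothetical helper.
voters_per_zone() {
  awk '{ zones[$2] += 0 }   # register every zone, even one without voters
       $3 == "zone-voter" || $3 == "zone-extra-voter" { zones[$2]++ }
       END { for (z in zones) print z, zones[z] }' | sort
}

# Sample rows copied from the trimmed output above.
printf '%s\t%s\t%s\n' \
  consul-group1-server1 zone1 zone-voter \
  consul-group1-server2 zone2 zone-voter \
  consul-group2-server0 zone0 zone-voter \
  consul-group2-server1 zone1 zone-extra-voter \
  consul-group2-server2 zone2 zone-standby |
  voters_per_zone
```

For the sample rows the helper reports one voter for zone0, two for zone1, and one for zone2. The node type records the role that autopilot designates, not the server's health, so zone0 still lists consul-group2-server0 as its zone-voter even though that server has failed.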
Explore the command's full output. It includes the Consul server node's name, its zone, and its role.
$ ssh $GROUP1_SERVER1 "consul operator autopilot state"
Healthy: false
Failure Tolerance: 1
Optimistic Failure Tolerance: 2
Leader: 7813ce06-16dd-d06a-7343-46dd4ca51a11
Redundancy Zones:
Failure Tolerance: 0
Failure Tolerance: 1
Failure Tolerance: 1
Status: idle
Target Version: 1.17.3
Target Version Voters:
Target Version Non-Voters:
Name: consul-group1-server1
Version: 1.17.3
Status: leader
Node Type: zone-voter
Node Status: alive
Healthy: true
Last Contact: 0s
Last Term: 4
Last Index: 1464
Redundancy Zone: zone1
Upgrade Version: 1.17.3
"consul-network-segment": ""
"consul-version": "1.17.3"
"zone": "zone1"
Name: consul-group2-server2
Version: 1.17.3
Status: non-voter
Node Type: zone-standby
Node Status: alive
Healthy: true
Last Contact: 75.292238ms
Last Term: 4
Last Index: 1464
Redundancy Zone: zone2
Upgrade Version: 1.17.3
"consul-network-segment": ""
"consul-version": "1.17.3"
"zone": "zone2"
Name: consul-group2-server0
Version: 1.17.3
Status: non-voter
Node Type: zone-voter
Node Status: failed
Healthy: false
Last Contact: 19.749669ms
Last Term: 4
Last Index: 580
Redundancy Zone: zone0
Upgrade Version: 1.17.3
"consul-network-segment": ""
"consul-version": "1.17.3"
"zone": "zone0"
Name: consul-group2-server1
Version: 1.17.3
Status: voter
Node Type: zone-extra-voter
Node Status: alive
Healthy: true
Last Contact: 31.604337ms
Last Term: 4
Last Index: 1464
Redundancy Zone: zone1
Upgrade Version: 1.17.3
"consul-network-segment": ""
"consul-version": "1.17.3"
"zone": "zone1"
Name: consul-group1-server2
Version: 1.17.3
Status: voter
Node Type: zone-voter
Node Status: alive
Healthy: true
Last Contact: 44.459057ms
Last Term: 4
Last Index: 1464
Redundancy Zone: zone2
Upgrade Version: 1.17.3
"consul-network-segment": ""
"consul-version": "1.17.3"
"zone": "zone2"
The other effect of shutting down the second node in zone0 is that consul-group2-server0 remains in the Raft peers list shown in the earlier consul operator raft list-peers output, but as a non-voter with a trailing Raft index. The node stays in the list because no other server was available in its zone, so Consul Autopilot did not execute its dead server cleanup.
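A trailing peer can also be picked out of the list-peers output mechanically. The following is a minimal sketch: trailing_peers is a hypothetical helper that assumes the last two fields of each follower row are a number followed by the word commits, as in the output shown earlier.

```shell
# trailing_peers: reads `consul operator raft list-peers` output on stdin and
# prints "<node> <commits-behind>" for every peer that trails the leader.
# Hypothetical helper; the leader row ends in "-" and is never matched.
trailing_peers() {
  awk 'NR > 1 && $NF ~ /^commits?$/ && $(NF - 1) + 0 > 0 { print $1, $(NF - 1) }'
}

# Sample rows copied from the earlier output (addresses omitted, as shown there).
printf '%s\n' \
  'Node ID Address State Voter RaftProtocol Commit Index Trails Leader By' \
  'consul-group1-server1 7813ce06-16dd-d06a-7343-46dd4ca51a11 leader true 3 648 -' \
  'consul-group1-server2 dd1374e1-1eca-e75c-d9f2-19fedc102824 follower true 3 648 0 commits' \
  'consul-group2-server0 b13d5391-688e-8eaf-a749-88e1ed85a104 follower false 3 580 68 commits' \
  'consul-group2-server1 d3054163-3822-0fb0-560f-bc6548ed40a9 follower true 3 648 0 commits' \
  'consul-group2-server2 9d4d2fc0-80e5-df80-1ab8-c19874b0a8c0 follower false 3 648 0 commits' |
  trailing_peers
```

For the sample rows the helper prints consul-group2-server0 68, which matches the failed server in zone0.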
Run the following command to inspect the Consul cluster members and their status.
$ ssh $GROUP1_SERVER1 "consul members"
Node Address Status Type Build Protocol DC Partition Segment
consul-group1-server0 left server 1.17.3+ent 2 dc1 default <all>
consul-group1-server1 alive server 1.17.3+ent 2 dc1 default <all>
consul-group1-server2 alive server 1.17.3+ent 2 dc1 default <all>
consul-group2-server0 failed server 1.17.3+ent 2 dc1 default <all>
consul-group2-server1 alive server 1.17.3+ent 2 dc1 default <all>
consul-group2-server2 alive server 1.17.3+ent 2 dc1 default <all>
The status of consul-group2-server0 is failed. Compare it to the status of consul-group1-server0, which was also marked as failed at first. However, when consul-group2-server0 stepped into its role, Consul Autopilot ejected consul-group1-server0 from the cluster and marked it as left.
Recover all servers in a zone
Next, observe what happens when you recover the servers in zone0. Execute the following command to restart the Consul server agents on both server instances.
$ ssh $GROUP1_SERVER0 "sudo systemctl start consul" && \
ssh $GROUP2_SERVER0 "sudo systemctl start consul"
Wait a few minutes for the Consul servers to start. Afterwards, inspect the state of the Consul cluster members.
$ ssh $GROUP1_SERVER1 "consul members"
Node Address Status Type Build Protocol DC Partition Segment
consul-group1-server0 alive server 1.17.3+ent 2 dc1 default <all>
consul-group1-server1 alive server 1.17.3+ent 2 dc1 default <all>
consul-group1-server2 alive server 1.17.3+ent 2 dc1 default <all>
consul-group2-server0 alive server 1.17.3+ent 2 dc1 default <all>
consul-group2-server1 alive server 1.17.3+ent 2 dc1 default <all>
consul-group2-server2 alive server 1.17.3+ent 2 dc1 default <all>
All Consul server nodes are back in the cluster and their state is alive. The next command shows the Raft peer set of the cluster and their voting status.
$ ssh $GROUP1_SERVER1 "consul operator raft list-peers"
Node ID Address State Voter RaftProtocol Commit Index Trails Leader By
consul-group1-server1 7813ce06-16dd-d06a-7343-46dd4ca51a11 leader true 3 1870 -
consul-group1-server2 dd1374e1-1eca-e75c-d9f2-19fedc102824 follower true 3 1870 0 commits
consul-group2-server1 d3054163-3822-0fb0-560f-bc6548ed40a9 follower false 3 1870 0 commits
consul-group2-server2 9d4d2fc0-80e5-df80-1ab8-c19874b0a8c0 follower false 3 1870 0 commits
consul-group2-server0 4fe23f52-dc33-7ba0-ff5a-648a842a978d follower true 3 1870 0 commits
consul-group1-server0 59b33708-1874-e62a-261d-cff58a69b3f8 follower false 3 1870 0 commits
Notice that in this case consul-group2-server0 has become the voter for zone0, and consul-group1-server0 has returned to the list. Finally, inspect the Consul Autopilot node roles for the cluster.
$ ssh $GROUP1_SERVER1 "consul operator autopilot state -format=json | jq -r \" (.Servers | .[] | [.Name, .RedundancyZone, .NodeType]) | @tsv \" | sort"
consul-group1-server0 zone0 zone-standby
consul-group1-server1 zone1 zone-voter
consul-group1-server2 zone2 zone-voter
consul-group2-server0 zone0 zone-voter
consul-group2-server1 zone1 zone-standby
consul-group2-server2 zone2 zone-standby
The cluster state was recovered. There are three voters and three non-voters in total. There is no priority for previous voters to return to their voting state. The first node to join the cluster in an empty zone becomes a voter, and any other nodes that join after it are treated as non-voters.
Clean up environment
Destroy the Terraform resources to clean up your environment. Enter yes to confirm the destroy operation.
$ terraform destroy
Do you really want to destroy all resources?
Terraform will destroy all your managed infrastructure, as shown above.
There is no undo. Only 'yes' will be accepted to confirm.
Enter a value: yes
Destroy complete! Resources: 25 destroyed.
Due to race conditions with the various cloud resources created in this tutorial, you may need to run the destroy operation twice to ensure all resources have been properly removed.
Next steps
In this tutorial, you learned how to configure Consul Redundancy Zones in a pool of Consul server nodes and use the non-voting servers as hot standby instances in case one of the voters fails. You observed that when a voter fails, autopilot promotes another server from its zone to the voter role.
Consul Redundancy Zones is a part of the Autopilot functionality set. To learn more about Autopilot, go to the Day 2 Operations: Autopilot tutorial next.