Redundancy zones
Enterprise: an appropriate Vault Enterprise license is required.
Vault Enterprise Redundancy Zones provide both read scaling and resiliency benefits by enabling the deployment of non-voting nodes alongside voting nodes on a per availability zone basis.
With redundancy zones, an operator who deploys Vault across three availability zones could run two or more nodes in each zone: one voting node and one or more non-voting nodes. If a voting node in an availability zone fails, the redundancy zone configuration automatically promotes the non-voting node in that zone to a voting node. If an entire availability zone is lost, a non-voting node in one of the remaining availability zones is promoted to a voting node, keeping quorum. This capability functions as a "hot standby" for server nodes while also providing enhanced read scalability.
Configuration
A new key can be added to Vault's storage configuration stanza: autopilot_redundancy_zone. The value for this key is a string of your choosing and represents the zone this particular node should be in.
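As a minimal sketch, assuming the node uses integrated (Raft) storage, the key could sit in the storage stanza as shown below. The path, node ID, and zone name are hypothetical placeholders; only autopilot_redundancy_zone comes from the text above.

```hcl
# Hypothetical values; adjust for your own deployment.
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-a-1"

  # Zone label for this node. Any string works; nodes that share
  # a zone should use the same value.
  autopilot_redundancy_zone = "zone-a"
}
```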
Mechanics
Vault's Autopilot subsystem always attempts to maintain exactly one voting node per redundancy zone. Any additional nodes in a zone beyond the first are demoted to non-voting status. Non-voting nodes can serve reads but cannot participate in cluster elections.
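The one-voter-per-zone rule can be illustrated with a short Python sketch. This is not Autopilot's actual algorithm: the assign_roles function is hypothetical, and choosing the first node seen in each zone as the voter is an assumption made purely for illustration.

```python
def assign_roles(nodes):
    """Map node ID -> "voter" or "non-voter", keeping one voter per zone.

    `nodes` is a list of (node_id, zone) pairs. Picking the first node
    seen in each zone as the voter is an illustrative assumption.
    """
    roles = {}
    zones_with_voter = set()
    for node_id, zone in nodes:
        if zone in zones_with_voter:
            roles[node_id] = "non-voter"
        else:
            roles[node_id] = "voter"
            zones_with_voter.add(zone)
    return roles


# Three zones with two nodes each (hypothetical node IDs).
cluster = [
    ("a1", "zone-a"), ("a2", "zone-a"),
    ("b1", "zone-b"), ("b2", "zone-b"),
    ("c1", "zone-c"), ("c2", "zone-c"),
]
roles = assign_roles(cluster)
print(roles)
```

With this input, each zone ends up with one voter and one non-voter.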
If redundancy zones are used in conjunction with automated upgrades, Autopilot will always try to ensure that Vault is never moving from a more healthy state to a less healthy state. Autopilot will wait to begin leadership transfer until it can ensure that there will be as much redundancy on the new Vault version as there was on the old Vault version.
The status of redundancy zones can be monitored by consulting the Autopilot state API endpoint.
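A minimal Python sketch of such a check follows. It assumes the state is served at /v1/sys/storage/raft/autopilot/state and that the response carries healthy, failure_tolerance, optimistic_failure_tolerance, and a redundancy_zones map under data. Treat these field names as assumptions and verify them against the API documentation for your Vault version. Both helper names are hypothetical.

```python
import json
import urllib.request


def fetch_autopilot_state(addr, token):
    """GET the Autopilot state from a Vault server.

    The endpoint path is an assumption to verify for your Vault version.
    """
    req = urllib.request.Request(
        f"{addr}/v1/sys/storage/raft/autopilot/state",
        headers={"X-Vault-Token": token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize_state(state):
    """Extract the redundancy-related fields, tolerating missing keys."""
    data = state.get("data", {})
    zones = data.get("redundancy_zones") or {}
    return {
        "healthy": data.get("healthy"),
        "failure_tolerance": data.get("failure_tolerance"),
        "optimistic_failure_tolerance": data.get("optimistic_failure_tolerance"),
        "zones": sorted(zones),
    }
```

Keeping the summary separate from the HTTP call means it can be exercised with a saved response instead of a live cluster.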
Optimistic Failure Tolerance
A majority of the voting servers in a cluster must be available to agree on changes in configuration. If a voting node becomes unavailable and the cluster is left with fewer voting nodes than the quorum size, Autopilot cannot promote a non-voter to a voter. The number of voting nodes that can fail at once without this happening is the failure tolerance of the cluster. Redundancy zones do not improve the failure tolerance of a cluster.
Say that you have a cluster configured with 2 redundancy zones, and each zone has 2 servers within it (for a total of 4 nodes in the cluster). With one voter per zone there are 2 voting nodes, so the quorum size is 2. If the zone voter in either of the redundancy zones becomes unavailable, the cluster does not have quorum and cannot agree on the configuration change needed to promote the non-voter in that zone to a voter.
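The arithmetic behind this example can be sketched in Python, assuming the usual Raft majority rule (a quorum is more than half of the voting nodes). The function names are hypothetical.

```python
def quorum_size(voters):
    """Smallest strict majority of the voting nodes."""
    return voters // 2 + 1


def failure_tolerance(voters):
    """Voting nodes that can fail at once while a quorum remains."""
    return voters - quorum_size(voters)


for zones in (2, 3):
    # One voter per redundancy zone, so the voter count equals the zone count.
    print(zones, quorum_size(zones), failure_tolerance(zones))
# prints:
# 2 2 0
# 3 2 1
```

With 2 zones the failure tolerance is 0, which is why losing a single voter stalls the promotion. A third zone raises it to 1.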
Redundancy zones do improve the optimistic failure tolerance of a cluster. The optimistic failure tolerance is the number of healthy active and backup voting servers that can fail gradually without causing an outage. As long as the Vault cluster maintains a quorum of voting nodes, it can lose nodes gradually and promote the standby nodes in its redundancy zones to take the place of failed voters.
For example, consider a cluster that is configured to have 3 redundancy zones with 2 nodes in each zone. If a voting node becomes unreachable, the zone standby in that zone is promoted. The cluster then maintains 3 voting nodes with 2 remaining standbys. The cluster can handle an additional 2 gradual failures before it loses quorum.
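The first step of this example can be traced with a small Python sketch. It models only the in-zone promotion of a standby after a voter fails, which simplifies what Autopilot does. The fail_voter and count helpers and the data layout are hypothetical.

```python
def fail_voter(cluster, zone):
    """Return a new cluster state after the voter in `zone` fails.

    `cluster` maps zone -> {"voter": id or None, "standbys": [ids]}.
    If the zone has a standby, the first one is promoted in its place.
    """
    updated = {
        z: {"voter": s["voter"], "standbys": list(s["standbys"])}
        for z, s in cluster.items()
    }
    standbys = updated[zone]["standbys"]
    updated[zone]["voter"] = standbys.pop(0) if standbys else None
    return updated


def count(cluster):
    """Return (voters, standbys) summed across all zones."""
    voters = sum(1 for s in cluster.values() if s["voter"] is not None)
    standbys = sum(len(s["standbys"]) for s in cluster.values())
    return voters, standbys


# 3 redundancy zones, each with one voter and one standby.
cluster = {
    z: {"voter": f"{z}-1", "standbys": [f"{z}-2"]} for z in ("a", "b", "c")
}
after = fail_voter(cluster, "a")
print(count(cluster), count(after))  # prints: (3, 3) (3, 2)
```

After the failure in zone a, the cluster still has 3 voters but only 2 standbys, matching the numbers in the example.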