Autopilot
Autopilot enables automated workflows for managing Raft clusters. The current feature set includes three main features: Server Stabilization, Dead Server Cleanup, and the State API, all introduced in Vault 1.7. The Enterprise feature set adds two more: Automated Upgrades and Redundancy Zones, introduced in Vault 1.11.
Server stabilization
Server stabilization helps retain the stability of the Raft cluster by safely
joining new voting nodes to the cluster. When a new voter node joins an
existing cluster, autopilot adds it as a non-voter instead and waits for a
pre-configured amount of time to monitor its health. If the node remains
healthy for the entire stabilization period, it is promoted to a voter. The
server stabilization period can be tuned using server_stabilization_time
(see below).
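For illustration, the stabilization window can be adjusted through the Autopilot configuration, for example with the raft autopilot CLI. This is only a sketch: the 30s value is a placeholder, and the command assumes a Vault version with integrated storage and the autopilot CLI.

```
$ vault operator raft autopilot set-config -server-stabilization-time=30s
```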
Dead server cleanup
Dead server cleanup automatically removes nodes deemed unhealthy from the
Raft cluster, avoiding manual operator intervention. This feature can be
tuned using the cleanup_dead_servers, dead_server_last_contact_threshold,
and min_quorum parameters (see below).
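As a sketch, dead server cleanup could be enabled together with a minimum quorum along these lines (the threshold and quorum values are examples only, not recommendations):

```
$ vault operator raft autopilot set-config \
    -cleanup-dead-servers=true \
    -dead-server-last-contact-threshold=24h \
    -min-quorum=5
```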
State API
The State API provides detailed information about all the nodes in the Raft cluster in a single call. This API can be used to monitor cluster health.
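For example, the state can be read through the CLI or the HTTP endpoint; the sys/storage/raft/autopilot/state path assumes the integrated storage (Raft) backend:

```
$ vault operator raft autopilot state

$ curl --header "X-Vault-Token: $VAULT_TOKEN" \
    $VAULT_ADDR/v1/sys/storage/raft/autopilot/state
```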
Follower health
Follower node health is determined by two factors:
- Its ability to heartbeat to the leader node at regular intervals, tuned using last_contact_threshold (see below).
- Its ability to keep up with data replication from the leader node, tuned using max_trailing_logs (see below).
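Both follower health thresholds can be tuned in a single configuration call, for example (the values below are placeholders, not recommendations):

```
$ vault operator raft autopilot set-config \
    -last-contact-threshold=10s \
    -max-trailing-logs=2000
```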
Default configuration
Autopilot is enabled by default on clusters running Vault 1.7+, although dead server cleanup is not enabled by default. Raft clusters deployed with older versions of Vault will also transition to Autopilot automatically when upgraded.
Autopilot exposes a configuration API to manage its behavior. These items cannot be set in Vault server configuration files. Autopilot is initialized with the following default values; if they do not match the behavior you expect, be sure to set them to your desired values.
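One way to confirm the values a cluster is actually running with is to read the configuration back, for example via the CLI or the HTTP endpoint (the path assumes the integrated storage backend):

```
$ vault operator raft autopilot get-config

$ curl --header "X-Vault-Token: $VAULT_TOKEN" \
    $VAULT_ADDR/v1/sys/storage/raft/autopilot/configuration
```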
- cleanup_dead_servers (bool: false) - This controls whether to remove dead servers from the Raft peer list periodically or when a new server joins. This requires that min_quorum is also set.
- dead_server_last_contact_threshold (string: "24h") - Limit on the amount of time a server can go without leader contact before being considered failed. This takes effect only when cleanup_dead_servers is set. When adding new nodes to your cluster, the dead_server_last_contact_threshold needs to be larger than the amount of time that it takes to load a Raft snapshot, otherwise the newly added nodes will be removed from your cluster before they have finished loading the snapshot and starting up. If you are using an HSM, your dead_server_last_contact_threshold needs to be larger than the response time of the HSM.
Warning
We strongly recommend keeping dead_server_last_contact_threshold at a high
duration, such as a day, as setting it too low could result in the removal of
nodes that aren't actually dead.
- min_quorum (int) - The minimum number of servers that should always be present in a cluster. Autopilot will not prune servers below this number. There is no default for this value, and it should be set to the expected number of voters in your cluster when cleanup_dead_servers is set to true. Use the quorum size guidance to determine the proper minimum quorum size for your cluster.
- max_trailing_logs (int: 1000) - Amount of entries in the Raft log that a server can be behind before being considered unhealthy. If this value is too low, it can cause the cluster to lose quorum if a follower falls behind. This value only needs to be increased from the default if you have a very high write load on Vault and you see that it takes a long time to promote new servers to become voters. This is an unlikely scenario and most users should not modify this value.
- last_contact_threshold (string: "10s") - Limit on the amount of time a server can go without leader contact before being considered unhealthy.
- server_stabilization_time (string: "10s") - Minimum amount of time a server must be in a healthy state before it can become a voter. Until that happens, it will be visible as a peer in the cluster, but as a non-voter, meaning it won't contribute to quorum.
- disable_upgrade_migration (bool: false) - Disables automatically upgrading Vault using autopilot (Enterprise-only).
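The parameters above can also be written through the configuration endpoint. The request below is a sketch: the JSON keys mirror the parameter names in this list, and the values are examples rather than recommendations.

```
$ curl --header "X-Vault-Token: $VAULT_TOKEN" \
    --request POST \
    --data '{"cleanup_dead_servers": true, "min_quorum": 3, "dead_server_last_contact_threshold": "24h"}' \
    $VAULT_ADDR/v1/sys/storage/raft/autopilot/configuration
```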
Note: Autopilot in Vault does similar things to Autopilot in Consul. However, the configuration of the two systems differs in default values and thresholds, and Vault may expose additional parameters that Consul does not. Autopilot in Vault and Consul rely on different technical underpinnings, which require these differences in order to provide the autopilot functionality.
Automated upgrades
Automated Upgrades lets you automatically upgrade a cluster of Vault nodes to a new version as updated server nodes join the cluster. Once the number of nodes on the new version is equal to or greater than the number of nodes on the old version, Autopilot will promote the newer versioned nodes to voters, demote the older versioned nodes to non-voters, and initiate a leadership transfer from the older version leader to one of the newer versioned nodes. After the leadership transfer completes, the older versioned non-voting nodes can be removed from the cluster.
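Operators who prefer to manage version upgrades manually can opt out with the disable_upgrade_migration parameter described above, for example (Enterprise only; the CLI flag name is assumed to match the parameter name):

```
$ vault operator raft autopilot set-config -disable-upgrade-migration=true
```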
Redundancy zones
Redundancy Zones provide both scaling and resiliency benefits by deploying non-voting nodes alongside voting nodes on a per availability zone basis. When using redundancy zones, each zone will have exactly one voting node and as many additional non-voting nodes as desired. If the voting node in a zone fails, a non-voting node will be automatically promoted to voter. If an entire zone is lost, a non-voting node from another zone will be promoted to voter, maintaining quorum. These non-voting nodes function not only as hot standbys, but also increase read scalability.
Replication
DR secondary and Performance secondary clusters have their own Autopilot configurations, managed independently of their primary.
The Autopilot API uses DR operation tokens for authorization when executed against a DR secondary cluster.
Tutorial
Refer to the following tutorials to learn more.