Vault
/sys/storage/raft/autopilot
Restricted endpoint
The API path can only be called from the root namespace.
The /sys/storage/raft/autopilot endpoints are used to manage raft clusters using autopilot with Vault's Integrated Storage backend.
Refer to the Integrated Storage Autopilot tutorial to learn how to manage raft clusters using autopilot.
Get cluster state
This endpoint is used to retrieve the raft cluster state. See the docs page for a description of the output.
Method | Path |
---|---|
GET | /sys/storage/raft/autopilot/state |
Sample request
$ curl \
--header "X-Vault-Token: ..." \
http://127.0.0.1:8200/v1/sys/storage/raft/autopilot/state
Sample response
{
"failure_tolerance": 1,
"healthy": true,
"leader": "vault_1",
"servers": {
"vault_1": {
"address": "127.0.0.1:8201",
"healthy": true,
"id": "vault_1",
"last_contact": "0s",
"last_index": 63,
"last_term": 3,
"name": "vault_1",
"node_status": "alive",
"node_type": "voter",
"stable_since": "2024-08-29T16:02:45.639829+02:00",
"status": "leader",
"version": "1.17.3"
},
"vault_2": {
"address": "127.0.0.1:8203",
"healthy": true,
"id": "vault_2",
"last_contact": "678.62575ms",
"last_index": 63,
"last_term": 3,
"name": "vault_2",
"node_status": "alive",
"node_type": "voter",
"stable_since": "2024-08-29T16:02:47.640976+02:00",
"status": "voter",
"version": "1.17.3"
},
"vault_3": {
"address": "127.0.0.1:8205",
"healthy": true,
"id": "vault_3",
"last_contact": "3.969159375s",
"last_index": 63,
"last_term": 3,
"name": "vault_3",
"node_status": "alive",
"node_type": "voter",
"stable_since": "2024-08-29T16:02:49.640905+02:00",
"status": "voter",
"version": "1.17.3"
}
},
"voters": [
"vault_1",
"vault_2",
"vault_3"
]
}
The failure_tolerance of a cluster is the number of nodes in the cluster that could fail gradually without causing an outage.
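As a rough illustration of where a value like this comes from, the sketch below applies standard Raft majority arithmetic. The function names are hypothetical, and the exact calculation Autopilot performs is an assumption here rather than something this page specifies.

```python
def raft_quorum(voters: int) -> int:
    """Smallest majority of `voters` voting members."""
    return voters // 2 + 1


def failure_tolerance(healthy_voters: int, total_voters: int) -> int:
    """Healthy voters that can be lost before quorum is lost (never negative)."""
    return max(0, healthy_voters - raft_quorum(total_voters))


# Print the quorum size and tolerance for a few fully healthy cluster sizes.
for n in (1, 2, 3, 5):
    print(n, raft_quorum(n), failure_tolerance(n, n))
```

Under this assumption, three healthy voters tolerate one failure and two healthy voters tolerate none, which agrees with the failure_tolerance values in the sample responses on this page.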
When verifying the health of your cluster, check the following fields of each server:
- healthy: whether Autopilot considers this node healthy or not
- status: the voting status of the node. This will be voter, leader, or non-voter.
- last_index: the index of the last applied Raft log. This should be close to the last_index value of the leader.
- version: the version of Vault running on the server
- node_type: the type of node. On CE, this will always be voter. See below for an explanation of Enterprise node types.
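The field checks above could be scripted along these lines. This is a minimal sketch that works on an already parsed state response; the function name check_servers and the max_index_lag threshold are hypothetical, and what counts as "close to" the leader's last_index is left for you to decide.

```python
from typing import Any


def check_servers(state: dict[str, Any], max_index_lag: int = 100) -> list[str]:
    """Return findings for each server; an empty list means nothing was flagged."""
    findings = []
    servers = state.get("servers", {})
    leader = servers.get(state.get("leader", ""), {})
    leader_index = leader.get("last_index", 0)
    leader_version = leader.get("version")
    for name, srv in sorted(servers.items()):
        if not srv.get("healthy", False):
            findings.append(f"{name}: not healthy")
        # max_index_lag is an assumed threshold, not a Vault setting.
        lag = leader_index - srv.get("last_index", 0)
        if lag > max_index_lag:
            findings.append(f"{name}: last_index trails the leader by {lag}")
        if srv.get("version") != leader_version:
            findings.append(f"{name}: version differs from the leader")
    return findings


# Trimmed-down state with made-up values for illustration.
state = {
    "leader": "vault_1",
    "servers": {
        "vault_1": {"healthy": True, "status": "leader", "last_index": 63, "version": "1.17.3"},
        "vault_2": {"healthy": True, "status": "voter", "last_index": 63, "version": "1.17.3"},
        "vault_3": {"healthy": False, "status": "non-voter", "last_index": 10, "version": "1.17.3"},
    },
}
for finding in check_servers(state, max_index_lag=50):
    print(finding)
```

The function only reads fields that appear in the sample response, so it can be fed the JSON returned by the state endpoint after parsing.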
Enterprise only
Vault Enterprise will include additional output in its API response to indicate the current state of redundancy zones, automated upgrade progress (if any), and optimistic failure tolerance.
Sample response (Enterprise)
{
"failure_tolerance": 0,
"healthy": true,
"leader": "vault_1",
"optimistic_failure_tolerance": 3,
"redundancy_zones": {
"a": {
"servers": [
"vault_1",
"vault_2",
"vault_5"
],
"voters": [
"vault_1"
],
"failure_tolerance": 2
},
"b": {
"servers": [
"vault_3",
"vault_4"
],
"voters": [
"vault_3"
],
"failure_tolerance": 1
}
},
"upgrade_info": {
"other_version_non_voters": [
"vault_2",
"vault_4"
],
"other_version_voters": [
"vault_1",
"vault_3"
],
"redundancy_zones": {
"a": {
"target_version_non_voters": [
"vault_5"
],
"other_version_voters": [
"vault_1"
],
"other_version_non_voters": [
"vault_2"
]
},
"b": {
"other_version_voters": [
"vault_3"
],
"other_version_non_voters": [
"vault_4"
]
}
},
"status": "await-new-voters",
"target_version": "1.17.5",
"target_version_non_voters": [
"vault_5"
]
},
"voters": [
"vault_1",
"vault_3"
]
}
optimistic_failure_tolerance describes the number of healthy active and back-up voting servers that can fail gradually without causing an outage.
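To make the redundancy zone output easier to scan, a parsed Enterprise response can be condensed per zone. The sketch below is only an illustration: summarize_zones is a hypothetical helper, and it treats any zone server that is not listed under voters as a non-voter.

```python
from typing import Any


def summarize_zones(state: dict[str, Any]) -> list[dict[str, Any]]:
    """Condense each redundancy zone into voters, non-voters, and tolerance."""
    rows = []
    for zone, info in sorted(state.get("redundancy_zones", {}).items()):
        voters = info.get("voters", [])
        # Assumption: servers in the zone that are not voters are non-voters.
        non_voters = [s for s in info.get("servers", []) if s not in voters]
        rows.append({
            "zone": zone,
            "voters": voters,
            "non_voters": non_voters,
            "failure_tolerance": info.get("failure_tolerance"),
        })
    return rows


# The redundancy_zones portion of the Enterprise sample response.
state = {
    "redundancy_zones": {
        "a": {"servers": ["vault_1", "vault_2", "vault_5"], "voters": ["vault_1"], "failure_tolerance": 2},
        "b": {"servers": ["vault_3", "vault_4"], "voters": ["vault_3"], "failure_tolerance": 1},
    }
}
for row in summarize_zones(state):
    print(row)
```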
Enterprise Node Types
- voter: The server is a Raft voter and contributing to quorum.
- read-replica: The server is not a Raft voter, but receives a replica of all data.
- zone-voter: The main Raft voter in a redundancy zone.
- zone-extra-voter: An additional Raft voter in a redundancy zone.
- zone-standby: A non-voter in a redundancy zone that can be promoted to a voter, if needed.
Get configuration
This endpoint is used to get the configuration of the autopilot subsystem of Integrated Storage.
Method | Path |
---|---|
GET | /sys/storage/raft/autopilot/configuration |
Sample request
$ curl \
--header "X-Vault-Token: ..." \
http://127.0.0.1:8200/v1/sys/storage/raft/autopilot/configuration
Sample response
{
"cleanup_dead_servers": false,
"dead_server_last_contact_threshold": "24h0m0s",
"last_contact_threshold": "10s",
"max_trailing_logs": 1000,
"min_quorum": 0,
"server_stabilization_time": "10s",
"disable_upgrade_migration": true
}
Note that in the above sample response, disable_upgrade_migration is an Enterprise-only field.
Set configuration
This endpoint is used to modify the configuration of the autopilot subsystem of Integrated Storage.
Method | Path |
---|---|
POST | /sys/storage/raft/autopilot/configuration |
Parameters
Autopilot exposes a configuration API to manage its behavior. These items cannot be set in Vault server configuration files. Autopilot is initialized with the following default values. If the defaults do not give you the autopilot behavior you expect, set them to your desired values.
- cleanup_dead_servers (bool: false) - This controls whether to remove dead servers from the Raft peer list periodically or when a new server joins. This requires that min_quorum is also set.
- dead_server_last_contact_threshold (string: "24h") - Limit on the amount of time a server can go without leader contact before being considered failed. This takes effect only when cleanup_dead_servers is set. When adding new nodes to your cluster, the dead_server_last_contact_threshold needs to be larger than the amount of time that it takes to load a Raft snapshot, otherwise the newly added nodes will be removed from your cluster before they have finished loading the snapshot and starting up. If you are using an HSM, your dead_server_last_contact_threshold needs to be larger than the response time of the HSM.
Warning
We strongly recommend keeping dead_server_last_contact_threshold at a high duration, such as a day, because a value that is too low could result in the removal of nodes that aren't actually dead.
- min_quorum (int) - The minimum number of servers that should always be present in a cluster. Autopilot will not prune servers below this number. There is no default for this value and it should be set to the expected number of voters in your cluster when cleanup_dead_servers is set to true. Use the quorum size guidance to determine the proper minimum quorum size for your cluster.
- max_trailing_logs (int: 1000) - Number of entries in the Raft log that a server can be behind before being considered unhealthy. If this value is too low, it can cause the cluster to lose quorum if a follower falls behind. This value only needs to be increased from the default if you have a very high write load on Vault and you see that it takes a long time to promote new servers to voters. This is an unlikely scenario and most users should not modify this value.
- last_contact_threshold (string: "10s") - Limit on the amount of time a server can go without leader contact before being considered unhealthy.
- server_stabilization_time (string: "10s") - Minimum amount of time a server must be in a healthy state before it can become a voter. Until that happens, it will be visible as a peer in the cluster, but as a non-voter, meaning it won't contribute to quorum.
- disable_upgrade_migration (bool: false) - Disables automatically upgrading Vault using autopilot (Enterprise-only).
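The dependency between cleanup_dead_servers and min_quorum described above can be checked before a payload is sent. The following is a minimal sketch; check_cleanup_settings and its expected_voters argument are hypothetical names, and the check is a client-side convenience rather than validation that Vault itself is known to perform.

```python
from typing import Any


def check_cleanup_settings(payload: dict[str, Any], expected_voters: int) -> list[str]:
    """Flag payloads that enable dead-server cleanup without a sensible min_quorum."""
    problems = []
    if payload.get("cleanup_dead_servers"):
        min_quorum = payload.get("min_quorum")
        if min_quorum is None:
            problems.append("cleanup_dead_servers is true but min_quorum is not set")
        elif int(min_quorum) != expected_voters:
            problems.append(
                f"min_quorum is {min_quorum}, expected {expected_voters} voters"
            )
    return problems


# A payload that enables cleanup but omits min_quorum is flagged.
print(check_cleanup_settings({"cleanup_dead_servers": True}, expected_voters=3))
# A payload whose min_quorum matches the expected voter count passes.
print(check_cleanup_settings({"cleanup_dead_servers": True, "min_quorum": 3}, expected_voters=3))
```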
Sample request
$ curl \
--header "X-Vault-Token: ..." \
--request POST \
--data @payload.json \
http://127.0.0.1:8200/v1/sys/storage/raft/autopilot/configuration
Sample payload
{
"cleanup_dead_servers": true,
"last_contact_threshold": "10s",
"dead_server_last_contact_threshold": "24h",
"max_trailing_logs": 1000,
"min_quorum": 3,
"server_stabilization_time": "10s",
"disable_upgrade_migration": true
}
Note that in the above sample payload, disable_upgrade_migration is an Enterprise-only field.
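The same request can be prepared from Python with only the standard library. This sketch mirrors the curl sample above; build_set_config_request is a hypothetical helper, the address and token are placeholders, and the request is built but not sent so the example runs without a Vault server.

```python
import json
import urllib.request
from typing import Any

AUTOPILOT_CONFIG_PATH = "/v1/sys/storage/raft/autopilot/configuration"


def build_set_config_request(
    addr: str, token: str, config: dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) the POST request that updates the configuration."""
    return urllib.request.Request(
        addr.rstrip("/") + AUTOPILOT_CONFIG_PATH,
        data=json.dumps(config).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address and token; replace them with your own values.
request = build_set_config_request(
    "http://127.0.0.1:8200",
    "...",
    {"cleanup_dead_servers": True, "min_quorum": 3},
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body before anything reaches the cluster.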