/sys/storage/raft/autopilot

Restricted endpoint

The API path can only be called from the root namespace.

The /sys/storage/raft/autopilot endpoints manage Raft clusters using autopilot with Vault's Integrated Storage backend. Refer to the Integrated Storage Autopilot tutorial for a walkthrough of managing Raft clusters with autopilot.

Get cluster state

This endpoint is used to retrieve the raft cluster state. See the docs page for a description of the output.

Method	Path
`GET`	`/sys/storage/raft/autopilot/state`

Sample request

$ curl \
    --header "X-Vault-Token: ..." \
    http://127.0.0.1:8200/v1/sys/storage/raft/autopilot/state

Sample response

{
  "failure_tolerance": 1,
  "healthy": true,
  "leader": "vault_1",
  "servers": {
    "vault_1": {
      "address": "127.0.0.1:8201",
      "healthy": true,
      "id": "vault_1",
      "last_contact": "0s",
      "last_index": 63,
      "last_term": 3,
      "name": "vault_1",
      "node_status": "alive",
      "node_type": "voter",
      "stable_since": "2024-08-29T16:02:45.639829+02:00",
      "status": "leader",
      "version": "1.17.3"
    },
    "vault_2": {
      "address": "127.0.0.1:8203",
      "healthy": true,
      "id": "vault_2",
      "last_contact": "678.62575ms",
      "last_index": 63,
      "last_term": 3,
      "name": "vault_2",
      "node_status": "alive",
      "node_type": "voter",
      "stable_since": "2024-08-29T16:02:47.640976+02:00",
      "status": "voter",
      "version": "1.17.3"
    },
    "vault_3": {
      "address": "127.0.0.1:8205",
      "healthy": true,
      "id": "vault_3",
      "last_contact": "3.969159375s",
      "last_index": 63,
      "last_term": 3,
      "name": "vault_3",
      "node_status": "alive",
      "node_type": "voter",
      "stable_since": "2024-08-29T16:02:49.640905+02:00",
      "status": "voter",
      "version": "1.17.3"
    }
  },
  "voters": [
    "vault_1",
    "vault_2",
    "vault_3"
  ]
}

The failure_tolerance of a cluster is the number of nodes in the cluster that could fail gradually without causing an outage.

When verifying the health of your cluster, check the following fields of each server:

healthy: whether Autopilot considers this node healthy or not
status: the voting status of the node. This will be voter, leader, or non-voter.
last_index: the index of the last applied Raft log. This should be close to the last_index value of the leader.
version: the version of Vault running on the server
node_type: the type of node. On CE, this will always be voter. See below for an explanation of Enterprise node types.
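
The checks listed above can be sketched in Python against a parsed state response. This is a minimal illustration, not an official client: the function name find_cluster_problems and the max_index_lag threshold are assumptions, since the documentation only says that last_index should be "close to" the leader's value.

```python
def find_cluster_problems(state, max_index_lag=10):
    """Return a list of problems found in a parsed autopilot state response.

    max_index_lag is a hypothetical threshold for "close to the leader";
    the documentation does not define a number.
    """
    problems = []
    servers = state.get("servers", {})
    leader = servers.get(state.get("leader"))
    if leader is None:
        problems.append("leader is not listed in servers")

    for server_id, server in sorted(servers.items()):
        # healthy: whether Autopilot considers this node healthy.
        if not server.get("healthy", False):
            problems.append(f"{server_id}: not healthy according to autopilot")
        if leader is not None:
            # last_index should be close to the leader's last_index.
            lag = leader["last_index"] - server["last_index"]
            if lag > max_index_lag:
                problems.append(f"{server_id}: last_index is {lag} entries behind the leader")
            # version: flag servers running a different Vault version.
            if server.get("version") != leader.get("version"):
                problems.append(f"{server_id}: version {server.get('version')} differs from the leader")
    return problems


# Trimmed copy of the sample response above.
sample_state = {
    "leader": "vault_1",
    "servers": {
        "vault_1": {"healthy": True, "status": "leader", "last_index": 63, "version": "1.17.3"},
        "vault_2": {"healthy": True, "status": "voter", "last_index": 63, "version": "1.17.3"},
    },
}
print(find_cluster_problems(sample_state))
```

An empty list means none of the checks fired. A version mismatch is expected during a rolling upgrade, so treat that message as informational rather than as a failure.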

Enterprise only

Vault Enterprise will include additional output in its API response to indicate the current state of redundancy zones, automated upgrade progress (if any), and optimistic failure tolerance.

Sample response (Enterprise)

{
  "failure_tolerance": 0,
  "healthy": true,
  "leader": "vault_1",
  "optimistic_failure_tolerance": 3,
  "redundancy_zones": {
    "a": {
      "servers": [
        "vault_1",
        "vault_2",
        "vault_5"
      ],
      "voters": [
        "vault_1"
      ],
      "failure_tolerance": 2
    },
    "b": {
      "servers": [
        "vault_3",
        "vault_4"
      ],
      "voters": [
        "vault_3"
      ],
      "failure_tolerance": 1
    }
  },
  "upgrade_info": {
    "other_version_non_voters": [
      "vault_2",
      "vault_4"
    ],
    "other_version_voters": [
      "vault_1",
      "vault_3"
    ],
    "redundancy_zones": {
      "a": {
        "target_version_non_voters": [
          "vault_5"
        ],
        "other_version_voters": [
          "vault_1"
        ],
        "other_version_non_voters": [
          "vault_2"
        ]
      },
      "b": {
        "other_version_voters": [
          "vault_3"
        ],
        "other_version_non_voters": [
          "vault_4"
        ]
      }
    },
    "status": "await-new-voters",
    "target_version": "1.17.5",
    "target_version_non_voters": [
      "vault_5"
    ]
  },
  "voters": [
    "vault_1",
    "vault_3"
  ]
}

optimistic_failure_tolerance describes the number of healthy active and back-up voting servers that can fail gradually without causing an outage.

Enterprise Node Types

voter: The server is a Raft voter and contributing to quorum.
read-replica: The server is not a Raft voter, but receives a replica of all data.
zone-voter: The main Raft voter in a redundancy zone.
zone-extra-voter: An additional Raft voter in a redundancy zone.
zone-standby: A non-voter in a redundancy zone that can be promoted to a voter, if needed.
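
As a rough illustration of how these node types relate to quorum, the sketch below groups servers by whether their node_type is described above as a Raft voter. The grouping is inferred from the descriptions in this list, and the function name, the set names, and the example servers are hypothetical.

```python
# Node types described above as Raft voters.
VOTING_TYPES = {"voter", "zone-voter", "zone-extra-voter"}
# Node types described above as non-voters.
NON_VOTING_TYPES = {"read-replica", "zone-standby"}


def group_by_quorum_role(servers):
    """Group server IDs from a state response by their quorum role."""
    groups = {"voting": [], "non_voting": [], "unknown": []}
    for server_id, server in sorted(servers.items()):
        node_type = server.get("node_type")
        if node_type in VOTING_TYPES:
            groups["voting"].append(server_id)
        elif node_type in NON_VOTING_TYPES:
            groups["non_voting"].append(server_id)
        else:
            # Keep unrecognized values visible instead of guessing.
            groups["unknown"].append(server_id)
    return groups


# Hypothetical servers map, shaped like the "servers" field of the state response.
example_servers = {
    "vault_1": {"node_type": "zone-voter"},
    "vault_2": {"node_type": "zone-standby"},
    "vault_3": {"node_type": "read-replica"},
}
print(group_by_quorum_role(example_servers))
```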

Get configuration

This endpoint is used to get the configuration of the autopilot subsystem of Integrated Storage.

Method	Path
`GET`	`/sys/storage/raft/autopilot/configuration`

Sample request

$ curl \
    --header "X-Vault-Token: ..." \
    http://127.0.0.1:8200/v1/sys/storage/raft/autopilot/configuration

Sample response

{
  "cleanup_dead_servers": false,
  "dead_server_last_contact_threshold": "24h0m0s",
  "last_contact_threshold": "10s",
  "max_trailing_logs": 1000,
  "min_quorum": 0,
  "server_stabilization_time": "10s",
  "disable_upgrade_migration": true
}

Note that in the above sample response, disable_upgrade_migration is an Enterprise-only field.
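
The threshold fields in this response are duration strings such as "24h0m0s" and "10s". To compare them in a script, they need to be parsed first. The sketch below assumes the values follow Go's duration syntax (a sequence of numbers with unit suffixes); it is a simplified parser that does not handle signs or a bare "0", and the function name parse_duration is hypothetical.

```python
import re
from datetime import timedelta

# Seconds per unit for the suffixes assumed to appear in duration strings.
_UNIT_SECONDS = {
    "ns": 1e-9,
    "us": 1e-6,
    "µs": 1e-6,
    "ms": 1e-3,
    "s": 1.0,
    "m": 60.0,
    "h": 3600.0,
}
# One number (optionally fractional) followed by one unit suffix.
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_duration(text):
    """Parse a duration string such as "24h0m0s" into a timedelta."""
    if not text:
        raise ValueError("empty duration")
    total_seconds = 0.0
    position = 0
    while position < len(text):
        match = _TOKEN.match(text, position)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total_seconds += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    return timedelta(seconds=total_seconds)


config = {
    "dead_server_last_contact_threshold": "24h0m0s",
    "last_contact_threshold": "10s",
}
dead = parse_duration(config["dead_server_last_contact_threshold"])
unhealthy = parse_duration(config["last_contact_threshold"])
print(dead > unhealthy)
```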

Set configuration

This endpoint is used to modify the configuration of the autopilot subsystem of Integrated Storage.

Method	Path
`POST`	`/sys/storage/raft/autopilot/configuration`

Parameters

Autopilot exposes a configuration API to manage its behavior. These items cannot be set in Vault server configuration files. Autopilot is initialized with the following default values. If the defaults do not match the autopilot behavior you need, set them explicitly through this endpoint.

cleanup_dead_servers (bool: false) - This controls whether to remove dead servers from the Raft peer list periodically or when a new server joins. This requires that min_quorum is also set.
dead_server_last_contact_threshold (string: "24h") - Limit on the amount of time a server can go without leader contact before being considered failed. This takes effect only when cleanup_dead_servers is set. When adding new nodes to your cluster, the dead_server_last_contact_threshold needs to be larger than the amount of time that it takes to load a Raft snapshot, otherwise the newly added nodes will be removed from your cluster before they have finished loading the snapshot and starting up. If you are using an HSM, your dead_server_last_contact_threshold needs to be larger than the response time of the HSM.

Warning

We strongly recommend keeping dead_server_last_contact_threshold at a high duration, such as a day. A value that is too low can result in the removal of nodes that are not actually dead.

min_quorum (int) - The minimum number of servers that should always be present in a cluster. Autopilot will not prune servers below this number. There is no default for this value, and it should be set to the expected number of voters in your cluster when cleanup_dead_servers is set to true. Use the quorum size guidance to determine the proper minimum quorum size for your cluster.
max_trailing_logs (int: 1000) - Number of entries in the Raft log that a server can be behind before being considered unhealthy. If this value is too low, it can cause the cluster to lose quorum if a follower falls behind. This value only needs to be increased from the default if you have a very high write load on Vault and you see that it takes a long time to promote new servers to voters. This is an unlikely scenario and most users should not modify this value.
last_contact_threshold (string: "10s") - Limit on the amount of time a server can go without leader contact before being considered unhealthy.
server_stabilization_time (string: "10s") - Minimum amount of time a server must be in a healthy state before it can become a voter. Until that happens, it will be visible as a peer in the cluster, but as a non-voter, meaning it won't contribute to quorum.
disable_upgrade_migration (bool: false) - Disables automatically upgrading Vault using autopilot (Enterprise only).
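
The dependency between cleanup_dead_servers and min_quorum is easy to miss when writing a payload by hand. The sketch below checks it on the client side before the payload is written. It is only an illustration: build_autopilot_payload and KNOWN_SETTINGS are hypothetical names, and whatever validation the server performs is not shown here.

```python
import json

# Parameter names taken from the list above.
KNOWN_SETTINGS = {
    "cleanup_dead_servers",
    "dead_server_last_contact_threshold",
    "min_quorum",
    "max_trailing_logs",
    "last_contact_threshold",
    "server_stabilization_time",
    "disable_upgrade_migration",
}


def build_autopilot_payload(**settings):
    """Return a JSON payload for the configuration endpoint, or raise ValueError."""
    unknown = sorted(set(settings) - KNOWN_SETTINGS)
    if unknown:
        raise ValueError(f"unknown autopilot settings: {', '.join(unknown)}")
    # cleanup_dead_servers requires that min_quorum is also set.
    if settings.get("cleanup_dead_servers") and not settings.get("min_quorum"):
        raise ValueError("cleanup_dead_servers requires min_quorum to be set")
    return json.dumps(settings, indent=2, sort_keys=True)


print(build_autopilot_payload(
    cleanup_dead_servers=True,
    min_quorum=3,
    dead_server_last_contact_threshold="24h",
))
```

The returned string can be saved as payload.json and sent with the curl command in the sample request.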

Sample request

$ curl \
    --header "X-Vault-Token: ..." \
    --request POST \
    --data @payload.json \
    http://127.0.0.1:8200/v1/sys/storage/raft/autopilot/configuration

Sample payload

{
  "cleanup_dead_servers": true,
  "last_contact_threshold": "10s",
  "dead_server_last_contact_threshold": "24h",
  "max_trailing_logs": 1000,
  "min_quorum": 3,
  "server_stabilization_time": "10s",
  "disable_upgrade_migration": true
}

Note that in the above sample payload, disable_upgrade_migration is an Enterprise-only field.
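
For scripts that cannot shell out to curl, the same request can be assembled with Python's standard library. This is a minimal sketch under stated assumptions: the function name autopilot_config_request is hypothetical, the Vault address and token are supplied by the caller, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def autopilot_config_request(vault_addr, token, payload):
    """Build (but do not send) a POST request for the autopilot configuration endpoint."""
    url = vault_addr.rstrip("/") + "/v1/sys/storage/raft/autopilot/configuration"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


# Example usage; the token value is a placeholder supplied by the caller.
# request = autopilot_config_request("http://127.0.0.1:8200", my_token, {"min_quorum": 3})
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```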