Troubleshoot and tune enterprise replication
This guide covers how to troubleshoot and tune Vault Enterprise replication between primary and secondary clusters. More specifically, you will learn how to keep Vault streaming write-ahead logs (WALs) and reduce the number of Merkle syncs.
This guide presumes that the primary Vault cluster is stable and isn't experiencing frequent leadership changes. For Integrated Storage users, review the performance multiplier documentation for tuning leader elections.
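For Integrated Storage, that multiplier is set in the raft storage stanza of the server configuration. The snippet below is only a sketch with placeholder paths and node names; confirm the parameter and its accepted values in the Integrated Storage documentation for your Vault version.
```
storage "raft" {
  path    = "/opt/vault/data"    # placeholder data directory
  node_id = "vault-primary-1"    # placeholder node name

  # Lower values make Raft leader elections more responsive on fast
  # hardware; higher values tolerate slower disks and networks before
  # triggering a leadership change.
  performance_multiplier = 1
}
```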
Personas
The troubleshooting and tuning described here is typically performed by a security operations engineer.
Disks and networking
The first consideration for optimizing replication is to ensure the underlying hardware used by Vault can handle the workload. Ultimately this means ensuring proper disk and network performance.
Tip
A good rule of thumb is to use the same server hardware for primary and secondary clusters. Performance replication can require more or less hardware depending on how it's used. Remember that performance replicas do not receive tokens and leases from the primary cluster. This means that the amount of data replicated depends on the number of secret engines and namespaces on the primary cluster.
Disks
The disks used by the Vault storage backend are critical for stable replication. Disk performance is typically measured in input/output operations per second (IOPS). Vault's storage backend requires enough IOPS to efficiently write changes to disk and service other requests, such as read operations.
No recommended baseline IOPS level exists because the baseline is ultimately dependent on the scale at which you use Vault. Telemetry from Vault can give insight into how well disks are performing. The two storage backends officially supported by Vault, Integrated Storage and Consul, feature specific telemetry metrics for storage:
Metric | Storage | Description |
---|---|---|
vault.raft-storage.transaction | Integrated Storage | Time to insert operations into a single log. Typically this value should be lower than 50ms per transaction. |
vault.raft-storage.get | Integrated Storage | Time to read values for a path from storage. Typically this value should be lower than 50ms per read. |
vault.raft.fsm.applyBatch | Integrated Storage | Time to apply a batch of logs. Typically this value should be lower than 50ms. |
vault.consul.transaction | Consul | Time to insert operations into a single log. Typically this value should be lower than 50ms per transaction. |
vault.consul.get | Consul | Time to read values for a path from storage. Typically this value should be lower than 50ms per read. |
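To collect and scrape these metrics, telemetry must be enabled in the Vault server configuration. A minimal sketch, with an illustrative retention window:
```
telemetry {
  # Retain metrics so they can be scraped from the /v1/sys/metrics endpoint.
  prometheus_retention_time = "30s"

  # Report gauge values without prefixing them with the local hostname.
  disable_hostname = true
}
```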
Tip
If available, it can also be helpful to check disk-specific telemetry (such as CloudWatch metrics in AWS) and compare it with Vault's telemetry.
When working with disks in the cloud, it's important to understand the guaranteed performance of the disks used by Vault. Cloud providers generally offer several types of disks for different workloads, and certain disks have their performance throttled. When choosing a disk for Vault, pay attention to the guaranteed IOPS of the disk, and ensure that the disk will not be automatically throttled.
Network latency
Replication between clusters occurs over a network, and the speed of that network impacts replication performance. When replicating to a secondary cluster, the network connecting the Vault clusters should have low latency. Besides limiting how fast WALs stream between clusters, network latency can affect other parts of replication:
- Merkle diff sync requires several round trips as the secondary determines what keys are conflicting.
- Forwarding write requests adds additional overhead to connections.
Low latency is not always achievable, especially when Vault replicates across large geographic distances. In situations where low-latency networking is not an option, refer to the log shipper tuning guidance below.
Log shipper tuning
The log shipper is the part of Vault replication responsible for shipping write-ahead logs (WALs) to secondary clusters. Vault maintains two log shippers: one for disaster recovery secondaries and one for performance replication secondaries.
Each log shipper is an in-memory buffer containing recent WALs and the hashed value of the Merkle tree root before each WAL was applied. Each WAL can contain up to 62 batched changes. As the primary inserts data into its Merkle tree, it also prepends the corresponding WAL to the log shipper buffer. Once the buffer is full, the oldest WAL in the buffer is removed as new entries are added.
When Vault is operating properly, replication runs in stream-wals mode, where WALs stream to the secondary and changes are applied in near real time. If a secondary cluster falls behind while streaming WALs, it may eventually request WALs for a Merkle root that is no longer in the buffer.
Once this occurs, the secondary enters merkle-diff mode to find the keys that differ between the clusters. When the diff completes, the secondary cluster enters merkle-sync mode, which requests the conflicting keys from the primary and applies them locally. After syncing all the conflicting keys, the secondary active node enters stream-wals mode again and changes stream once more.
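You can confirm which mode a cluster is in with the replication status endpoint, for example:
```
$ vault read -format=json sys/replication/status
```
The state field in the output reports stream-wals, merkle-diff, or merkle-sync for each enabled replication type, which is a quick way to tell whether a secondary has fallen out of streaming mode.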
The log shipper has a configurable length for the number of WALs it can hold at any given time. By default, the log shipper can hold 16384 WAL entries. The log shipper's total size is also capped to avoid consuming too much memory on a system. The default size is 10% of the total memory available on the server. The two log shippers (one for disaster recovery replication, one for performance replication) split the configured size between them.
Note
Log shippers allocate 10% of the server's memory regardless of whether they're used.
Tuning log shipper buffer length
The logshipper_buffer_length replication setting configures the number of WAL entries the log shipper can contain at any given time. By default, the length is 16384 entries, but you may need to tune this value depending on how you use Vault.
To begin tuning the log shipper, increase the log level on the secondary cluster to debug.
Debug-level logs include more replication-related entries and contain helpful information for tuning the log shipper. While logs from both the primary and secondary clusters are useful, the most helpful logs are typically those from the secondary cluster. Review the Vault documentation for more information on changing the log level.
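If restarting servers to change the configured log level is inconvenient, the vault monitor command can stream logs from a running server at a higher verbosity:
```
$ vault monitor -log-level=debug
```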
Once the Vault server logs are at the debug level, search for Merkle sync logs containing num_conflict_keys.
For example:
```
[DEBUG] replication: starting merkle sync: num_conflict_keys=129387
```
A good rule of thumb is that logshipper_buffer_length should be greater than the number of conflicting keys. Several of these log entries can appear depending on how often Merkle sync runs; generally, the largest value is a good starting point. Note that log shipper values must be configured on the cluster shipping WALs, and the Vault servers in that cluster must be restarted for the change to take effect.
Given the example log line above, your configuration might resemble the following:
```
replication {
  logshipper_buffer_length = 130000
}
```
Another option is to set logshipper_buffer_length to an arbitrarily high value, such as one million. This is safe because the logshipper_buffer_size setting, covered later in this guide, caps how much memory the log shipper buffers can use. If replication performance improves, you can gradually dial this value back until you find an appropriate setting.
Vault telemetry has some useful metrics to check log shipper performance:
Metric | Description |
---|---|
vault.logshipper.streamWALs.missing_guard | The number of times a secondary requested a WAL entry that was not found in the log shipper. This can mean that the logshipper_buffer_length value needs to be increased. |
vault.logshipper.streamWALs.scanned_entries | The number of entries in the log shipper scanned before finding the right entry. This can be useful for learning how far behind a secondary is on average. |
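Assuming Prometheus-format telemetry is enabled as shown earlier, you can spot-check these counters directly against the active node that ships WALs:
```
$ curl --header "X-Vault-Token: $VAULT_TOKEN" \
    "$VAULT_ADDR/v1/sys/metrics?format=prometheus" | grep logshipper
```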
Tuning log shipper buffer size
The logshipper_buffer_size replication setting caps how much memory the log shipper can consume. By default, Vault allows the log shippers to consume 10% of the total memory found on the server, so each log shipper uses 5% of the available memory. For most situations, this shouldn't require tuning because servers typically have plenty of memory. Think of it as a safety net that prevents an over-tuned log shipper buffer length from running the server out of memory.
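If you do want to set the cap explicitly, it lives in the same replication stanza as the buffer length. The values below are purely illustrative; confirm the accepted units for logshipper_buffer_size in the replication configuration documentation for your Vault version.
```
replication {
  # Allow a long buffer so secondaries rarely request a missing WAL...
  logshipper_buffer_length = 1000000

  # ...but cap the total memory the log shipper buffers may consume.
  logshipper_buffer_size = "1gb"
}
```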
The following metrics are useful for tracking how full the log shipper buffers are:
Metric | Description |
---|---|
vault.logshipper.buffer.length | The current number of WAL entries held in the log shipper buffer. |
vault.logshipper.buffer.max_length | The maximum number of WAL entries the log shipper buffer can hold. |
Client behavior
In some situations, clients can generate a large volume of WALs to replicate, which can cause the secondary to lag behind until it eventually needs to resynchronize. The following examples can help you optimize how clients use Vault to decrease the number of WALs that need to be replicated.
Token creation, renewals and revocations
Clients that generate tokens often (for example, several times a second) can cause a large number of DR-replicated changes because each token has a lease associated with it. As leases expire or renew, they cause even more changes to the Merkle tree, which can make it difficult for secondaries to apply all the changes on busy systems.
Once a token lease expires, the token is revoked, which causes even more changes that need replication. When possible, clients should reuse tokens and renew the lease associated with the token instead of requesting new tokens.
A good rule of thumb is to renew a token lease at about 80% of the lease's time-to-live (TTL) value.
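For clients that manage their own tokens, renewal can be as simple as calling the token renew endpoint before the TTL elapses. A minimal CLI sketch, where the increment value is a placeholder:
```
# Renew the client's own token rather than requesting a new one.
$ vault token renew -increment=1h
```
In practice, Vault Agent can handle renewal automatically, which avoids hand-rolled renewal loops in each client.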
Local secret engines
You can specify that certain Vault mounts are local. A local mount is not replicated. Depending on how users interact with Vault, a secondary may not need every secret mount replicated, and marking those mounts as local can reduce the amount of WAL shipped to secondary clusters.
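You mark a mount as local when you enable it. For example, a key/value engine that only the primary cluster needs could be enabled like this, where the path is just an example:
```
$ vault secrets enable -local -path=primary-only kv-v2
```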
Note
Auth methods do not replicate much data themselves, so setting an auth method to local is unlikely to improve replication performance. Token creation and leases are far more likely to affect replication when overused.
Vault 1.13 enhancements
If keys within Vault update often (several times a second, for example), Merkle sync can sometimes fail because the keys change so frequently that the Merkle trees never match. Examples of frequently updated keys within Vault include (but are not limited to):
- Updating values in a KV secret
- Adding or removing users in auth method roles
- Changing values in plugin configurations
- Updating identity entities to include or remove users
Vault 1.13 introduced undo logs, which add extra resiliency to log shipping. Undo logs help prevent repeated Merkle syncs caused by rapid key changes in the primary Merkle tree while the secondary tries to synchronize. For Integrated Storage users, upgrading to Vault 1.13 enables this feature by default. Consul storage users also need to upgrade Consul to 1.14 to use this feature.