Monitor Raft metrics and logs for WAL

This topic describes how to monitor Raft metrics and logs if you are testing the WAL backend. We strongly recommend monitoring the Consul cluster, especially the target server, for evidence that the WAL backend is not functioning correctly. Refer to Enable the experimental WAL LogStore backend for additional information about the WAL backend.

Upgrade warning: The WAL LogStore backend is experimental.

Monitor for checksum failures

Log store verification failures on any server, regardless of whether you are running the BoltDB or WAL backed, are unrecoverable errors. Consul may report the following errors in logs.

Read failures: Disk Corruption

2022-11-15T22:41:23.546Z [ERROR]  agent.raft.logstore: verification checksum FAILED: storage corruption rangeStart=1234 rangeEnd=3456 leaderChecksum=0xc1... readChecksum=0x45...

This indicates that the server read back data that is different from what it wrote to disk. This indicates corruption in the storage backend or filesystem.

For convenience, Consul also increments a metric consul.raft.logstore.verifier.read_checksum_failures when this occurs.

Write failures: In-flight Corruption

The following error indicates that the checksum on the follower did not match the leader when the follower received the logs. The error implies that the corruption happened in the network or software and not the log store:

2022-11-15T22:41:23.546Z [ERROR]  agent.raft.logstore: verification checksum FAILED: in-flight corruption rangeStart=1234 rangeEnd=3456 leaderChecksum=0xc1... followerWriteChecksum=0x45...

It is unlikely that this error indicates an issue with the storage backend, but you should take the same steps to resolve and report it.

The consul.raft.logstore.verifier.write_checksum_failures metric increments when this error occurs.

Resolve checksum failures

If either type of corruption is detected, complete the instructions for reverting to BoltDB. If the server already uses BoltDB, the errors likely indicate a latent bug in BoltDB or a bug in the verification code. In both cases, you should follow the revert instructions.

Report all verification failures as a GitHub issue.

In your report, include the following:

Details of your server cluster configuration and hardware
Logs around the failure message
Context for how long they have been running the configuration
Any metrics or description of the workload you have. For example, how many raft commits per second. Also include the performance metrics described on this page.

We recommend setting up an alert on Consul server logs containing verification checksum FAILED or on the consul.raft.logstore.verifier.{read|write}_checksum_failures metrics. The sooner you respond to a corrupt server, the lower the chance of any of the potential risks causing problems in your cluster.

Performance metrics

The key performance metrics to watch are:

consul.raft.commitTime measures the time to commit new writes on a quorum of servers. It should be the same or lower after deploying WAL. Even if WAL is faster for your workload and hardware, it may not be reflected in commitTime until enough followers are using WAL that the leader does not have to wait for two slower followers in a cluster of five to catch up.
consul.raft.rpc.appendEntries.storeLogs measures the time spent persisting logs to disk on each follower. It should be the same or lower for WAL-enabled followers.
consul.raft.replication.appendEntries.rpc measures the time taken for each AppendEntries RPC from the leader's perspective. If this is significantly higher than consul.raft.rpc.appendEntries on the follower, it indicates a known queuing issue in the Raft library and is unrelated to the backend. Followers with WAL enabled should not be slower than the others. You can determine which follower is associated with which metric by running the consul operator raft list-peers command and matching the peer_id label value to the server IDs listed.
consul.raft.compactLogs measures the time take to truncate the logs after a snapshot. WAL-enabled servers should not be slower than BoltDB servers.
consul.raft.leader.dispatchLog measures the time spent persisting logs to disk on the leader. It is only relevant if a WAL-enabled server becomes a leader. It should be the same or lower than before when the leader was using BoltDB.