Experimental WAL LogStore backend overview
This topic provides an overview of the WAL (write-ahead log) LogStore backend. The WAL backend is an experimental feature. Refer to Requirements for supported environments and known issues.
We do not recommend enabling the WAL backend in production without following our guide for safe testing.
WAL versus BoltDB
WAL implements a traditional log with rotating, append-only log files. WAL resolves many issues with the existing LogStore provided by the BoltDB backend. The BoltDB LogStore is a copy-on-write BTree, which is not optimized for append-only, write-heavy workloads.
BoltDB storage scalability issues
The existing BoltDB log store inefficiently stores append-only logs to disk because it was designed as a full key-value database. It is a single file that only ever grows. Deleting the oldest logs, which Consul does regularly when it makes new snapshots of the state, leaves free space in the file. The free space must be tracked in a freelist so that BoltDB can reuse it on future writes. By contrast, a simple segmented log can delete the oldest log files from disk.
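The contrast with a segmented log can be sketched in a few lines of Go. This is a minimal in-memory illustration, not the actual WAL implementation: the segmentedLog type, its methods, and the fixed entries-per-segment limit are all hypothetical names and assumptions chosen for the example. It shows only that truncation removes whole segments, so no free space inside a file needs to be tracked.

```go
package main

import "fmt"

// segment is a hypothetical stand-in for one append-only log file.
type segment struct {
	firstIndex uint64   // index of the first entry stored in this segment
	entries    [][]byte // entries in append order
}

// segmentedLog keeps segments oldest first. Illustrative only.
type segmentedLog struct {
	maxPerSegment int
	segments      []*segment
	nextIndex     uint64
}

func newSegmentedLog(maxPerSegment int) *segmentedLog {
	return &segmentedLog{maxPerSegment: maxPerSegment, nextIndex: 1}
}

// Append writes an entry to the tail segment and rotates to a new
// segment when the tail is full. It returns the index of the entry.
func (l *segmentedLog) Append(data []byte) uint64 {
	n := len(l.segments)
	if n == 0 || len(l.segments[n-1].entries) >= l.maxPerSegment {
		l.segments = append(l.segments, &segment{firstIndex: l.nextIndex})
		n++
	}
	tail := l.segments[n-1]
	tail.entries = append(tail.entries, data)
	idx := l.nextIndex
	l.nextIndex++
	return idx
}

// TruncateBefore deletes every segment whose entries are all older than
// index. Only whole segments are removed, so a partially obsolete segment
// stays in place. It returns the number of segments deleted.
func (l *segmentedLog) TruncateBefore(index uint64) int {
	removed := 0
	for len(l.segments) > 1 && l.segments[1].firstIndex <= index {
		l.segments = l.segments[1:] // on disk this would be a file delete
		removed++
	}
	return removed
}

// SegmentCount reports how many segments currently exist.
func (l *segmentedLog) SegmentCount() int { return len(l.segments) }

func main() {
	l := newSegmentedLog(4)
	for i := 0; i < 10; i++ {
		l.Append([]byte("entry"))
	}
	fmt.Println("segments before truncation:", l.SegmentCount()) // 3
	fmt.Println("segments deleted:", l.TruncateBefore(9))        // 2
	fmt.Println("segments after truncation:", l.SegmentCount())  // 1
}
```

In this sketch, space is reclaimed by dropping the oldest segments outright, which is the property the paragraph above attributes to a segmented log.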
A burst of writes at double or triple the normal volume can suddenly cause the log file to grow to several times its steady-state size. After Consul takes the next snapshot and truncates the oldest logs, the resulting file is mostly empty space.
To track the free space, Consul must write extra metadata to disk with every write. The metadata is proportional to the number of free pages, so write latencies tend to increase after a large burst. In some cases, the latencies cause serious performance degradation to the cluster.
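The effect of a burst on a single-file store can be illustrated with a toy model. The sketch below is not BoltDB code: the pagedFile type, the one-entry-per-page simplification, and the workload numbers are assumptions made for illustration. It models a file that never shrinks and a freelist that grows when old entries are truncated.

```go
package main

import "fmt"

// pagedFile is a toy model of a single-file store that never shrinks.
type pagedFile struct {
	totalPages int   // file size in pages; only ever grows
	freelist   []int // IDs of pages freed by truncation, reusable later
	live       []int // IDs of pages holding log entries, oldest first
}

// Append stores one entry in one page, reusing a free page when one
// exists and growing the file otherwise.
func (f *pagedFile) Append() {
	var id int
	if n := len(f.freelist); n > 0 {
		id = f.freelist[n-1]
		f.freelist = f.freelist[:n-1]
	} else {
		id = f.totalPages
		f.totalPages++
	}
	f.live = append(f.live, id)
}

// TruncateTo keeps the newest keep entries and moves the pages of all
// older entries onto the freelist. The file size does not change.
func (f *pagedFile) TruncateTo(keep int) {
	if len(f.live) <= keep {
		return
	}
	drop := len(f.live) - keep
	f.freelist = append(f.freelist, f.live[:drop]...)
	f.live = append([]int(nil), f.live[drop:]...)
}

func main() {
	f := &pagedFile{}
	for i := 0; i < 100; i++ { // assumed steady-state log size
		f.Append()
	}
	for i := 0; i < 300; i++ { // assumed burst before the next snapshot
		f.Append()
	}
	f.TruncateTo(100) // snapshot taken, oldest logs truncated

	fmt.Println("file pages:", f.totalPages)     // 400
	fmt.Println("live pages:", len(f.live))      // 100
	fmt.Println("free pages:", len(f.freelist))  // 300
}
```

After the truncation in this model, three quarters of the file is free space, and the freelist that describes it has 300 entries. That list is the metadata whose size, according to the text above, scales with the number of free pages.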
To mitigate risks associated with sudden bursts of log data, Consul tries to prevent large numbers of logs from accumulating in the LogStore. Significantly larger BoltDB files are slower to append to because the tree is deeper and the freelist is larger. The larger the file, the more likely it is to have a large freelist or to suddenly form one after a burst of writes. For this reason, many of Consul's default options associated with snapshots, truncating logs, and keeping the log history are aggressively set toward keeping BoltDB small rather than using disk IO efficiently.
Other reliability issues, such as Raft replication capacity issues, are much simpler to solve without the performance concerns caused by storing more logs in BoltDB.
WAL approaches storage issues differently
When directly measured, WAL is more performant than BoltDB because it solves a simpler storage problem. Despite this, some users may not notice a significant performance improvement from the upgrade with the same configuration and workload. In this case, the benefit of WAL is that retaining more logs does not affect write performance. As a result, strategies such as taking snapshots less often to reduce disk IO, or retaining more logs so that slower followers can catch up with cluster state, become possible and increase the reliability of the deployment.
WAL quality assurance
The WAL backend has been tested thoroughly during development:
Every component in the WAL, from metadata management and log file encoding to actual file-system interaction, is abstracted so that unit tests can simulate difficult-to-reproduce disk failures.
We used the application-level intelligent crash explorer (ALICE) to exhaustively simulate thousands of possible crash failure scenarios. WAL correctly recovered from all scenarios.
We ran hundreds of tests in a performance testing cluster with checksum verification enabled and did not detect data loss or corruption. We will continue testing before making WAL the default backend.
We are aware of how complex and critical disk-persistence is for your data.
We hope that many users at different scales will try WAL in their environments after upgrading to Consul v1.15 or later and report success or failure. This feedback will help us confidently replace BoltDB as the default for new clusters in a future release.