PostgreSQL database failover

This topic provides an overview of the PostgreSQL database failover process and associated concepts to help you understand how failover works in Terraform Enterprise. This information is only relevant when operating Terraform Enterprise in active-active or external mode and when Terraform Enterprise is connected to an external database cluster. Refer to Database connection overview for additional information.

Introduction

Configuring your Terraform Enterprise deployment to connect to a PostgreSQL database replica ensures continuous database availability when Terraform Enterprise is unable to reach the primary database. To enable failover, configure Terraform Enterprise to reconnect when a failover event occurs. Additionally, you must configure your monitoring tool or PostgreSQL software to detect failures that trigger failover operations.

Refer to Connect to a PostgreSQL cluster for instructions.

For additional guidance, refer to the following example scenarios:

Failover event stages

The following stages describe the lifecycle of a failover event.

Failure detection

Continuous health checks monitor the primary database node for signs of failure, such as connectivity issues or system crashes. When a health check fails, the monitoring tool reports that the primary node is unavailable. This involves checking predefined health metrics and thresholds to confirm the node’s failure status.

Failover operations

You should deploy an external monitoring tool that is able to promote the most up-to-date standby node to the primary node during a failover event. The tool should also enable the node to handle read and write operations.

This step may involve manually updating load balancers or connection strings used by applications. Terraform Enterprise notifies applications and clients connected to the database the change either through automatic reconfiguration or by using tools that support failover transitions.

Resolution

After addressing the cause of the failure, you can restore the failed primary node and then demote the standby node. Update the primary node with data it was unable to store while it was unavailable and direct it to rejoin the cluster.

Failover metrics

Recovery point objective (RPO) and recovery time objective (RTO) are metrics that define the resilience of your system. These metrics help you understand and plan your disaster recovery and data protection strategies.

RPO

RPO refers to the maximum acceptable amount of data loss measured in time. It describes how much data you can afford to lose in the event of a failure. For example, an RPO of 10 minutes means that, in the worst case, up to 10 minutes of data could be lost due to a failure.

Achieving a low RPO requires frequent data replication and backups, which ensures that data is consistently synchronized across primary and replica nodes. In PostgreSQL, the replication method and the frequency of backups directly affect the RPO.

The synchronous replication method results in lower RPO because it ensures that the database commits transactions to both the primary and replica nodes simultaneously. But using synchronous replication may impact performance. Asynchronous replication can provide better database performance but result in a higher RPO. Note that RPO is a function of the database configuration and unrelated to Terraform Enterprise performance.

RTO

RTO refers to the maximum acceptable amount of time to restore normal operations after a failure. It describes how quickly you recover from a failure. An RTO of 15 minutes, for instance, indicates that the system must be fully operational within 15 minutes of a failure.

In a highly-available (HA) PostgreSQL architecture, automated failover processes, efficient database recovery mechanisms, and robust monitoring tools that quickly detect and address issues can help you achieve a low RTO. The complexity of the HA topology, the effectiveness of failover mechanisms, and the speed at which data can be restored all contribute to the RTO.

High availability

In order to implement a failover strategy, your Terraform Enterprise deployment must connect to a high-availability or highly-available (HA) PostgreSQL cluster. An HA PostgreSQL cluster ensures that the database remains operational and accessible during failures. For technical details, refer to the PostgreSQL documentation on high-availability

Architecture of a PostgreSQL cluster

A common HA cluster architecture has the following components:

Primary node: The database server where write operations occur. It handles both read and write requests but focuses primarily on write transactions.
Standby nodes: These are replicas of the primary node. They receive updates from the primary node and can serve read-only queries to balance the load.
Load balancer: This component distributes incoming connections across the primary and standby nodes. The load balancer ensures that no single node becomes a bottleneck and to manage failovers seamlessly.
Failover manager: A system or tool that monitors the health of nodes and orchestrates the failover process if the primary node fails.

Replication

Replication refers to copying data from the primary node to standby nodes to ensure data redundancy and availability. Streaming replication refers to when the primary node sends a continuous stream of write-ahead logs (WAL) to standby nodes. The standby nodes then apply these logs to stay synchronized with the primary database. You can specify one of the following replication types in the synchronous_commit setting of your PostgreSQL configuration:

asynchronous: PostgreSQL commits transactions on the primary node without waiting for the standby nodes to acknowledge. This offers better performance but increases the risk of data loss.
synchronous: PostgreSQL does not commit transactions until at least one standby node has acknowledged receipt. This reduces the risk of data loss but can impact write performance due to the added latency.

Replication lag in asynchronous mode refers to the delay between when PostgreSQL commits a transaction on the primary node and when the commit is replicated to the standby nodes. This lag can impact data consistency during failover. Minimizing replication lag is crucial for ensuring that standby nodes are as up-to-date as possible with the primary node’s state.

Next steps

Refer to the following topics for guidance on implementing failover for your Terraform Enterprise deployment: