Measure failover resilience

This topic describes how to measure the failover resilience for a Terraform Enterprise deployment connected to a PostgreSQL database cluster.

Overview

You can connect Terraform Enterprise to a database cluster so that the application can fail over to another database instance if there is an issue with the primary instance. Refer to the following topics for additional information:

You can test the resilience of your failover system by develping and running test workloads against the database cluster and measuring the recovery time objective (RTO).

Note that RTO varies significantly based on organizational priorities and the complexity of the services involved.

Define test workloads

You can continuously execute a workload against Terraform Enterprise and measure the time since the last successful execution of the workload. With this continuous measurement running, you can then trigger a failover on the Postgres cluster and record the outage duration.

The following sequence of steps describe an example workload:

Reset workspace to cleanup any blocking run.
Create and upload a configuration version.
Create a run.
Wait for the plan to finish.
Discard the plan.

Define a protocol

Determine what success and failure means in terms of measuring RTO for your instances. The following criteria represent an example protocol:

Execute the workloads every fifteen seconds.
1. If the workload does not report success within 10 seconds, the instance is unhealthy.
2. The instance is healthy whenfive consecutive runs complete successfully.
3. The instance is non-operational if any run fails.
Trigger a failover.
Wait until the system becomes operational.
1. If the workload does not report success within 10 seconds, the instance is unhealthy.
2. The instance is healthy whenfive consecutive runs complete successfully.
3. The instance is non-operational if any run fails.
Complete 10 iterations.

Trigger a failover

Create a separate organization and workspace to prevent modifying the initial dataset and to enable you to repeat tests. Determine a mechanism for triggering a failover. For example, if Terraform Enterprise is connected to a database cluster hosted on AWS, you can use the relational database service (RDS) to trigger a failover in the AWS console:

$ aws rds failover-db-cluster --db-cluster-identifier <CLUSTER_NAME> --region us-west-2`

Compute metrics

Compute the RTO by logging the duration between the first failed run and the first of five consecutively successful runs. You can measure RTO using go-tfe client.

Patroni example

The following table contains example data collected by running test workloads against a Terraform Enterprise deployment connected to a PostgreSQL cluster running on Patroni:

Failover	RTO	First failed run	First of five consecutive successful runs
1	0:03:42	17:16:06.275	17:19:47.832
2	-	-	Terraform Enterprise returned the operation within one minute, but runs continued to fail. Restarting all nodes resolved the issue.
3	0:04:56	17:34:10.940	17:39:07.467
4	0:02:18	18:01:50.913	18:04:08.902
5	-	18:07:30.912	Terraform Enterprise returned the operation within one minute, but runs continued to fail. Restarting all nodes resolved the issue.

Aurora example

The following table contains example data collected by running test workloads against a Terraform Enterprise deployment connected to a PostgreSQL cluster running on Aurora:

Failover	RTO	Failover start	First failed run	First of five consecutive successful runs
1	53.6s	10:13	10:13:37.539	10:14:31.186
2	61.3s	10:17	10:17:14.430	10:18:15.763
3	infinite	10:21	10:21:30.875	Terraform Enterprise is partially operational, but runs randomly fail. After restarting the nodes, the application is fully operational.
4	< 25s	11:36	No run failed. Failover succeeded in less than the measurement interval of 25s	NA
5	55.5s	11:43	11:43:12.188	11:44:07.725
6	57.5s	11:47	11:47:44.293	11:48:41.828
7	42.7s	11:51	11:51:43.751	11:52:26.485
8	infinite	12:27	12:27:16.227	Terraform Enterprise became unoperational. All nodes went down and all runs failed. Vault sealed on all three nodes. Either restart the nodes or restart the Vault process inside the nodes.
9	infinite	13:27	13:28:00.917	Terraform Enterprise is operational, but all runs failed. Restarting all nodes resolved the issue.
10	58.6s	13:50	13:50:37.778	13:51:36.330