Mastering Zero-Downtime Deployments with Terraform

Imagine you're running a live website that serves thousands of people globally. Then comes the inevitable: you must deploy an update. Without proper preparation, this can mean downtime, disgruntled users, and possibly a few furious tweets. But don't worry; Terraform has your back.

In this post, we'll look at the magic of zero-downtime deployments. You'll learn how to roll out updates to your infrastructure smoothly, without missing a beat.

What Is Zero-Downtime Deployment?

Zero-downtime deployment means deploying updates to your infrastructure while keeping your services available to users. It's all about preventing disruption, which is essential for businesses that value customer experience.

In Terraform, there are several ways to achieve zero-downtime deployments:

  1. Rolling Deployments
    Incrementally replace instances to avoid disrupting services. AWS Instance Refresh makes this straightforward.

  2. Blue-Green Deployments
    Deploy updates to a new environment (green) while the existing one (blue) continues serving traffic. Once verified, traffic shifts to the new environment.

  3. Canary Deployments
    Gradually introduce updates to a subset of users. Monitor performance before rolling it out to everyone.

Here’s how you can leverage Terraform for zero-downtime deployment:

AWS Instance Refresh: The Native Approach

If you're deploying on AWS, Terraform makes it simple to use the platform's native Instance Refresh capability, a feature of Auto Scaling Groups that is ideal for rolling updates. It replaces EC2 instances incrementally while maintaining capacity and availability.

With Instance Refresh, AWS automatically replaces instances while maintaining the desired capacity throughout the process. During an update, the ASG keeps the required number of healthy instances in service, lowering the likelihood of disruption. You can tune settings such as the minimum percentage of healthy instances to maintain during the refresh to avoid loss of availability.

Step 1: Define the Launch Template

Customize your EC2 instances with user data and other configurations.

resource "aws_launch_template" "launch_template" {
  name = "my_launch_template"

  instance_type = "t2.micro"

  # user script to output the instance's hostname
  user_data = base64encode(templatefile("${path.module}/../../templates/user_data.sh", {
  server_text = "${var.server_text}"
}))
}
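
The template itself isn't shown in this post, so as a rough sketch, templates/user_data.sh might look like the following; ${server_text} is filled in by Terraform's templatefile() before the instance boots, while $(hostname) runs on the instance itself:

#!/bin/bash
# Illustrative sketch only -- serve a page containing the rendered text and the hostname
echo "${server_text} - served from $(hostname)" > index.html
nohup busybox httpd -f -p 8080 &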

Step 2: Configure the Auto Scaling Group with Instance Refresh

The instance_refresh block allows rolling updates with minimal disruption.

resource "aws_autoscaling_group" "web_asg" {
  name                      = "WebServerASG"
  max_size                  = 2
  min_size                  = 1
  desired_capacity          = 1
  vpc_zone_identifier       = var.public_subnets

  launch_template {
    id      = aws_launch_template.web_template.id
    version = aws_launch_template.web_template.latest_version
  }

  instance_refresh {
    strategy = "Rolling"

    preferences {
      min_healthy_percentage = 90
    }
  }

  target_group_arns = [aws_lb_target_group.web_tg.arn]

Here’s what happens:

  1. Terraform updates the launch template.

  2. The ASG uses Instance Refresh to replace instances in batches.

  3. At least 90% of instances remain healthy during the process.
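
The target group the ASG registers with (aws_lb_target_group.web_tg) isn't defined in this post; a minimal sketch, assuming a vpc_id variable and the port-8080 server from the user data script:

resource "aws_lb_target_group" "web_tg" {
  name     = "web-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id # assumed input

  health_check {
    path                = "/"
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

Once terraform apply bumps the launch template version, you can watch the rollout with aws autoscaling describe-instance-refreshes --auto-scaling-group-name WebServerASG.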

Blue-Green Deployments: Independent Environments

A blue-green deployment creates an entirely new green environment while the old blue environment continues to handle traffic. Once the green environment has passed testing, traffic is shifted over to it.

This method suits production workloads that cannot tolerate even minor disruptions, and it allows easy rollback by simply returning traffic to the blue environment.

In this implementation, we provision two separate environments, each with its own Auto Scaling Group, Launch Template, and Target Group. A Load Balancer manages traffic routing.

Directory Structure

blue-green-deployment/
├── modules/
│   └── autoscaling/
├── blue/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
├── green/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── shared/
    ├── load_balancer.tf
    ├── variables.tf
    └── outputs.tf

Each environment has its own stack (blue/main.tf and green/main.tf). For example, blue/main.tf:

module "blue_asg" {
  source         = "../modules/autoscaling"
  ami            = var.blue_ami
  server_text    = "Blue Environment"
  environment    = "blue"
  public_subnets = var.public_subnets
}

output "blue_target_group_arn" {
  value = module.blue_asg.target_group_arn
}

Both environments are entirely isolated and managed separately. The ami and other inputs can differ for testing updates.
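
The green stack mirrors the blue one. A sketch of green/main.tf, assuming the same module interface:

module "green_asg" {
  source         = "../modules/autoscaling"
  ami            = var.green_ami # typically the updated AMI under test
  server_text    = "Green Environment"
  environment    = "green"
  public_subnets = var.public_subnets
}

output "green_target_group_arn" {
  value = module.green_asg.target_group_arn
}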

Load Balancer

The shared configuration in shared/load_balancer.tf manages traffic and directs it to the active environment.

shared/load_balancer.tf

resource "aws_lb_listener_rule" "blue_green_switch" {
  listener_arn = aws_lb_listener.alb.arn

  condition {
    field  = "host-header"
    values = ["*"]
  }

  action {
    type             = "forward"
    target_group_arn = var.blue_green_enabled == "blue" 
                        ? module.blue_asg.target_group_arn 
                        : module.green_asg.target_group_arn
  }
}

variable "blue_green_enabled" {
  description = "Specify the active environment (blue or green)"
  type        = string
  default     = "blue"
}

Here's how this works (the ALB and listener referenced above are sketched after this list):

  1. Dynamic Selection of the Target Group ARN:

    • The action block dynamically picks the target group based on the value of the variable blue_green_enabled.

    • If the variable is set to "blue", traffic is routed to the blue environment target group.

    • If it's set to "green", traffic is routed to the green environment target group.

  2. blue_green_enabled Variable:

    • This variable is used to toggle between the two environments.
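
The rule assumes an ALB and listener (aws_lb_listener.alb) defined elsewhere in the shared stack. A minimal sketch, with the names and subnet input assumed:

resource "aws_lb" "alb" {
  name               = "blue-green-alb"
  load_balancer_type = "application"
  subnets            = var.public_subnets
}

resource "aws_lb_listener" "alb" {
  load_balancer_arn = aws_lb.alb.arn
  port              = 80
  protocol          = "HTTP"

  # Fallback action; the listener rule above handles the actual blue/green routing
  default_action {
    type             = "forward"
    target_group_arn = module.blue_asg.target_group_arn
  }
}

Switching traffic then becomes a one-variable change: terraform apply -var="blue_green_enabled=green" cuts traffic over to green, and re-running with "blue" rolls it back.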

Canary Deployments: Gradual Rollouts

Canary deployments gradually release updates to a limited group of users before making them available to everyone. This approach reduces risk by catching issues early.

Example: Canary Deployment with Terraform

Implementing canary deployments involves using weighted target groups in an Application Load Balancer (ALB).

resource "aws_lb_listener_rule" "canary_rule" {
  listener_arn = aws_lb_listener.alb.arn

  condition {
    path_pattern {
      values = ["/*"]
    }
  }

  action {
    type = "forward"

    forward {
      target_group {
        arn    = aws_lb_target_group.canary_target_group.arn
        weight = 10
      }

      target_group {
        arn    = aws_lb_target_group.main_target_group.arn
        weight = 90
      }
    }
  }
}

In this example, 10% of traffic is routed to the canary environment, while the remaining 90% stays on the stable release.
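
To promote the canary gradually, the weights can be driven by a variable instead of being hard-coded. A sketch assuming a canary_weight input:

variable "canary_weight" {
  description = "Percentage of traffic routed to the canary target group (0-100)"
  type        = number
  default     = 10
}

# Inside the action block of the listener rule above:
forward {
  target_group {
    arn    = aws_lb_target_group.canary_target_group.arn
    weight = var.canary_weight
  }

  target_group {
    arn    = aws_lb_target_group.main_target_group.arn
    weight = 100 - var.canary_weight
  }
}

Ramping up is then just terraform apply -var="canary_weight=50", and setting the weight back to 0 effectively rolls the canary back.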

Deployment Strategies Comparison: Blue-Green, Canary, and Rolling

Blue-Green deployment maintains two identical environments (Blue and Green) and switches traffic from Blue to Green once the new version has been verified. It offers near-zero downtime and simple rollback, but demands roughly double the resources.

Canary deployment gradually delivers changes to a small subset of users before a complete rollout. It reduces risk, though users temporarily see different versions, and the rollout requires careful monitoring and weighted traffic management.

Rolling deployment updates instances gradually, avoiding downtime, but rollback is slower and old and new versions run side by side during the rollout.

Conclusion: When to Use Which Deployment?

  • Blue-Green: Choose Blue-Green when you need near-zero downtime and can afford the cost of running two environments. It suits critical applications that require clean rollback and isolation during testing.

  • Canary: Choose Canary when you want to test new features or changes in production with minimal risk. It’s ideal for large-scale applications where you want to test updates incrementally but still ensure high availability.

  • Rolling: Use Rolling deployments for gradual updates without requiring duplicate environments. It’s ideal for infrastructure that’s already spread across multiple instances and where a slow but controlled rollout is acceptable.