Introduction
Terraform remains the dominant Infrastructure as Code tool in 2026, but the gap between "it works on my machine" and production-grade infrastructure is wider than most engineers realize. I've seen teams ship Terraform for months only to discover their state file is a ticking time bomb, their modules are copy-paste disasters, and their CI/CD pipeline has a single point of failure.
This guide distills patterns I've battle-tested across startups and enterprise environments. We'll cover state management, module architecture, workspace strategies, drift detection, and CI/CD integration — all with working code snippets you can adapt today.
State Management: Your Single Source of Truth
The Terraform state file is the most critical artifact in your infrastructure pipeline. Lose it, and you've lost the mapping between your code and real resources.
Always Use Remote State Backends
Never commit state files to git. Never store them on a developer's laptop. Use a remote backend with locking:
terraform {
  backend "s3" {
    bucket         = "company-terraform-state-prod"
    key            = "core/vpc/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}
Key decisions:
- One bucket per environment. Never mix dev/staging/prod state in the same bucket.
- Enable versioning on the S3 bucket. It's saved my team three times.
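- Use DynamoDB for locking. It prevents two engineers from running terraform apply simultaneously and corrupting state. On Terraform 1.10 and later, the S3 backend can also lock natively with use_lockfile = true, and newer releases deprecate dynamodb_table, so check the backend documentation for the version you run.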
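The decisions above can be sketched in Terraform itself. This is a minimal sketch, not a definitive setup: the bucket and table names are taken from the backend block, the resource labels (tf_state, tf_locks) are illustrative, and the choice of KMS encryption is an assumption.

```hcl
# State bucket (name taken from the backend block; resource labels are illustrative).
resource "aws_s3_bucket" "tf_state" {
  bucket = "company-terraform-state-prod"

  lifecycle {
    # Guard against an accidental `terraform destroy` of the state bucket.
    prevent_destroy = true
  }
}

# Versioning makes it possible to roll back to an earlier state file.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Assumption: default encryption with the AWS-managed KMS key.
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State files can contain secrets, so block every form of public access.
resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table: the S3 backend expects a string hash key named "LockID".
resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

These resources usually live in a small bootstrap configuration that is applied once, since the backend cannot store its state in a bucket that does not exist yet.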
State File Segmentation
Don't dump everything into one massive state file. Split by blast radius:
infrastructure/
├── core/
│   ├── vpc/
│   ├── iam/
│   └── route53/
├── data/
│   ├── rds/
│   └── elasticache/
├── compute/
│   ├── ecs/
│   └── lambda/
└── monitoring/
    └── cloudwatch/
Each directory has its own state file. If you break the RDS configuration, you won't accidentally nuke the VPC.
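Split state still needs a way to share values between stacks. One common option is the terraform_remote_state data source. The sketch below assumes the VPC stack exports an output named private_subnet_ids, which is a hypothetical name, and reuses the bucket and key from the backend example.

```hcl
# In data/rds/: read outputs from the core/vpc state (read-only access).
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "company-terraform-state-prod"
    key    = "core/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# Hypothetical consumer: place the database in the VPC's private subnets.
resource "aws_db_subnet_group" "main" {
  name = "rds-main"

  # private_subnet_ids is an assumed output of the VPC stack.
  subnet_ids = data.terraform_remote_state.vpc.outputs.private_subnet_ids
}
```

The RDS stack can read the VPC's outputs but cannot modify its resources, which preserves the blast-radius boundary.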
Module Structure That Doesn't Suck
Most Terraform modules start clean and end up as spaghetti. Here's the structure I enforce:
modules/webservice/
├── main.tf # Resource definitions
├── variables.tf # Input variables with types and validation
├── outputs.tf # Outputs for consumers
├── versions.tf # Provider and Terraform version constraints
├── locals.tf # Derived values, naming conventions
└── README.md # Usage examples, required inputs
Variable Validation Is Non-Negotiable
Stop relying on documentation to tell users what values are valid. Enforce it in code:
variable "instance_type" {
  type        = string
  description = "EC2 instance type for web servers"
  validation {
    # t3 is x86; c6g and m6g are Graviton (arm64). Narrow the list if you need a single architecture.
    condition     = can(regex("^(t3|c6g|m6g)\\.", var.instance_type))
    error_message = "instance_type must be in the t3, c6g, or m6g family."
  }
}
variable "environment" {
  type        = string
  description = "Deployment environment"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be dev, staging, or prod."
  }
}
This catches misconfigurations at plan time, not at 3 AM when production is down.
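The same approach works for structured values. As a minimal sketch, the vpc_cidr variable used elsewhere in this guide could be validated with the built-in cidrhost function, which raises an error on malformed input. The description and error message here are illustrative.

```hcl
variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the VPC"

  validation {
    # cidrhost() errors on malformed input, and can() turns that error into false.
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "vpc_cidr must be a valid CIDR block, for example 10.0.0.0/16."
  }
}
```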
Use Locals for Derived Values
Don't repeat naming logic across resources:
locals {
  name_prefix = "${var.project}-${var.environment}"
  common_tags = {
    Project     = var.project
    Environment = var.environment
    ManagedBy   = "terraform"
    CostCenter  = var.cost_center
  }
}
resource "aws_vpc" "main" {
  cidr_block = var.vpc_cidr
  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-vpc"
  })
}
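An alternative to calling merge on every resource is the AWS provider's default_tags block, which applies the shared tags to every taggable resource the provider manages. This is a sketch of that variant, not the approach shown above. It assumes the same local.common_tags and local.name_prefix definitions, and the region is illustrative.

```hcl
# Assumes local.common_tags and local.name_prefix are defined as in the locals example.
provider "aws" {
  region = "us-east-1" # illustrative

  default_tags {
    tags = local.common_tags
  }
}

resource "aws_vpc" "main" {
  cidr_block = var.vpc_cidr

  # Only the resource-specific tag is set here; the common tags come from the provider.
  tags = {
    Name = "${local.name_prefix}-vpc"
  }
}
```

Provider blocks belong in the root configuration, so this variant fits environment directories better than shared modules.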
Workspace Strategy: When to Use What
Terraform workspaces are controversial. Here's my pragmatic take:
Use Directory-Based Separation for Persistent Environments
For dev/staging/prod, use separate directories with shared modules:
environments/
├── dev/
│   ├── main.tf
│   └── terraform.tfvars
├── staging/
│   ├── main.tf
│   └── terraform.tfvars
└── prod/
    ├── main.tf
    └── terraform.tfvars
This gives you full isolation and different backend configs per environment.
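A minimal sketch of what environments/dev/main.tf might contain under this layout. The dev bucket name and state key are hypothetical, following the naming used for prod, and the module is assumed to need only the two inputs shown.

```hcl
# environments/dev/main.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state-dev" # hypothetical, one bucket per environment
    key            = "webservice/terraform.tfstate" # hypothetical key
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}

# Every environment calls the same module and differs only in its inputs.
module "webservice" {
  source = "../../modules/webservice"

  environment   = "dev"
  instance_type = "t3.small"
}
```

Staging and prod would use the same module block with their own backend and input values, so differences between environments stay visible in one small file.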
Use Workspaces (terraform workspace) for Ephemeral Environments
For PR previews or feature branch environments:
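terraform workspace new pr-1234
terraform apply -var="environment=pr-1234"
# ... test it ...
# Destroy the resources first; a workspace that still tracks resources cannot be deleted without -force.
terraform destroy -var="environment=pr-1234"
terraform workspace select default
terraform workspace delete pr-1234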
Workspaces share the same backend configuration, making them lighter weight for temporary infrastructure.
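Inside the configuration, the selected workspace is available as terraform.workspace, so names can be derived from it instead of being passed with -var. A minimal sketch, where the project variable and the SQS queue are hypothetical stand-ins for real resources:

```hcl
variable "project" {
  type        = string
  description = "Project name used in resource names"
}

locals {
  # terraform.workspace is the name of the selected workspace,
  # for example "pr-1234" after `terraform workspace new pr-1234`.
  env_name    = terraform.workspace
  name_prefix = "${var.project}-${local.env_name}"
}

# Hypothetical resource: the workspace name keeps names unique per preview environment.
resource "aws_sqs_queue" "jobs" {
  name = "${local.name_prefix}-jobs"
}
```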
Drift Detection and Remediation
Infrastructure drifts. Someone makes a manual change in the AWS console, and suddenly your Terraform state is lying to you.
Scheduled Drift Checks
Run this in CI nightly:
terraform plan -detailed-exitcode
# Exit codes:
# 0 = no changes (clean)
# 1 = error
# 2 = changes detected (DRIFT!)
Pipe it through a notification:
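# .github/workflows/drift-check.yml
# Assumes cloud credentials for the prod backend are configured for this job (not shown).
name: Infrastructure Drift Detection
on:
  schedule:
    - cron: '0 6 * * *' # 06:00 UTC daily
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          # Disable the wrapper so the step sees Terraform's raw exit code.
          terraform_wrapper: false
      - name: Check for drift
        id: drift
        working-directory: environments/prod
        run: |
          terraform init -input=false
          # Record the plan's exit code so drift (2) and errors (1) can be told apart.
          set +e
          terraform plan -detailed-exitcode -input=false
          code=$?
          echo "exitcode=$code" >> "$GITHUB_OUTPUT"
          # Exit code 1 is a real error and should still fail the job.
          if [ "$code" -eq 1 ]; then exit 1; fi
      - name: Alert on drift
        if: steps.drift.outputs.exitcode == '2'
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }} # assumes a repository secret with this name
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d '{"text":"🚨 Terraform drift detected in production!"}'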
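Not every difference the nightly plan reports is unwanted. For attributes that another system legitimately changes at runtime, one remediation option is to tell Terraform to stop tracking them with ignore_changes. A hedged sketch, assuming an ECS service whose desired count is managed by autoscaling. The variable names and the service itself are hypothetical.

```hcl
variable "ecs_cluster_arn" {
  type        = string
  description = "ARN of the ECS cluster (hypothetical input)"
}

variable "task_definition_arn" {
  type        = string
  description = "ARN of the task definition (hypothetical input)"
}

resource "aws_ecs_service" "web" {
  name            = "web"
  cluster         = var.ecs_cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 2

  lifecycle {
    # Autoscaling adjusts desired_count at runtime; ignoring it keeps
    # the scheduled plan from reporting that change as drift.
    ignore_changes = [desired_count]
  }
}
```

Use this sparingly: every ignored attribute is one that Terraform will no longer correct, so it should be reserved for values that are intentionally owned by something else.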
CI/CD Integration Pattern
Here's the pipeline I use for every Terraform project:
name: Terraform CI/CD
on:
  pull_request:
    paths:
      - 'environments/**'
      - 'modules/**'
  # Also run on pushes to main; on pull_request events github.ref is never refs/heads/main.
  push:
    branches:
      - main
    paths:
      - 'environments/**'
      - 'modules/**'
jobs:
  terraform:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.9.0"
      - name: Format Check
        run: terraform fmt -check -recursive
      - name: Validate
        run: |
          for dir in environments/*/; do
            echo "Validating $dir"
            # The subshell keeps the working directory unchanged between iterations.
            (cd "$dir" && terraform init -backend=false && terraform validate)
          done
      - name: Plan (Staging)
        if: github.ref == 'refs/heads/main'
        working-directory: environments/staging
        run: |
          terraform init -input=false
          terraform plan -input=false -out=tfplan
      - name: Apply (Staging)
        if: github.ref == 'refs/heads/main'
        working-directory: environments/staging
        run: terraform apply -input=false tfplan
Key patterns here:
- fmt check on every PR. Consistency is free with automation.
- validate without backend for speed. Just check syntax and references.
- plan on merge to main, apply manually or auto-apply staging.
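The id-token: write permission in the workflow only matters if the job exchanges its GitHub OIDC token for cloud credentials. A minimal sketch of the AWS side, assuming the GitHub OIDC provider is already registered in the account. The role name and the repository example-org/infrastructure are hypothetical.

```hcl
# Assumes the GitHub OIDC provider already exists in this AWS account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Hypothetical repository; restrict the role to workflows from this repo only.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infrastructure:*"]
    }
  }
}

# Role the pipeline assumes; attach only the permissions Terraform needs.
resource "aws_iam_role" "terraform_ci" {
  name               = "terraform-ci"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```

With a role like this, the pipeline needs no long-lived access keys stored as repository secrets.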
Terraform Version Pinning
Always pin your Terraform and provider versions:
terraform {
  required_version = "~> 1.9"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.6"
    }
  }
}
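The ~> operator lets only the rightmost version component increase: ~> 1.9 accepts any 1.x release from 1.9 onward, and ~> 5.70 accepts 5.70 and later 5.x releases. Minor and patch updates are therefore allowed, while major version bumps are blocked. To allow patch updates only, add a third component, as in ~> 5.70.0. I've been burned by provider upgrades that renamed resources behind my back.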
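How far a constraint can float depends on how many version components it has. The sketch below shows a stricter variant of the block above, as an illustration of the syntax rather than a recommendation. The explicit range and the three-component form are both standard constraint syntax.

```hcl
terraform {
  # Same range as "~> 1.9", written out explicitly.
  required_version = ">= 1.9.0, < 2.0.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Three components: only the patch level may increase (>= 5.70.0, < 5.71.0).
      version = "~> 5.70.0"
    }
  }
}
```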
Conclusion
Production Terraform is about discipline, not features. The patterns that matter most:
- Remote state with locking — non-negotiable
- Split state files by blast radius — protect yourself from yourself
- Variable validation — catch errors at plan time
- Directory-based environment separation — full isolation for persistent environments
- Automated drift detection — because manual changes happen
- CI/CD with fmt + validate + plan — make quality the default path
Start with these patterns and your future self (and your on-call rotation) will thank you. The tools exist to make this easy — you just need to use them before the incident, not after.