Advanced Terraform Interview Questions

Advanced Terraform interviews move past syntax and probe how you operate Terraform at scale: keeping remote state safe under concurrency, refactoring without destroying live resources, splitting monolithic state, and wiring plan/apply into CI/CD with guardrails. The questions below mirror what senior platform and SRE candidates actually face, with precise, modern answers (Terraform 1.5+ and OpenTofu where it matters). Read them as talking points, then practice saying each answer in your own words.

State and locking

Why do you need remote state, and how does locking fit in?

Remote state stores terraform.tfstate in a shared backend (S3, GCS, Azure Blob, or HCP Terraform) so a team and CI share one source of truth instead of passing a local file around. Locking layers on top: it grants one writer exclusive access during the read-plan-write cycle so two concurrent applies can’t clobber each other and orphan resources.

terraform {
  backend "s3" {
    bucket       = "acme-tf-state"
    key          = "prod/network/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # native S3 locking, TF/OpenTofu 1.10+
  }
}

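Before 1.10, locking on the S3 backend required a separate DynamoDB table with a LockID partition key, referenced through dynamodb_table. With use_lockfile = true the backend writes a .tflock object next to the state file instead, so that extra resource is no longer needed.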
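For teams still on a release older than 1.10, the DynamoDB-based setup might look like the following sketch. The table name acme-tf-locks is a hypothetical placeholder, and the table is assumed to live in a separate bootstrap configuration rather than in the configuration whose backend depends on it.

```hcl
# Legacy locking: the backend points at a DynamoDB table by name.
terraform {
  backend "s3" {
    bucket         = "acme-tf-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "acme-tf-locks" # hypothetical table name
  }
}

# The lock table itself (assumed to be managed in a separate bootstrap config).
resource "aws_dynamodb_table" "tf_locks" {
  name         = "acme-tf-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend expects this exact key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```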

A lock got stuck after a crashed CI job. How do you recover?

Terraform prints the lock ID when a run is blocked. Confirm no apply is actually running, then release it with force-unlock.

terraform force-unlock 9f3c1e44-1a2b-4c8d-9e10-7f0a2b3c4d5e

Output:

Terraform state has been successfully unlocked!

Never force-unlock while a real apply is in flight — you reintroduce the exact corruption locking prevents.

Refactoring and imports

count vs for_each — when do you pick each?

Use count only for a flat number of identical instances where order never changes. Use for_each whenever items have stable identities, because the address is keyed by a map/set key rather than a positional index. Removing the middle element of a count list re-indexes everything after it, forcing needless destroy/create churn.

Aspect	`count`	`for_each`
Address	`aws_instance.web[0]`	`aws_instance.web["api"]`
Keyed by	integer index	map key / set value
Mid-list removal	re-indexes (destroys others)	only that key is removed
Conditional create	`count = var.enabled ? 1 : 0`	`for_each = var.enabled ? toset(["x"]) : toset([])`

resource "aws_instance" "web" {
  for_each      = toset(["api", "worker", "cron"])
  ami           = data.aws_ami.al2023.id
  instance_type = "t3.micro"
  tags          = { Name = "web-${each.key}" }
}
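For contrast, here is a sketch of the same three instances built with count, to make the re-indexing problem concrete. The variable names and the resource name web_by_index are hypothetical and exist only for this illustration.

```hcl
variable "names" {
  type    = list(string)
  default = ["api", "worker", "cron"]
}

resource "aws_instance" "web_by_index" {
  count         = length(var.names)
  ami           = data.aws_ami.al2023.id
  instance_type = "t3.micro"
  tags          = { Name = "web-${var.names[count.index]}" }
}

# Addresses: web_by_index[0] = api, [1] = worker, [2] = cron.
# Dropping "worker" from the list makes "cron" slide from [2] to [1]:
# Terraform plans to rewrite [1] with cron's settings and destroy [2],
# even though the cron instance itself was never meant to change.
```

With for_each, removing "worker" from the set would touch only aws_instance.web["worker"].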

How do you rename or restructure resources without recreating them?

Use a moved block (Terraform 1.1+). It records a refactor in code so Terraform updates state addresses on the next plan instead of destroying and recreating. This replaces most uses of the error-prone manual terraform state mv.

moved {
  from = aws_instance.web
  to   = aws_instance.web["api"]
}

Output:

  # aws_instance.web has moved to aws_instance.web["api"]

Plan: 0 to add, 0 to change, 0 to destroy.
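The same mechanism covers moving a resource into a child module. A minimal sketch, assuming a hypothetical aws_security_group.app that used to be declared in the root module and is now declared inside a hypothetical ./modules/network:

```hcl
# Root module, after extracting the security group into a child module.
module "network" {
  source = "./modules/network" # hypothetical local module path
}

# Tells Terraform the existing object now lives at the module address.
moved {
  from = aws_security_group.app
  to   = module.network.aws_security_group.app
}
```

Once every state that contained the old address has been applied, the moved block can be removed. Keeping it longer is harmless.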

How do you bring existing infrastructure under Terraform?

For one-off adoption use terraform import. For repeatable, reviewable adoption use import blocks (1.5+), which run inside plan/apply and can be paired with terraform plan -generate-config-out to scaffold the resource.

import {
  to = aws_s3_bucket.logs
  id = "acme-prod-logs"
}

terraform plan -generate-config-out=generated.tf
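When several similar resources need adopting at once, Terraform 1.7 and later accept for_each on import blocks. The sketch below assumes a hypothetical map of bucket names and a resource named aws_s3_bucket.this. None of these names come from a real environment.

```hcl
locals {
  # Hypothetical buckets that already exist outside Terraform.
  buckets = {
    logs  = "acme-prod-logs"
    audit = "acme-prod-audit"
  }
}

# One import per map entry; each.key selects the resource instance.
import {
  for_each = local.buckets
  to       = aws_s3_bucket.this[each.key]
  id       = each.value
}

resource "aws_s3_bucket" "this" {
  for_each = local.buckets
  bucket   = each.value
}
```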

Drift, scale, and providers

What is drift and how do you detect it?

Drift is when real infrastructure diverges from state — someone edited a security group in the console, for example. Terraform detects it during the refresh phase of a plan. In CI, run a scheduled detection job with -detailed-exitcode: 0 means no changes, 2 means drift (or pending changes), 1 is an error.

terraform plan -refresh-only -detailed-exitcode

A single state has grown to thousands of resources and plans are slow. What do you do?

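Split the monolith into smaller states aligned to blast radius and change frequency (networking, data, compute). Each state gets its own backend key. Wire them together with terraform_remote_state data sources or, better, by publishing only the needed values to a dedicated store (for example SSM Parameter Store) or looking them up with provider data sources, so consumers do not need read access to an entire upstream state.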

data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-tf-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  subnet_id     = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
  ami           = data.aws_ami.al2023.id
  instance_type = "t3.small"
}
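The consumer above only works if the network configuration exposes the value as a root-module output. A minimal sketch of the producing side, assuming a hypothetical aws_subnet.private resource created with count:

```hcl
# In the network configuration (state key prod/network/terraform.tfstate).
output "private_subnet_ids" {
  description = "IDs of the private subnets, consumed by downstream states"
  value       = aws_subnet.private[*].id # assumes aws_subnet.private uses count
}
```

Only root-module outputs are visible through terraform_remote_state, so values from nested modules have to be re-exported at the root.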

Tools like Terragrunt automate the per-component backend wiring and keep the configuration DRY.

How do you manage resources in two regions or accounts in one config?

Declare a default provider plus aliased providers, then reference an alias per resource or module via providers.

provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "eu"
  region = "eu-west-1"
}

resource "aws_s3_bucket" "dr" {
  provider = aws.eu
  bucket   = "acme-dr-backups"
}
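Passing an alias into a module uses the providers meta-argument. The following sketch assumes a hypothetical ./modules/storage module and a placeholder role ARN for a second account. Neither is taken from a real setup.

```hcl
module "dr_storage" {
  source = "./modules/storage" # hypothetical module path

  # Map the module's default "aws" provider to the aliased eu-west-1 one.
  providers = {
    aws = aws.eu
  }
}

# A second account is just another alias, here reached via an assumed role.
provider "aws" {
  alias  = "audit"
  region = "us-east-1"

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform" # placeholder ARN
  }
}
```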

Testing and policy as code

How do you test Terraform code?

There are two complementary layers. The native terraform test framework (1.6+) runs .tftest.hcl files with run blocks that assert on plan/apply output — fast and dependency-free. Terratest is a Go library for full integration tests that apply real infra, hit it (HTTP, SSH), then destroy.

# tests/bucket.tftest.hcl
run "bucket_is_named" {
  command = plan

  assert {
    condition     = aws_s3_bucket.logs.bucket == "acme-prod-logs"
    error_message = "Bucket name did not match expected value"
  }
}
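Native tests can also run without cloud credentials by mocking the provider (Terraform 1.7+). This sketch assumes a hypothetical module that declares var.environment and builds the bucket name from it, for example "acme-${var.environment}-logs".

```hcl
# tests/bucket_mocked.tftest.hcl
mock_provider "aws" {} # replaces the real provider with generated fake data

variables {
  environment = "staging" # assumes the module declares var.environment
}

run "bucket_name_carries_environment" {
  command = plan

  assert {
    condition     = strcontains(aws_s3_bucket.logs.bucket, var.environment)
    error_message = "Bucket name should include the environment name"
  }
}
```

Because the bucket name comes from configuration rather than from the provider, it is known at plan time and the assertion can be evaluated against the mock.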

How do you enforce policy across many teams?

Use policy as code so guardrails live in version control and run automatically. OPA/Conftest evaluates the plan JSON against Rego rules; Sentinel is HCP Terraform’s native engine; Checkov and tfsec (now folded into Trivy) scan for misconfigurations.

terraform plan -out=tfplan.bin
terraform show -json tfplan.bin > plan.json
conftest test plan.json --policy policy/

CI/CD

What does a safe Terraform pipeline look like?

The core pattern is plan on pull request, apply on merge. The PR runs fmt -check, validate, security scans, and terraform plan, posting the plan as a comment for review. After approval and merge, a protected job runs terraform apply against the saved plan artifact so apply executes exactly what was reviewed — no drift between review and execution.

terraform init -input=false
terraform fmt -check -recursive
terraform validate
terraform plan -out=tfplan.bin   # on PR
terraform apply tfplan.bin       # on merge, protected; a saved plan applies without prompting

Always apply a saved plan file rather than re-planning at apply time. Re-planning can pick up changes that no human reviewed.

Best Practices

Keep state remote, encrypted, and locked; never commit terraform.tfstate to git.
Prefer for_each over count for anything with a stable identity to avoid index churn.
Refactor with moved and adopt resources with import blocks so changes are reviewable in code.
Run scheduled drift detection with plan -detailed-exitcode and alert on exit code 2.
Split large states by blast radius and wire components via remote-state outputs.
Gate every apply behind plan-on-PR, automated policy checks, and a saved plan artifact.
