Terraform State Management at Scale
Best practices for remote state, workspaces, and state locking when managing infrastructure across multiple environments.
Terraform's state file is the source of truth for your infrastructure. When you're managing a single project, the default local state works fine. But as soon as you have multiple environments, team members, or CI/CD pipelines, state management becomes the most critical aspect of your Terraform workflow.
Remote State with Azure Storage
For Azure-based projects, I store Terraform state in Azure Blob Storage with versioning and soft delete enabled:
```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    key                  = "prod/networking.tfstate"
  }
}
```

The storage account should be provisioned outside of Terraform (via a bootstrap script) and configured with:
```bash
#!/bin/bash
az storage account create \
  --name stterraformstate \
  --resource-group rg-terraform-state \
  --sku Standard_GRS \
  --encryption-services blob \
  --allow-blob-public-access false

az storage account blob-service-properties update \
  --account-name stterraformstate \
  --enable-versioning true \
  --enable-delete-retention true \
  --delete-retention-days 30
```
Geo-redundant storage (GRS) ensures your state survives regional outages. Versioning lets you recover from accidental state corruption.
State Locking
Azure Blob Storage provides native lease-based locking. When one terraform apply is running, another attempt will fail with a lock error rather than corrupting state. This is automatic with the azurerm backend — no additional configuration needed.
If a lock gets stuck (e.g., a CI runner crashed mid-apply), you can force-unlock:
```bash
terraform force-unlock LOCK_ID
```

But use this carefully: make sure no other operation is genuinely running before you break the lock.
Structuring State Files
One large state file for everything is an anti-pattern. I structure state by layer and environment:
```text
environments/
  prod/
    networking/   # VNets, subnets, NSGs, peerings
    compute/      # VMs, VMSS, AKS clusters
    data/         # Databases, storage, caches
    monitoring/   # Log Analytics, alerts, dashboards
  staging/
    networking/
    compute/
    data/
```
Each directory has its own state file. This provides:
- Blast radius reduction: a bad apply in `compute/` can't destroy your network
- Faster operations: `terraform plan` runs in seconds instead of minutes
- Independent team workflows: the network team and the app team don't block each other
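Each layer directory carries its own backend configuration pointing at a distinct state key. A minimal sketch for the prod compute layer, reusing the storage account and container names from the bootstrap script above (the file path is illustrative):

```hcl
# environments/prod/compute/backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    key                  = "prod/compute.tfstate" # one key per layer and environment
  }
}
```

Keeping the key convention (`<environment>/<layer>.tfstate`) consistent across directories makes it easy to locate any layer's state in the container.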
Cross-State References with Data Sources
When state files are split, you need to share outputs between them. Use terraform_remote_state data sources:
```hcl
data "terraform_remote_state" "networking" {
  backend = "azurerm"
  config = {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    key                  = "prod/networking.tfstate"
  }
}

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-prod"
  location            = data.terraform_remote_state.networking.outputs.location
  resource_group_name = data.terraform_remote_state.networking.outputs.resource_group_name

  default_node_pool {
    vnet_subnet_id = data.terraform_remote_state.networking.outputs.aks_subnet_id
  }
}
```

State File Hygiene
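For a remote state reference to resolve, the producing configuration must explicitly declare the corresponding outputs. A sketch of what the networking layer might export; the resource names here are assumptions chosen to match the consumer side:

```hcl
# environments/prod/networking/outputs.tf
output "location" {
  value = azurerm_resource_group.networking.location # hypothetical resource name
}

output "resource_group_name" {
  value = azurerm_resource_group.networking.name
}

output "aks_subnet_id" {
  value = azurerm_subnet.aks.id # hypothetical subnet resource
}
```

Only declared outputs are visible across state boundaries, so these `output` blocks are effectively the layer's public API.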
Regularly audit your state with terraform state list and clean up orphaned resources. Use terraform import to bring manually-created resources under management, and terraform state rm to remove resources that are now managed elsewhere.
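Since Terraform 1.5, imports can also be expressed declaratively with an `import` block, which makes them reviewable as part of a plan rather than a one-off CLI action. A sketch with placeholder values:

```hcl
import {
  # Placeholder address and resource ID -- substitute the real Azure resource ID
  to = azurerm_storage_account.legacy
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-legacy/providers/Microsoft.Storage/storageAccounts/stlegacy"
}
```

A subsequent `terraform plan` then shows the pending import alongside any other changes.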
A well-managed state file is the foundation of reliable infrastructure-as-code. Invest the time upfront to get it right.