Data Sources
Not everything your configuration depends on is managed by Terraform. The latest Amazon Linux AMI, a VPC another team owns, your AWS account ID, the availability zones in a region — these already exist, and Terraform should read them rather than create them. Data sources are the read-only counterpart to resources: they query a provider’s API at plan time and expose the result as attributes you can reference. This keeps configurations dynamic and avoids hardcoding IDs that drift over time. Data sources work identically in OpenTofu.
What a data source is
A data block describes information you want to look up, not infrastructure you want to manage. Terraform sends a read request to the provider during terraform plan (and again on apply), receives the matching object, and makes its attributes available to the rest of the configuration. Nothing is created, updated, or destroyed, so a data source is never counted in the Plan: N to add summary as an object Terraform owns.
The syntax mirrors a resource block but starts with the data keyword:
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}
The two labels are the data source type (aws_ami) and a local name (amazon_linux) you choose. The body holds query arguments — the filters and constraints that select exactly one object. If a query matches zero or multiple results when the provider expects one, the plan fails with an error, which is a useful guardrail.
Referencing a data source
You read a data source’s attributes with the data. prefix, then type, name, and attribute:
data.<TYPE>.<NAME>.<ATTRIBUTE>
So the AMI ID found above is data.aws_ami.amazon_linux.id. Resources, by contrast, omit the data. prefix (aws_instance.web.id). Wiring the lookup into a resource argument creates an implicit dependency — Terraform reads the data source first, then uses the value:
resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"

  tags = {
    Name = "web-server"
  }
}
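Now every plan resolves the most recent matching image, with no hand-copied ami-0123... string to maintain. Keep in mind that when a newer image is published the resolved ID changes, and Terraform will propose replacing the instance on the next plan.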
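To see which image a lookup actually resolved to, one option is to surface the data source's attributes as outputs. The sketch below assumes the data.aws_ami.amazon_linux block shown earlier; the output names are illustrative choices, not part of the original configuration.

```hcl
# Illustrative output names (assumptions); the data source is the one defined above.
output "amazon_linux_ami_id" {
  description = "ID of the AMI selected by data.aws_ami.amazon_linux"
  value       = data.aws_ami.amazon_linux.id
}

output "amazon_linux_ami_name" {
  description = "Full name of the selected image, useful for spotting image changes"
  value       = data.aws_ami.amazon_linux.name
}
```

Both references follow the data.<TYPE>.<NAME>.<ATTRIBUTE> pattern described above.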
When to use a data source
Reach for a data source whenever you need a value that lives outside your configuration’s control:
| Scenario | Example data source | What it returns |
|---|---|---|
| Latest machine image | aws_ami | A current AMI ID matching a filter |
| Resources another team owns | aws_vpc, aws_subnets | Existing VPC/subnet IDs by tag |
| Account/region context | aws_caller_identity, aws_region | Account ID, region name |
| Available AZs | aws_availability_zones | The zone names in the region |
| Secrets from a vault | aws_secretsmanager_secret_version | A stored secret value |
| Local/remote files & APIs | local_file, http | File contents, an HTTP response body |
The unifying rule: if the object’s lifecycle is managed elsewhere — by another team, by the cloud platform, or by a separate Terraform state — read it with a data source instead of declaring a resource for it.
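As a small illustration of the account and availability-zone rows in the table, the following standalone sketch reads both and collects them into locals. The local and output names are hypothetical.

```hcl
# Only list zones that are currently usable in the provider's region.
data "aws_availability_zones" "available" {
  state = "available"
}

# Takes no arguments; returns details about the credentials Terraform is using.
data "aws_caller_identity" "current" {}

locals {
  # Hypothetical local names, chosen for this example.
  az_names   = data.aws_availability_zones.available.names
  account_id = data.aws_caller_identity.current.account_id
}

output "deployment_context" {
  value = {
    account_id = local.account_id
    zones      = local.az_names
  }
}
```

Neither lookup hardcodes anything about the account or region, so the same configuration can be pointed at a different account or region without edits.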
Data sources are normally read at plan time, so the objects they query should already exist. If a data source's arguments depend on a resource being created in the same run, those values are (known after apply) and Terraform defers the read until apply. Querying for something that does not yet exist, however, fails with an error. Apply foundational stacks first, then read them downstream.
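A deferred read can be sketched as follows. This is a minimal illustration with hypothetical names and an arbitrary CIDR range: because the lookup's vpc_id argument comes from a resource that does not exist yet, Terraform cannot run the query during plan.

```hcl
# A VPC created in this run; its ID is unknown until apply.
resource "aws_vpc" "new" {
  cidr_block = "10.0.0.0/16" # arbitrary example range
}

# vpc_id is (known after apply) on the first run, so this read is
# deferred until aws_vpc.new has been created.
data "aws_route_tables" "new_vpc" {
  vpc_id = aws_vpc.new.id
}

output "new_vpc_route_table_ids" {
  value = data.aws_route_tables.new_vpc.ids
}
```

On the first plan the output would show as known after apply. On later runs, once the VPC exists, the read happens at plan time like any other data source.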
Worked example
A common pattern: deploy an instance into a pre-existing VPC, on the latest image, tagged with the owning account. This combines three lookups with one managed resource.
provider "aws" {
  region = "us-east-1"
}

data "aws_caller_identity" "current" {}

data "aws_vpc" "main" {
  tags = {
    Name = "production"
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private"
  }
}

data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.small"
  subnet_id     = data.aws_subnets.private.ids[0]

  tags = {
    Name      = "app-server"
    AccountId = data.aws_caller_identity.current.account_id
  }
}
Running a plan shows the data sources being read first, then the single resource to create:
Output:
data.aws_caller_identity.current: Reading...
data.aws_vpc.main: Reading...
data.aws_ami.amazon_linux: Reading...
data.aws_caller_identity.current: Read complete after 0s [id=123456789012]
data.aws_ami.amazon_linux: Read complete after 1s [id=ami-0c2b8ca1dc6cf8160]
data.aws_vpc.main: Read complete after 1s [id=vpc-0abc123def456]
data.aws_subnets.private: Reading...
data.aws_subnets.private: Read complete after 1s [id=us-east-1]

  # aws_instance.app will be created
  + resource "aws_instance" "app" {
      + ami           = "ami-0c2b8ca1dc6cf8160"
      + instance_type = "t3.small"
      + subnet_id     = "subnet-0aa11bb22cc33dd44"
      + id            = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.
Notice that none of the data reads count toward the plan total — only aws_instance.app is managed. The resolved values (ami-0c2b8..., the subnet, the account ID) flow straight into the resource arguments.
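One fragile spot in this example is ids[0]: a list-style lookup such as aws_subnets may, as far as I am aware, return an empty list rather than fail, in which case the index expression is what errors. A possible guard, assuming Terraform 1.2 or later (where data blocks accept lifecycle postconditions), is a variant of the same data block with an explicit check. It would replace the block above, not sit alongside it, and the error message wording is only a suggestion.

```hcl
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private"
  }

  lifecycle {
    # Fail the plan with a readable message if the query matched nothing.
    postcondition {
      condition     = length(self.ids) > 0
      error_message = "No subnets tagged Tier=private were found in the production VPC."
    }
  }
}
```

With the check in place, a missing or mis-tagged subnet surfaces as a clear message on the data source instead of an index error on the instance.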
Best Practices
- Use data sources for anything you do not own: shared VPCs, platform AMIs, account/region context, and secrets — never hardcode the IDs.
- Make filters specific enough to match exactly one object; most_recent = true plus a tight name filter avoids ambiguous or failing reads.
- Reference data attributes directly (data.type.name.attr) so Terraform builds the dependency order for you instead of using manual depends_on.
- Read cross-stack values with terraform_remote_state or a published output rather than re-querying the cloud when the source is another Terraform configuration.
- Remember data sources hit live APIs on every plan: overly broad lookups (e.g. listing all AMIs) slow runs and can be rate-limited.
- Treat data-sourced secrets as sensitive; they land in plaintext state just like resource attributes, so secure your backend.
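The cross-stack practice above might look like the following sketch. Everything specific here is an assumption for illustration: the S3 backend, the bucket and key names, and an output called private_subnet_id published by a separate network configuration.

```hcl
# Reads the outputs of another Terraform configuration's state.
# Backend type, bucket, key, and region are hypothetical placeholders.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Assumes the network stack declares: output "private_subnet_id" { ... }
  private_subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
```

Only values the other configuration explicitly declares as outputs are visible this way, which keeps the contract between stacks narrow.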