Change Data Capture Pattern
Most data still lives in relational databases, but the value of that data multiplies when other systems can react to it the instant it changes. Change Data Capture (CDC) turns a database into a real-time event source: instead of polling tables or bolting publish calls onto your application, you tail the database’s transaction log and stream every insert, update, and delete into Kafka. From there a single change feed can fan out to search indexes, caches, analytics warehouses, and other microservices — each consuming at its own pace without ever touching the source database. CDC is the backbone of replication, cache invalidation, and the data integration layer of modern event-driven systems.
What CDC captures and why the log matters
A naive approach to detecting changes is to poll a table for rows with a newer updated-at timestamp. Polling misses deletes, adds query load, introduces latency, and cannot reconstruct the exact order of operations. Log-based CDC instead reads the database's own change log: PostgreSQL's write-ahead log through a logical replication slot, MySQL's binlog, SQL Server's CDC tables, or MongoDB's oplog (exposed through change streams). Because the log is the authoritative, ordered record of every committed transaction, CDC sees every change as it happened, including deletes and in-transaction ordering, typically with sub-second latency and little extra load on the running database.
Each captured change becomes a Kafka record. A typical Debezium change event carries the row’s state, what kind of operation occurred, and the source position:
{
  "op": "u",
  "before": { "id": 1042, "status": "PENDING", "total": 99.50 },
  "after": { "id": 1042, "status": "SHIPPED", "total": 99.50 },
  "source": { "db": "orders", "table": "orders", "lsn": 23984710 },
  "ts_ms": 1717245000123
}
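The op field is c (create), u (update), d (delete), or r (read, emitted during the initial snapshot). When both images are present, the before/after pair lets consumers compute exactly what changed. On PostgreSQL, a full before image for updates and deletes requires REPLICA IDENTITY FULL on the table; with the default replica identity, before is null or carries only the primary key columns.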
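As a rough illustration of using the two images, the sketch below diffs them column by column. It assumes the event has already been deserialized into plain maps and that a full before image is available; `ChangeDiff` and `changedColumns` are hypothetical names, not part of Debezium.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper for illustration only; not a Debezium API.
class ChangeDiff {

    // Returns column -> new value for every column whose value differs
    // between the before and after images of an update event.
    static Map<String, Object> changedColumns(Map<String, Object> before,
                                              Map<String, Object> after) {
        Map<String, Object> changed = new LinkedHashMap<>();
        for (Map.Entry<String, Object> column : after.entrySet()) {
            if (!Objects.equals(before.get(column.getKey()), column.getValue())) {
                changed.put(column.getKey(), column.getValue());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Object> before = Map.of(
                "id", 1042, "status", "PENDING", "total", new BigDecimal("99.50"));
        Map<String, Object> after = Map.of(
                "id", 1042, "status", "SHIPPED", "total", new BigDecimal("99.50"));
        System.out.println(changedColumns(before, after)); // prints {status=SHIPPED}
    }
}
```

A consumer that only cares about status transitions could skip events where this diff does not contain the status column.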
Architecture: one source, many consumers
CDC inverts the usual integration model. Rather than each downstream system querying the database, the database publishes its changes once and every consumer subscribes independently. This decouples the source from its readers and lets you add new consumers without modifying the producing service.
         ┌──────────────────┐
         │ Source database  │  (Postgres / MySQL / SQL Server)
         │ + write-ahead    │
         │   log            │
         └────────┬─────────┘
                  │ log-based capture
                  ▼
         ┌──────────────────┐
         │ Debezium (Kafka  │
         │ Connect)         │
         └────────┬─────────┘
                  │ change events
                  ▼
         ┌──────────────────┐
         │      Kafka       │  topic per table: orders.public.orders
         └───┬────┬────┬────┘
             │    │    │
    ┌────────┘    │    └────────────┐
    ▼             ▼                 ▼
┌────────┐   ┌──────────┐   ┌───────────────┐
│ Search │   │  Cache   │   │  Analytics /  │
│ index  │   │ (Redis)  │   │  warehouse    │
└────────┘   └──────────┘   └───────────────┘
By convention Debezium routes each table to its own topic named <topic.prefix>.<schema>.<table> (the prefix is the connector's logical server name), keyed by the primary key so all changes to a given row land on the same partition and stay ordered.
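The naming and keying convention can be sketched as follows. This is an illustration only: `ChangeTopics` is a hypothetical class, the key is simplified to a string (Debezium's real key is a structure holding the primary key columns), and `partitionFor` uses `String.hashCode` as a stand-in for Kafka's own key hashing. The point is only that partition selection is a deterministic function of the key.

```java
// Hypothetical illustration; not Debezium or Kafka client code.
class ChangeTopics {

    // Builds a topic name following the <prefix>.<schema>.<table> convention.
    static String topicFor(String prefix, String schema, String table) {
        return prefix + "." + schema + "." + table;
    }

    // Stand-in for Kafka's key-based partitioning: the same key always maps
    // to the same partition, so changes to one row keep their order.
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        System.out.println(topicFor("orders", "public", "orders")); // prints orders.public.orders
        // Two events for row 1042 resolve to the same partition.
        System.out.println(partitionFor("1042", 6) == partitionFor("1042", 6)); // prints true
    }
}
```

Ordering is guaranteed only within a partition, so changes to different rows may be observed in a different relative order than they were committed.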
Snapshots plus ongoing streaming
When a connector starts for the first time, the log only contains recent transactions — not the rows that already existed. CDC solves this with a two-phase approach. First it performs an initial snapshot, reading the current contents of each captured table and emitting them as r events so consumers can build complete state. Then it switches to streaming, tailing the log from the exact position the snapshot captured, so no change is missed or duplicated at the handoff. Debezium records its log offset in Kafka, so a restarted connector resumes streaming from where it left off rather than re-snapshotting. Incremental (signal-based) snapshots let you re-snapshot a table while streaming continues, avoiding the long lock-and-pause of a blocking snapshot.
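The snapshot-to-streaming handoff can be modelled in a few lines. The sketch below is a simplified simulation, not Debezium's implementation: `SnapshotHandoff`, `LogEntry`, and `Event` are hypothetical names, the table is a map of id to status, and log positions are plain numbers. It emits existing rows as r events, then only those log entries positioned after the snapshot point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of the two-phase capture; for illustration only.
class SnapshotHandoff {

    record LogEntry(long position, String op, long id, String status) {}
    record Event(String op, long id, String status) {}

    // Phase 1: emit every row visible at snapshot time as an "r" event.
    // Phase 2: emit only log entries written after the snapshot position,
    // so changes already reflected in the snapshot are not sent again.
    static List<Event> capture(Map<Long, String> tableAtSnapshot,
                               long snapshotPosition,
                               List<LogEntry> log) {
        List<Event> events = new ArrayList<>();
        for (Map.Entry<Long, String> row : new TreeMap<>(tableAtSnapshot).entrySet()) {
            events.add(new Event("r", row.getKey(), row.getValue()));
        }
        for (LogEntry entry : log) {
            if (entry.position() > snapshotPosition) {
                events.add(new Event(entry.op(), entry.id(), entry.status()));
            }
        }
        return events;
    }

    public static void main(String[] args) {
        List<LogEntry> log = List.of(
                new LogEntry(90, "c", 1, "PENDING"),   // already included in the snapshot
                new LogEntry(110, "u", 1, "SHIPPED")); // happened after the snapshot
        System.out.println(capture(Map.of(1L, "PENDING"), 100, log));
    }
}
```

In this model the insert at position 90 reaches consumers once, as an r event, and the update at position 110 follows as a normal u event.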
Setting up a Debezium PostgreSQL connector
Debezium runs as a connector inside Kafka Connect. The configuration below captures the orders and customers tables and streams them to Kafka using PostgreSQL’s logical replication.
{
  "name": "orders-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/secrets:db_password}",
    "database.dbname": "orders",
    "topic.prefix": "orders",
    "plugin.name": "pgoutput",
    "table.include.list": "public.orders,public.customers",
    "snapshot.mode": "initial",
    "tombstones.on.delete": "true"
  }
}
Register it against the Connect REST API:
curl -i -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @orders-cdc-connector.json
Output (abbreviated):
HTTP/1.1 201 Created
{"name":"orders-cdc-connector","tasks":[],"type":"source"}
Consuming change events in Spring Boot
A downstream service listens to the change topic and projects each event into its own store — here, keeping a search index in sync.
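import java.math.BigDecimal;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.messaging.handler.annotation.Payload;
import org.springframework.stereotype.Component;

// Subset of the Debezium envelope; assumes the configured JSON deserializer maps it onto these records.
public record OrderChange(String op, OrderRow before, OrderRow after) {}
public record OrderRow(long id, String status, BigDecimal total) {}

@Component
public class OrderIndexer {
    private final SearchIndex index;

    public OrderIndexer(SearchIndex index) {
        this.index = index;
    }

    @KafkaListener(topics = "orders.public.orders", groupId = "search-indexer")
    public void onChange(@Payload(required = false) OrderChange change) {
        if (change == null) {
            return; // tombstone that follows a delete: nothing left to index
        }
        switch (change.op()) {
            case "c", "u", "r" -> index.upsert(change.after());
            // A delete event has no after image; the row's key is in before.
            case "d" -> index.delete(change.before().id());
            default -> { /* other op codes: ignore */ }
        }
    }
}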
A delete in Debezium produces a delete event followed by a tombstone (a record with a null value) on the same key, which lets log-compacted topics physically remove the key. Make sure your deserializer tolerates null values so tombstones do not crash the listener.
CDC delivers at-least-once: a connector restart can replay events around the last committed offset. Always make your projections idempotent — upserts keyed by primary key and deletes that are safe to apply twice handle replays naturally.
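To make the idempotency requirement concrete, here is a minimal sketch of a projection that tolerates both tombstones and replays. It assumes events arrive already deserialized and uses an in-memory map in place of a real store; `IdempotentProjection`, `Row`, and `Change` are hypothetical names for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory projection; a real one would write to a database or index.
class IdempotentProjection {

    record Row(long id, String status) {}
    record Change(String op, Row before, Row after) {}

    private final Map<Long, Row> store = new HashMap<>();

    void apply(Change change) {
        if (change == null) {
            return; // tombstone (null value): nothing to project
        }
        switch (change.op()) {
            // Upsert keyed by primary key: applying it twice leaves the same row.
            case "c", "u", "r" -> store.put(change.after().id(), change.after());
            // Removing an absent key is a no-op, so a replayed delete is harmless.
            case "d" -> store.remove(change.before().id());
            default -> { /* unknown op: ignore */ }
        }
    }

    // Immutable copy of the current projected state.
    Map<Long, Row> state() {
        return Map.copyOf(store);
    }
}
```

Replaying a contiguous run of events in their original order converges to the same final state, although the projection may briefly show an older value while the replay is in progress.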
The outbox variant
Streaming raw table changes works for replication and caches, but it leaks your internal schema to consumers and couples them to physical column names. The transactional outbox combines the reliability of CDC with a clean, intentional event contract: the application writes a purpose-built event row into an outbox table within the same transaction as the business change, and Debezium’s outbox event router captures that row and emits a domain event. Consumers depend on the published event shape, not your table layout, and you remain free to refactor the schema. See the dedicated outbox pattern page for the full implementation.
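As a small illustration of what such an outbox row might contain, the sketch below builds one for a shipped order. The field names follow the default column names expected by Debezium's outbox event router (id, aggregatetype, aggregateid, type, payload). The class name, the event type `OrderShipped`, and the payload shape are assumptions made for this example. The insert into the outbox table, inside the same transaction as the business update, is omitted.

```java
import java.util.UUID;

// Hypothetical sketch of building an outbox row; persistence is left out.
class OutboxSketch {

    // One row of the outbox table, using the event router's default column names.
    record OutboxRow(UUID id, String aggregatetype, String aggregateid,
                     String type, String payload) {}

    // Builds the domain event the application would insert alongside the
    // order update. The payload is the published contract, deliberately
    // independent of the orders table's column layout.
    static OutboxRow orderShipped(long orderId) {
        String payload = "{\"orderId\":" + orderId + ",\"status\":\"SHIPPED\"}";
        return new OutboxRow(UUID.randomUUID(), "order",
                Long.toString(orderId), "OrderShipped", payload);
    }

    public static void main(String[] args) {
        System.out.println(orderShipped(1042).payload()); // prints {"orderId":1042,"status":"SHIPPED"}
    }
}
```

Because consumers see only the payload, renaming a column in the orders table changes this factory method rather than every downstream service.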
Common use cases
| Use case | What CDC provides |
|---|---|
| Search indexing | Keep Elasticsearch/OpenSearch in sync with the source of truth in real time |
| Cache invalidation | Evict or refresh Redis entries the moment the underlying row changes |
| Analytics / data lake | Stream changes into a warehouse for near-real-time reporting |
| Microservice replication | Give a service a local read replica of another service’s data |
| Audit and history | Capture an immutable log of every change for compliance |
| Legacy strangler | Mirror a legacy DB into Kafka while new services migrate off it |
Best Practices
- Prefer log-based CDC over query/timestamp polling — it captures deletes, preserves ordering, and barely touches the source database.
- Make every consumer idempotent; CDC is at-least-once, so projections must tolerate replayed and duplicated events after a restart.
- Use the row’s primary key as the Kafka key so all changes to a record stay ordered within one partition.
- Handle tombstone (null-value) records explicitly and enable log compaction on change topics that represent current state.
- Plan snapshots carefully — use incremental snapshots to backfill large tables without blocking ongoing streaming.
- Reach for the outbox variant when consumers need a stable domain contract rather than your raw, evolving table schema.
- Monitor connector lag and replication slot growth; an unconsumed Postgres slot can fill the disk on the source database.
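For the slot-growth check in the last point, one possible building block is a helper that turns two PostgreSQL LSN strings into a byte distance. The sketch assumes the values were already read from the database, for example `pg_current_wal_lsn()` and the slot's `restart_lsn` in `pg_replication_slots`. `SlotLag` and its methods are hypothetical names, and the query and the alerting threshold are left out.

```java
// Hypothetical helper for estimating how much WAL a replication slot retains.
class SlotLag {

    // Parses PostgreSQL's textual LSN form "X/Y", where X holds the high
    // 32 bits and Y the low 32 bits of the log position, both in hex.
    static long parseLsn(String lsn) {
        String[] parts = lsn.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("not an LSN: " + lsn);
        }
        long high = Long.parseLong(parts[0], 16);
        long low = Long.parseLong(parts[1], 16);
        return (high << 32) | low;
    }

    // Bytes of WAL between the slot's position and the current write position.
    static long lagBytes(String currentWalLsn, String slotLsn) {
        return parseLsn(currentWalLsn) - parseLsn(slotLsn);
    }

    public static void main(String[] args) {
        System.out.println(lagBytes("0/2000000", "0/1000000")); // prints 16777216
    }
}
```

A steadily growing value suggests the connector has stopped or fallen behind, and the source database is retaining WAL on its behalf.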