Multi-Datacenter & Geo-Replication
Running Kafka in a single rack or availability zone exposes you to correlated failures: a power outage, network partition, or cloud-region incident can take your entire event backbone offline. Spreading Kafka across datacenters buys you higher availability and disaster recovery, but it forces a hard choice between two fundamentally different topologies — a single stretch cluster that spans locations, or independent clusters connected by asynchronous replication. This page compares both, explains the latency and consistency trade-offs, and shows how rack awareness keeps replicas physically diverse.
Two topologies for multi-DC Kafka
There are exactly two ways to make Kafka survive the loss of a datacenter, and they differ in where the cluster boundary sits.
A stretch cluster is one logical Kafka cluster whose brokers and KRaft controllers are physically distributed across two or three locations. Replication is synchronous within the cluster, so a topic with replication.factor=3 placed across three sites means an acknowledged write already lives in all three. Failover is automatic — partition leadership simply moves to a surviving broker.
Cross-cluster replication runs two or more fully independent clusters, each self-contained, and copies records between them with MirrorMaker 2 (MM2). Replication is asynchronous, so the destination always lags by some amount. Failover is a deliberate operation: applications must be redirected to the surviving cluster, and a small window of in-flight data may be lost.
Stretch cluster (one cluster, 3 sites) Cross-cluster (two clusters, async)
AZ-a AZ-b AZ-c Region A Region B
┌────┐ ┌────┐ ┌────┐ ┌─────────┐ MM2 ┌─────────┐
│ B1 │ │ B2 │ │ B3 │ │ cluster │ =====> │ cluster │
└──┬─┘ └──┬─┘ └──┬─┘ │ us │ │ eu │
└─ sync replication ┘ └─────────┘ └─────────┘
RF=3, min.insync.replicas=2 independent quorums, lag = ms..s
Stretch clusters
A stretch cluster is the simplest mental model: it behaves like an ordinary cluster, and consumers/producers need no special configuration. The catch is latency. Because acks=all blocks until min.insync.replicas brokers persist the record, every produce request pays the round-trip time between datacenters. This is only viable when sites are close — typically AZs within a single cloud region, where inter-AZ latency is 1-2 ms.
The classic safe layout uses three sites so a majority quorum survives losing any one. A two-site stretch cannot maintain quorum after a site failure without a tie-breaker, which is why three is strongly preferred for both KRaft controllers and data replicas.
# Per-broker server.properties — give each broker its true location
broker.rack=az-a
process.roles=broker,controller
# Topic defaults that make a stretch cluster actually durable
default.replication.factor=3
min.insync.replicas=2
Never run a stretch cluster across two datacenters only. With an even number of sites you cannot form a majority quorum after a partition, and the cluster will halt rather than risk split-brain. Use three sites, or use cross-cluster replication instead.
Rack awareness with broker.rack
Replication factor alone does not guarantee physical diversity — without help, Kafka could place all three replicas of a partition in the same rack or AZ. Setting broker.rack on each broker turns on rack-aware replica placement, which spreads the replicas of every partition across as many distinct racks as possible.
# Verify replica placement is spread across racks/AZs
kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --topic orders
Output:
Topic: orders PartitionCount: 3 ReplicationFactor: 3
Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3 # brokers in az-a, az-b, az-c
Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
Rack awareness also enables follower fetching, letting a consumer read from the nearest in-sync replica instead of always the leader. This cuts cross-AZ data-transfer cost, which can dominate cloud bills.
# Broker: expose rack-aware fetching
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
// Consumer: declare its own rack so the broker routes it to a local follower
Properties props = new Properties();
props.put("bootstrap.servers", "broker-az-a:9092");
props.put("group.id", "order-processor");
props.put("client.rack", "az-a");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
consumer.subscribe(java.util.List.of("orders"));
// poll loop...
}
Cross-cluster replication (MirrorMaker 2)
When datacenters are far apart — different cloud regions or continents — synchronous replication is too slow, so you replicate asynchronously. MM2 runs on Kafka Connect and copies topics, consumer-group offsets, and ACLs between clusters. Two patterns dominate:
- Active-passive (DR): one cluster takes all traffic; MM2 mirrors it to a standby. On disaster you fail applications over to the standby.
- Active-active: both clusters take local writes and mirror to each other. Topics are renamed with a source prefix (
us.orders,eu.orders) to prevent infinite replication loops.
# connect-mirror-maker.properties — active-active between us and eu
clusters = us, eu
us.bootstrap.servers = broker-us:9092
eu.bootstrap.servers = broker-eu:9092
us->eu.enabled = true
us->eu.topics = orders.*
eu->us.enabled = true
eu->us.topics = orders.*
# Keep consumer offsets translated so failover resumes near where it left off
sync.group.offsets.enabled = true
emit.checkpoints.enabled = true
replication.factor = 3
Choosing a topology
| Dimension | Stretch cluster | Cross-cluster (MM2) |
|---|---|---|
| Cluster boundary | One cluster, many sites | Independent clusters |
| Replication | Synchronous | Asynchronous |
| Distance suited to | AZs in one region (low latency) | Regions / continents |
| Data loss on site failure | None (committed = replicated) | Possible (replication lag) |
| Failover | Automatic leader election | Manual / orchestrated redirect |
| Site count for quorum | 3 (majority) | 2+ independent quorums |
| Operational complexity | Lower | Higher (offset translation, naming) |
MM2 replicates the topic name with a source prefix by default. Producers and consumers that fail over must be configured for the renamed topic (e.g.
us.orders), or use offset-translation checkpoints so the consumer resumes at the correct position.
Best Practices
- Prefer a three-site stretch cluster across AZs in one region for the strongest consistency with automatic failover; use MM2 only when latency makes that impractical.
- Always set
broker.rackon every broker so replicas are physically diverse — replication factor without rack awareness is a false sense of safety. - Pin
min.insync.replicas=2withreplication.factor=3so a single site failure never blocks acknowledged writes nor risks silent data loss. - Enable
RackAwareReplicaSelectorandclient.rackto keep most consumer reads local, slashing cross-AZ transfer costs. - For DR clusters, enable MM2 offset and checkpoint sync so consumers can resume on the standby without reprocessing or skipping data.
- Test failover regularly — a DR plan that has never been exercised is a hypothesis, not a guarantee.
- Monitor replication lag (stretch: ISR shrink/under-replicated partitions; MM2:
replication-latency-ms) and alert before it threatens your recovery objectives.