Multi-Datacenter & Geo-Replication

Running Kafka in a single rack or availability zone exposes you to correlated failures: a power outage, network partition, or cloud-region incident can take your entire event backbone offline. Spreading Kafka across datacenters buys you higher availability and disaster recovery, but it forces a hard choice between two fundamentally different topologies — a single stretch cluster that spans locations, or independent clusters connected by asynchronous replication. This page compares both, explains the latency and consistency trade-offs, and shows how rack awareness keeps replicas physically diverse.

Two topologies for multi-DC Kafka

There are exactly two ways to make Kafka survive the loss of a datacenter, and they differ in where the cluster boundary sits.

A stretch cluster is one logical Kafka cluster whose brokers and KRaft controllers are physically distributed across two or three locations. Replication is synchronous within the cluster, so a topic with replication.factor=3 placed across three sites means an acknowledged write already lives in all three. Failover is automatic — partition leadership simply moves to a surviving broker.

Cross-cluster replication runs two or more fully independent clusters, each self-contained, and copies records between them with MirrorMaker 2 (MM2). Replication is asynchronous, so the destination always lags by some amount. Failover is a deliberate operation: applications must be redirected to the surviving cluster, and a small window of in-flight data may be lost.

  Stretch cluster (one cluster, 3 sites)        Cross-cluster (two clusters, async)

   AZ-a      AZ-b      AZ-c                       Region A            Region B
  ┌────┐    ┌────┐    ┌────┐                    ┌─────────┐  MM2   ┌─────────┐
  │ B1 │    │ B2 │    │ B3 │                    │ cluster │ =====> │ cluster │
  └──┬─┘    └──┬─┘    └──┬─┘                    │   us    │        │   eu    │
     └─ sync replication ┘                      └─────────┘        └─────────┘
   RF=3, min.insync.replicas=2                  independent quorums, lag = ms..s

Stretch clusters

A stretch cluster is the simplest mental model: it behaves like an ordinary cluster, and consumers/producers need no special configuration. The catch is latency. Because acks=all blocks until min.insync.replicas brokers persist the record, every produce request pays the round-trip time between datacenters. This is only viable when sites are close — typically AZs within a single cloud region, where inter-AZ latency is 1-2 ms.

The classic safe layout uses three sites so a majority quorum survives losing any one. A two-site stretch cannot maintain quorum after a site failure without a tie-breaker, which is why three is strongly preferred for both KRaft controllers and data replicas.

# Per-broker server.properties — give each broker its true location
broker.rack=az-a
process.roles=broker,controller

# Topic defaults that make a stretch cluster actually durable
default.replication.factor=3
min.insync.replicas=2

Never run a stretch cluster across two datacenters only. With an even number of sites you cannot form a majority quorum after a partition, and the cluster will halt rather than risk split-brain. Use three sites, or use cross-cluster replication instead.

Rack awareness with broker.rack

Replication factor alone does not guarantee physical diversity — without help, Kafka could place all three replicas of a partition in the same rack or AZ. Setting broker.rack on each broker turns on rack-aware replica placement, which spreads the replicas of every partition across as many distinct racks as possible.

# Verify replica placement is spread across racks/AZs
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic orders

Output:

Topic: orders  PartitionCount: 3  ReplicationFactor: 3
  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3   # brokers in az-a, az-b, az-c
  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2

Rack awareness also enables follower fetching, letting a consumer read from the nearest in-sync replica instead of always the leader. This cuts cross-AZ data-transfer cost, which can dominate cloud bills.

# Broker: expose rack-aware fetching
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

// Consumer: declare its own rack so the broker routes it to a local follower
Properties props = new Properties();
props.put("bootstrap.servers", "broker-az-a:9092");
props.put("group.id", "order-processor");
props.put("client.rack", "az-a");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(java.util.List.of("orders"));
    // poll loop...
}

Cross-cluster replication (MirrorMaker 2)

When datacenters are far apart — different cloud regions or continents — synchronous replication is too slow, so you replicate asynchronously. MM2 runs on Kafka Connect and copies topics, consumer-group offsets, and ACLs between clusters. Two patterns dominate:

Active-passive (DR): one cluster takes all traffic; MM2 mirrors it to a standby. On disaster you fail applications over to the standby.
Active-active: both clusters take local writes and mirror to each other. Topics are renamed with a source prefix (us.orders, eu.orders) to prevent infinite replication loops.

# connect-mirror-maker.properties — active-active between us and eu
clusters = us, eu
us.bootstrap.servers = broker-us:9092
eu.bootstrap.servers = broker-eu:9092

us->eu.enabled = true
us->eu.topics = orders.*
eu->us.enabled = true
eu->us.topics = orders.*

# Keep consumer offsets translated so failover resumes near where it left off
sync.group.offsets.enabled = true
emit.checkpoints.enabled = true
replication.factor = 3

Choosing a topology

Dimension	Stretch cluster	Cross-cluster (MM2)
Cluster boundary	One cluster, many sites	Independent clusters
Replication	Synchronous	Asynchronous
Distance suited to	AZs in one region (low latency)	Regions / continents
Data loss on site failure	None (committed = replicated)	Possible (replication lag)
Failover	Automatic leader election	Manual / orchestrated redirect
Site count for quorum	3 (majority)	2+ independent quorums
Operational complexity	Lower	Higher (offset translation, naming)

MM2 replicates the topic name with a source prefix by default. Producers and consumers that fail over must be configured for the renamed topic (e.g. us.orders), or use offset-translation checkpoints so the consumer resumes at the correct position.

Best Practices

Prefer a three-site stretch cluster across AZs in one region for the strongest consistency with automatic failover; use MM2 only when latency makes that impractical.
Always set broker.rack on every broker so replicas are physically diverse — replication factor without rack awareness is a false sense of safety.
Pin min.insync.replicas=2 with replication.factor=3 so a single site failure never blocks acknowledged writes nor risks silent data loss.
Enable RackAwareReplicaSelector and client.rack to keep most consumer reads local, slashing cross-AZ transfer costs.
For DR clusters, enable MM2 offset and checkpoint sync so consumers can resume on the standby without reprocessing or skipping data.
Test failover regularly — a DR plan that has never been exercised is a hypothesis, not a guarantee.
Monitor replication lag (stretch: ISR shrink/under-replicated partitions; MM2: replication-latency-ms) and alert before it threatens your recovery objectives.