Replication & In-Sync Replicas
Replication is the mechanism that lets Kafka survive broker failures without losing data. Every partition is stored as multiple copies spread across different brokers, and Kafka tracks which copies are caught up so it can promote a healthy one when a broker dies. Understanding leaders, followers, and the in-sync replica (ISR) set is essential because the interaction between acks, min.insync.replicas, and the ISR is what ultimately determines your durability guarantees in production.
Replication factor
Each topic partition has a replication factor (RF) that controls how many copies of its data exist. With RF=3, every record written to a partition lives on three brokers. The replication factor is set per topic at creation time and bounded by the number of brokers in the cluster — you cannot have an RF larger than your broker count.
kafka-topics.sh --bootstrap-server localhost:9092 \
--create --topic orders \
--partitions 6 --replication-factor 3
You can inspect where replicas live and which are in sync with --describe:
kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --topic orders
Output:
Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: orders Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
Leaders and followers
For each partition, exactly one replica is elected the leader; the rest are followers. All producer writes and (by default) all consumer reads go through the leader — this gives a single, ordered source of truth for the partition’s log. Followers exist purely for redundancy: they do not serve client traffic in the default configuration.
Followers stay current by continuously fetching from the leader. Each follower issues fetch requests exactly as a consumer would, pulling records starting from its current log-end offset and appending them to its own copy of the log. The leader tracks how far each follower has replicated. The high watermark — the offset up to which all in-sync replicas have copied — defines what consumers are allowed to see, ensuring readers never observe records that could be lost on failover.
The in-sync replica (ISR) set
The ISR is the subset of replicas (including the leader) that are sufficiently caught up with the leader’s log. A follower is considered in-sync if it has fetched all messages up to the leader’s log-end offset within a recent time window. The leader maintains the ISR and reports changes to the controller.
A follower is removed from the ISR when it falls behind by more than replica.lag.time.max.ms (default 30000 ms). This is a time-based check, not a message-count check: if a follower has not caught up to the leader within that window — whether because the broker is slow, GC-paused, or partitioned off — it is ejected. When the follower catches back up, it rejoins the ISR automatically.
Why time, not lag count? Older Kafka used
replica.lag.max.messages, but a healthy follower could be transiently kicked during a legitimate produce burst. Time-based lag tolerates bursts while still detecting genuinely stuck brokers.
Only replicas in the ISR are eligible to become leader during a clean election. This guarantees the new leader already holds every committed record.
How acks and min.insync.replicas interact with the ISR
Durability is a contract between the producer and the broker, negotiated through two settings:
acks(producer) controls how many acknowledgements the producer waits for. Withacks=all(a.k.a.acks=-1), the leader waits until all replicas currently in the ISR have replicated the record before acknowledging.min.insync.replicas(broker/topic) sets the minimum ISR size required for a write to be accepted at all. If the ISR shrinks below this number, the leader rejects produce requests withNotEnoughReplicasException.
These two only provide a strong guarantee together. acks=all alone is weak: if the ISR has collapsed to just the leader, “all” means one replica, and a single failure loses data. min.insync.replicas=2 forces at least two in-sync copies of every committed record.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("acks", "all"); // wait for the full ISR
props.put("enable.idempotence", "true");
props.put("retries", Integer.MAX_VALUE);
try (Producer<String, String> producer = new KafkaProducer<>(props)) {
producer.send(new ProducerRecord<>("orders", "order-42", "CONFIRMED"));
}
Set min.insync.replicas on the topic so it applies regardless of the client:
kafka-configs.sh --bootstrap-server localhost:9092 \
--alter --entity-type topics --entity-name orders \
--add-config min.insync.replicas=2
In Spring for Apache Kafka, configure the producer side via application.yml:
spring:
kafka:
producer:
acks: all
properties:
enable.idempotence: true
Replication concepts reference
| Concept | What it means | Typical / default |
|---|---|---|
| Replication factor (RF) | Number of copies of each partition | 3 in production |
| Leader | Replica that serves all reads/writes | one per partition |
| Follower | Replica that fetches from the leader | RF − 1 per partition |
| ISR | Replicas caught up within the lag window | starts equal to all replicas |
replica.lag.time.max.ms | Max time a follower may lag before ISR eviction | 30000 ms |
| High watermark | Offset replicated to all ISR members; visible to consumers | tracked per partition |
acks=all | Producer waits for the full ISR to replicate | recommended |
min.insync.replicas | Minimum ISR size to accept a write | 2 with RF=3 |
unclean.leader.election.enable | Allow out-of-sync replica to become leader | false (keep it off) |
The RF=3, min.insync.replicas=2 sweet spot
The standard durable configuration is RF=3 with min.insync.replicas=2 and producers using acks=all. This combination tolerates the loss of one broker with zero data loss and no write interruption: with one replica down, the ISR is still 2, which satisfies the minimum, so producers keep writing. Only when a second broker fails does the partition stop accepting writes — correctly choosing consistency over availability rather than silently risking data loss.
Avoid the RF=3, min.insync.replicas=3 trap. Setting the minimum equal to the replication factor means any single broker outage (or routine rolling restart) halts producers. Keeping the minimum one below RF gives you durability and the ability to lose a node gracefully.
Best Practices
- Use
RF=3for all production topics; reserveRF=1for throwaway or local-dev clusters only. - Pair
acks=all(producer) withmin.insync.replicas=2(topic) — neither setting is sufficient alone. - Keep
min.insync.replicasat mostRF - 1so you can lose a broker or perform rolling upgrades without blocking writes. - Leave
unclean.leader.election.enable=false; allowing an out-of-sync replica to lead trades silent data loss for availability. - Enable idempotent producers (
enable.idempotence=true) so retries underacks=allnever duplicate records. - Monitor
UnderReplicatedPartitionsandIsrShrinksPerSecmetrics — a persistently shrinking ISR signals a struggling broker, slow disks, or network issues. - Tune
replica.lag.time.max.msonly deliberately; lowering it makes the ISR flap under load, raising it delays detection of stuck followers.