Replication & In-Sync Replicas

Replication is the mechanism that lets Kafka survive broker failures without losing data. Every partition is stored as multiple copies spread across different brokers, and Kafka tracks which copies are caught up so it can promote a healthy one when a broker dies. Understanding leaders, followers, and the in-sync replica (ISR) set is essential because the interaction between acks, min.insync.replicas, and the ISR is what ultimately determines your durability guarantees in production.

Replication factor

Each topic partition has a replication factor (RF) that controls how many copies of its data exist. With RF=3, every record written to a partition lives on three brokers. The replication factor is set per topic at creation time and bounded by the number of brokers in the cluster — you cannot have an RF larger than your broker count.

kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic orders \
  --partitions 6 --replication-factor 3

You can inspect where replicas live and which are in sync with --describe:

kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic orders

Output (abbreviated to the first two of the six partitions):

Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
Topic: orders  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
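To make the broker-count bound concrete, here is a minimal sketch of a round-robin replica placement. It is an illustrative simplification, not Kafka's actual assignment logic, which also randomizes the starting broker and can take racks into account. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a simplified round-robin placement, not Kafka's real assignor.
final class ReplicaPlacementSketch {

    /** Returns, for each partition, the ordered list of broker ids holding its replicas. */
    static List<List<Integer>> assign(List<Integer> brokerIds, int partitions, int replicationFactor) {
        if (replicationFactor > brokerIds.size()) {
            // Two replicas of the same partition never share a broker,
            // so the replication factor cannot exceed the broker count.
            throw new IllegalArgumentException("replication factor " + replicationFactor
                    + " exceeds broker count " + brokerIds.size());
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                // Rotate the starting broker per partition so replicas spread evenly.
                replicas.add(brokerIds.get((p + r) % brokerIds.size()));
            }
            assignment.add(replicas);
        }
        return assignment;
    }
}
```

With brokers 1, 2, 3 and RF=3, this sketch places partition 0 on 1,2,3 and partition 1 on 2,3,1, the same shape as the sample output. Asking for RF=3 on a two-broker list fails immediately.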

Leaders and followers

For each partition, exactly one replica is elected the leader; the rest are followers. All producer writes and (by default) all consumer reads go through the leader — this gives a single, ordered source of truth for the partition’s log. Followers exist purely for redundancy: they do not serve client traffic in the default configuration.

Followers stay current by continuously fetching from the leader. Each follower issues fetch requests much as a consumer would, pulling records starting from its own log-end offset and appending them to its copy of the log. The leader tracks how far each follower has replicated. The high watermark, the offset up to which every in-sync replica has copied the log, defines what consumers are allowed to see, so readers never observe records that could be lost on failover.
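The high watermark rule can be sketched as a minimum over the in-sync replicas' log-end offsets. This is a toy model under stated assumptions, not broker code: the class name, method name, and the map of per-broker offsets are hypothetical, and a log-end offset is taken to mean the offset of the next record a replica would append.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical model of how a leader could derive the high watermark.
final class HighWatermarkSketch {

    /**
     * @param logEndOffsets log-end offset per broker id (leader included)
     * @param isr           broker ids currently in the in-sync replica set
     * @return the smallest log-end offset among ISR members
     */
    static long highWatermark(Map<Integer, Long> logEndOffsets, Set<Integer> isr) {
        if (isr.isEmpty()) {
            throw new IllegalArgumentException("ISR must contain at least the leader");
        }
        long hw = Long.MAX_VALUE;
        for (int brokerId : isr) {
            Long leo = logEndOffsets.get(brokerId);
            if (leo == null) {
                throw new IllegalArgumentException("no log-end offset for broker " + brokerId);
            }
            // Only offsets that every ISR member already holds are safe to expose.
            hw = Math.min(hw, leo);
        }
        return hw;
    }
}
```

For example, with log-end offsets 120 on the leader (broker 1), 118 on broker 2, and 105 on broker 3, an ISR of {1, 2} yields a high watermark of 118. If broker 3 were still in the ISR, the result would drop to 105, which is why a slow follower that stays in the ISR holds back what consumers can read.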

The in-sync replica (ISR) set

The ISR is the subset of replicas (including the leader) that are sufficiently caught up with the leader’s log. A follower is considered in-sync if it has fetched all messages up to the leader’s log-end offset within a recent time window. The leader maintains the ISR and reports changes to the controller.

A follower is removed from the ISR when it falls behind by more than replica.lag.time.max.ms (default 30000 ms). This is a time-based check, not a message-count check: if a follower has not caught up to the leader within that window — whether because the broker is slow, GC-paused, or partitioned off — it is ejected. When the follower catches back up, it rejoins the ISR automatically.

Why time, not lag count? Older Kafka used replica.lag.max.messages, but a healthy follower could be transiently kicked during a legitimate produce burst. Time-based lag tolerates bursts while still detecting genuinely stuck brokers.
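The time-based check can be modeled in a few lines. The sketch below assumes the leader records, for each follower, the last time that follower was fully caught up; the class, method, and parameter names are hypothetical, and the real broker logic has more moving parts.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of time-based ISR membership, not actual broker code.
final class IsrLagCheckSketch {

    /**
     * @param leaderId           broker id of the leader, which is always in the ISR
     * @param lastCaughtUpTimeMs per follower, the last time it had fetched up to the leader's log end
     * @param nowMs              current time in milliseconds
     * @param maxLagMs           stand-in for replica.lag.time.max.ms
     */
    static Set<Integer> inSyncReplicas(int leaderId, Map<Integer, Long> lastCaughtUpTimeMs,
                                       long nowMs, long maxLagMs) {
        Set<Integer> isr = new TreeSet<>();
        isr.add(leaderId);
        for (Map.Entry<Integer, Long> follower : lastCaughtUpTimeMs.entrySet()) {
            long lagMs = nowMs - follower.getValue();
            // The check is about elapsed time, not about how many records are missing.
            if (lagMs <= maxLagMs) {
                isr.add(follower.getKey());
            }
        }
        return isr;
    }
}
```

With a 30000 ms limit, a follower that was last caught up 5 seconds ago stays in the ISR no matter how many records arrived since, while one that has not caught up for 40 seconds is dropped.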

Only replicas in the ISR are eligible to become leader during a clean election. This guarantees the new leader already holds every committed record.
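The eligibility rule can be sketched as follows. This is a simplified model, assuming the new leader is the first live ISR member in replica-list order; the names are hypothetical. The uncleanAllowed flag stands in for unclean.leader.election.enable and shows what changes when it is turned on.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical model of leader selection, not the controller's actual code.
final class LeaderElectionSketch {

    static Optional<Integer> electLeader(List<Integer> replicas, Set<Integer> isr,
                                         Set<Integer> liveBrokers, boolean uncleanAllowed) {
        // Clean election: only a live replica that is in the ISR may lead.
        for (int replica : replicas) {
            if (liveBrokers.contains(replica) && isr.contains(replica)) {
                return Optional.of(replica);
            }
        }
        // Unclean election: any live replica may lead, even one missing committed records.
        if (uncleanAllowed) {
            for (int replica : replicas) {
                if (liveBrokers.contains(replica)) {
                    return Optional.of(replica);
                }
            }
        }
        // No eligible replica: the partition stays offline until an ISR member returns.
        return Optional.empty();
    }
}
```

With replicas 1,2,3, an ISR of {1, 2}, and broker 1 down, the sketch elects broker 2. If the ISR had shrunk to {1} before broker 1 failed, a clean election finds no candidate, and only the unclean path would promote broker 2 at the risk of losing records.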

How acks and min.insync.replicas interact with the ISR

Durability is a contract between the producer and the broker, negotiated through two settings:

acks (producer) controls how many acknowledgements the producer waits for. With acks=all (a.k.a. acks=-1), the leader waits until all replicas currently in the ISR have replicated the record before acknowledging.
min.insync.replicas (broker/topic) sets the minimum ISR size required for an acks=all write to be accepted. If the ISR shrinks below this number, the leader rejects acks=all produce requests with NotEnoughReplicasException. Producers using acks=0 or acks=1 are not subject to this check.

These two settings provide a strong guarantee only together. acks=all alone is weak: if the ISR has collapsed to just the leader, “all” means one replica, and losing that one broker loses acknowledged data. min.insync.replicas=2 ensures that every acknowledged acks=all write exists on at least two in-sync copies.

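import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("acks", "all");                  // wait for the full ISR
props.put("enable.idempotence", "true");   // broker de-duplicates retried batches
props.put("retries", Integer.MAX_VALUE);   // keep retrying transient failures

// send() is asynchronous; closing the producer flushes any pending records.
try (Producer<String, String> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("orders", "order-42", "CONFIRMED"));
}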
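On the broker side, the way the two settings combine can be modeled as a small decision function. This is an illustrative sketch, not broker code: the enum and method names are hypothetical, and acks is passed as an integer with -1 standing for all. It relies on the fact that min.insync.replicas is only enforced for acks=all requests.

```java
// Hypothetical model of how a partition leader treats a produce request.
final class ProduceAckSketch {

    enum Outcome {
        NO_ACK_EXPECTED,            // acks=0: the producer does not wait for a response
        ACK_AFTER_LEADER_APPEND,    // acks=1: acknowledged once the leader has written it
        ACK_AFTER_FULL_ISR,         // acks=all: acknowledged once every ISR member has it
        REJECT_NOT_ENOUGH_REPLICAS  // acks=all with an ISR smaller than the minimum
    }

    static Outcome decide(int acks, int isrSize, int minInsyncReplicas) {
        if (acks == 0) {
            return Outcome.NO_ACK_EXPECTED;
        }
        if (acks == 1) {
            return Outcome.ACK_AFTER_LEADER_APPEND;
        }
        if (acks != -1) {
            throw new IllegalArgumentException("acks must be 0, 1, or -1 (all)");
        }
        // The minimum ISR size is only checked for acks=all requests.
        return isrSize >= minInsyncReplicas
                ? Outcome.ACK_AFTER_FULL_ISR
                : Outcome.REJECT_NOT_ENOUGH_REPLICAS;
    }
}
```

In this model, decide(-1, 1, 2) rejects the write because the ISR has collapsed to the leader, while decide(-1, 1, 1) acknowledges it with only one copy in existence. That second case is the weak spot that min.insync.replicas=2 is meant to close.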

Set min.insync.replicas on the topic so the same minimum applies to every acks=all producer that writes to it:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name orders \
  --add-config min.insync.replicas=2

In Spring for Apache Kafka, configure the producer side via application.yml:

spring:
  kafka:
    producer:
      acks: all
      properties:
        enable.idempotence: true

Replication concepts reference

Concept	What it means	Typical / default
Replication factor (RF)	Number of copies of each partition	3 in production
Leader	Replica that serves all writes and, by default, all reads	one per partition
Follower	Replica that fetches from the leader	RF − 1 per partition
ISR	Replicas caught up within the lag window	starts equal to all replicas
`replica.lag.time.max.ms`	Max time a follower may lag before ISR eviction	30000 ms
High watermark	Offset replicated to all ISR members; visible to consumers	tracked per partition
`acks=all`	Producer waits for the full ISR to replicate	recommended
`min.insync.replicas`	Minimum ISR size to accept a write	2 with RF=3
`unclean.leader.election.enable`	Allow out-of-sync replica to become leader	`false` (keep it off)

The RF=3, min.insync.replicas=2 sweet spot

The standard durable configuration is RF=3 with min.insync.replicas=2 and producers using acks=all. This combination tolerates the loss of one broker with no loss of acknowledged data and no sustained write outage: with one replica down, the ISR still has 2 members, which satisfies the minimum, so producers keep writing (after a brief leader election if the failed broker led the partition). Only when a second replica drops out does the partition stop accepting acks=all writes, deliberately choosing consistency over availability rather than silently risking data loss.

Avoid the RF=3, min.insync.replicas=3 trap. Setting the minimum equal to the replication factor means any single broker outage (or routine rolling restart) halts producers. Keeping the minimum one below RF gives you durability and the ability to lose a node gracefully.
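The arithmetic behind both configurations fits in a short sketch. It assumes every replica is in sync before the failures and that each failed broker removes exactly one replica from the ISR; the class and method names are hypothetical.

```java
// Hypothetical helper for reasoning about RF and min.insync.replicas.
final class FaultToleranceSketch {

    /** Number of replicas that can be lost while acks=all writes are still accepted. */
    static int maxFailuresWithWrites(int replicationFactor, int minInsyncReplicas) {
        validate(replicationFactor, minInsyncReplicas);
        return replicationFactor - minInsyncReplicas;
    }

    /** True if acks=all writes are still accepted after the given number of replica failures. */
    static boolean acceptsWrites(int replicationFactor, int minInsyncReplicas, int failedReplicas) {
        validate(replicationFactor, minInsyncReplicas);
        int remainingIsr = Math.max(0, replicationFactor - failedReplicas);
        return remainingIsr >= minInsyncReplicas;
    }

    private static void validate(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException("need 1 <= min.insync.replicas <= replication factor");
        }
    }
}
```

For RF=3 and a minimum of 2, maxFailuresWithWrites returns 1: one broker can be down for a failure or a rolling restart. For RF=3 and a minimum of 3 it returns 0, so any single outage blocks acks=all producers.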

Best Practices

Use RF=3 for all production topics; reserve RF=1 for throwaway or local-dev clusters only.
Pair acks=all (producer) with min.insync.replicas=2 (topic) — neither setting is sufficient alone.
Keep min.insync.replicas at most RF - 1 so you can lose a broker or perform rolling upgrades without blocking writes.
Leave unclean.leader.election.enable=false; allowing an out-of-sync replica to lead buys availability at the cost of silent data loss.
Enable idempotent producers (enable.idempotence=true) so that retries under acks=all do not write duplicate records within a producer session.
Monitor UnderReplicatedPartitions and IsrShrinksPerSec metrics — a persistently shrinking ISR signals a struggling broker, slow disks, or network issues.
Tune replica.lag.time.max.ms only deliberately; lowering it makes the ISR flap under load, raising it delays detection of stuck followers.