Durability & Replication Tuning
Durability in Kafka is not a single switch — it emerges from the interaction of four settings that live in different places: the topic’s replication factor, the broker/topic min.insync.replicas, the producer’s acks, and whether unclean leader election is allowed. Set any one of them wrong and the other three can quietly lull you into a false sense of safety. This page explains how these knobs combine, walks through what happens when brokers fail, and gives you the battle-tested durable configuration to copy.
The four knobs that define durability
Every committed record in Kafka lives on a partition that is replicated across a set of brokers. One replica is the leader; the rest are followers that fetch from it. The subset of replicas that are fully caught up is the in-sync replica set (ISR). Durability is decided by how many copies must exist before an acknowledgement is returned, and what Kafka is allowed to do when those copies disappear.
| Setting | Where it lives | What it controls |
|---|---|---|
| replication.factor | Topic | Total number of copies of each partition (the ceiling on durability) |
| min.insync.replicas | Broker default / topic override | Minimum ISR members required for an acks=all write to be accepted |
| acks | Producer | How many acknowledgements the producer waits for (0, 1, or all) |
| unclean.leader.election.enable | Broker / topic | Whether an out-of-sync replica may become leader (trades data for availability) |
The crucial insight: min.insync.replicas is only enforced when the producer uses acks=all. With acks=1 the leader acknowledges as soon as it writes to its own log, so a single broker failure right after the ack can lose the record. With acks=0 the producer does not wait at all.
Setting replication.factor=3 alone does not make you durable. If producers still send with acks=1, you can lose acknowledged data the instant the leader crashes before followers replicate. Durability requires the producer and the topic to agree.
How acks and min.insync.replicas work together
min.insync.replicas is a floor, not a target. It says: “refuse acks=all writes unless at least N replicas are currently in sync.” When the ISR shrinks below that floor, the leader rejects produce requests with NotEnoughReplicasException, forcing the producer to retry rather than accept a write that could be lost.
The standard durable triple is RF=3, min.insync.replicas=2, acks=all. This tolerates the loss of any single broker: with one broker down the ISR still has 2 members, which satisfies the floor, so writes continue. Every acknowledged record exists on at least 2 brokers, so a subsequent single failure cannot lose committed data.
Why not min.insync.replicas=3? With RF=3 and a floor of 3, losing any single broker immediately blocks all writes — you have traded availability for no extra durability. The floor should always be RF minus your tolerated simultaneous failures, typically RF - 1.
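The arithmetic behind the floor can be sketched in a few lines of Java. This is a toy model, not part of any Kafka API: the class name DurabilityFloor and both method names are made up for illustration, and it assumes all RF replicas start out in sync.

```java
/** Toy model of the acks=all durability floor; hypothetical, not a Kafka API. */
final class DurabilityFloor {

    /** Brokers that can fail while acks=all writes are still accepted. */
    static int toleratedFailures(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException("need 1 <= min.insync.replicas <= RF");
        }
        return replicationFactor - minInsyncReplicas;
    }

    /** The leader's check as described above: accept only if the ISR meets the floor. */
    static boolean acceptsAcksAll(int isrSize, int minInsyncReplicas) {
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        int rf = 3;
        // Compare every possible floor for RF=3.
        for (int minIsr = 1; minIsr <= rf; minIsr++) {
            System.out.printf("RF=%d min.insync.replicas=%d tolerates %d failure(s)%n",
                    rf, minIsr, toleratedFailures(rf, minIsr));
        }
    }
}
```

With RF=3, a floor of 2 tolerates one broker failure and a floor of 3 tolerates none, which is the trade-off described above.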
Failure scenarios walked through
Assume a partition with replicas on brokers B1 (leader), B2, B3, and the durable triple above.
ISR = {B1, B2, B3} acks=all requires >= 2 in sync
- One broker fails (B3 down): ISR shrinks to {B1, B2}. That still meets min.insync.replicas=2, so producers keep writing. Reads and writes continue uninterrupted.
- Leader fails (B1 down): A clean election promotes B2 or B3 (whichever is in the ISR) to leader. No data loss, because the new leader was fully caught up.
- Two brokers fail (B2, B3 down): ISR shrinks to {B1}, below the floor of 2. The leader now rejects acks=all writes with NotEnoughReplicasException. For acks=all producers the partition is effectively read-only until a replica catches up. This is the system protecting your data: better a paused producer than a silent loss.
- All ISR members lost, then a stale replica returns: With unclean.leader.election.enable=false, Kafka waits for an in-sync replica to come back. With it set to true, a lagging out-of-sync replica can be elected, discarding any records it never replicated. Keep it disabled unless availability genuinely outranks correctness.
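The scenarios above can be replayed with a small simulation. It is a deliberately simplified sketch: IsrScenarios, WriteResult, acksAllWrite, electLeader and setOf are hypothetical names, the ISR is modeled as a plain set of broker ids, and real controller behavior involves more state than this.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Simplified model of the failure scenarios; hypothetical, not Kafka code. */
final class IsrScenarios {
    enum WriteResult { ACCEPTED, NOT_ENOUGH_REPLICAS, NO_LEADER }

    /** Outcome of an acks=all produce request for a given leader and ISR. */
    static WriteResult acksAllWrite(String leader, Set<String> isr, int minInsyncReplicas) {
        if (leader == null) {
            return WriteResult.NO_LEADER;
        }
        return isr.size() >= minInsyncReplicas
                ? WriteResult.ACCEPTED
                : WriteResult.NOT_ENOUGH_REPLICAS;
    }

    /** Prefer a live ISR member; fall back to any live replica only if unclean election is allowed. */
    static String electLeader(Set<String> isr, Set<String> live, boolean uncleanAllowed) {
        for (String broker : isr) {
            if (live.contains(broker)) {
                return broker;
            }
        }
        if (uncleanAllowed) {
            for (String broker : live) {
                return broker; // out-of-sync replica: may be missing records
            }
        }
        return null; // partition stays offline until an ISR member returns
    }

    static Set<String> setOf(String... brokers) {
        return new LinkedHashSet<>(Arrays.asList(brokers));
    }

    public static void main(String[] args) {
        int floor = 2;
        System.out.println("B3 down:      " + acksAllWrite("B1", setOf("B1", "B2"), floor));
        System.out.println("B1 down:      new leader "
                + electLeader(setOf("B1", "B2", "B3"), setOf("B2", "B3"), false));
        System.out.println("B2, B3 down:  " + acksAllWrite("B1", setOf("B1"), floor));
        System.out.println("stale B3, unclean=false: leader "
                + electLeader(setOf("B1"), setOf("B3"), false));
        System.out.println("stale B3, unclean=true:  leader "
                + electLeader(setOf("B1"), setOf("B3"), true));
    }
}
```

The last two lines model the final scenario: the ISR still lists only B1, B1 is down, and B3 returns without being in sync. Whether B3 may lead depends entirely on the unclean election flag.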
The standard durable configuration
Set the durability floor at the broker level so new topics inherit it, and disable unclean election globally.
# server.properties (broker default)
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
Create or alter the topic with an explicit replication factor and override:
# New topic: set the replication factor and the ISR floor at creation time
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2

# Existing topic: add or update the per-topic overrides
kafka-configs.sh --alter \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --add-config min.insync.replicas=2,unclean.leader.election.enable=false
On the producer side, request full acknowledgement and enable idempotence. Idempotence requires acks=all, at most 5 in-flight requests per connection, and retries greater than 0; with those in place the broker can discard batches the producer resends, so retries do not create duplicates.
# application.yml — Spring for Apache Kafka
spring:
  kafka:
    producer:
      acks: all
      retries: 2147483647
      properties:
        enable.idempotence: true
        max.in.flight.requests.per.connection: 5
        delivery.timeout.ms: 120000
The equivalent with the plain client:
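import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

var props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
// Full acknowledgement plus idempotence: the durable producer settings.
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
// try-with-resources flushes and closes the producer on exit.
try (var producer = new KafkaProducer<String, String>(props)) {
    var record = new ProducerRecord<>("orders", "order-42", "{\"id\":42}");
    // get() blocks until the broker acknowledges the write or the send fails.
    var metadata = producer.send(record).get();
    System.out.printf("acked partition=%d offset=%d%n",
            metadata.partition(), metadata.offset());
}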
Output:
acked partition=7 offset=10584
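Before constructing a producer, it can help to check that a configuration satisfies the conditions idempotence depends on. The following is a minimal sketch using only java.util.Properties with string values. IdempotenceCheck and violations are hypothetical names, and the fallback values (acks=all, 5 in-flight requests, retries=Integer.MAX_VALUE) are assumed to match recent client defaults; verify them against the client version you run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical pre-flight check; the Kafka client performs its own validation. */
final class IdempotenceCheck {

    /** Returns a description of each setting that conflicts with idempotence. */
    static List<String> violations(Properties props) {
        List<String> problems = new ArrayList<>();

        // Assumed default: acks=all (older clients defaulted to 1).
        String acks = props.getProperty("acks", "all");
        if (!acks.equals("all") && !acks.equals("-1")) {
            problems.add("acks must be all (or -1), found " + acks);
        }

        // Idempotence preserves ordering only with at most 5 in-flight requests.
        int inFlight = Integer.parseInt(
                props.getProperty("max.in.flight.requests.per.connection", "5"));
        if (inFlight > 5) {
            problems.add("max.in.flight.requests.per.connection must be <= 5, found " + inFlight);
        }

        // Retries must be enabled for the producer to resend failed batches.
        int retries = Integer.parseInt(
                props.getProperty("retries", String.valueOf(Integer.MAX_VALUE)));
        if (retries <= 0) {
            problems.add("retries must be > 0, found " + retries);
        }
        return problems;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("acks", "1"); // deliberately wrong for the demo
        violations(props).forEach(System.out::println);
    }
}
```

An empty list means the three settings are compatible. Note that the check reads string properties only, so values stored with put(key, nonStringValue) would be ignored by getProperty.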
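If the ISR drops below the floor while sending and stays there until the producer stops retrying, the call surfaces the broker’s rejection (with send(...).get(), as the cause of an ExecutionException):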
Output:
org.apache.kafka.common.errors.NotEnoughReplicasException:
Messages are rejected since there are fewer in-sync replicas than required.
Handle this by retrying — the producer’s default retry behavior already does so until delivery.timeout.ms elapses, by which point a replica has usually rejoined the ISR.
Best Practices
- Use RF=3, min.insync.replicas=2, acks=all as the default for any topic carrying data you cannot regenerate.
- Set min.insync.replicas to RF - 1, never equal to RF; that only sacrifices availability without adding durability.
- Keep unclean.leader.election.enable=false in production; enable it only for topics where stale data beats unavailability.
- Enable producer idempotence so retries triggered by transient ISR shrinkage never create duplicates.
- Spread replicas across racks/availability zones with broker.rack so a single zone outage cannot drop the ISR below the floor.
- Treat NotEnoughReplicasException as backpressure, not a fatal error; let the producer retry while the cluster heals.
- Monitor UnderMinIsrPartitionCount and UnderReplicatedPartitions; alert before the ISR reaches the floor, not after writes start failing.
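The rack-spreading advice can be checked mechanically for a given replica assignment. The sketch below is an illustration under stated assumptions: RackSpreadCheck and survivesSingleRackOutage are hypothetical names, the broker-to-rack map is supplied by hand rather than read from a cluster, and every replica is assumed to be in sync before the outage.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: does one rack outage keep the ISR at or above the floor? */
final class RackSpreadCheck {

    /**
     * @param replicaToRack broker id of each replica mapped to its rack / zone
     * @return true if losing any single rack still leaves >= minInsyncReplicas replicas
     */
    static boolean survivesSingleRackOutage(Map<String, String> replicaToRack,
                                            int minInsyncReplicas) {
        // Count replicas per rack.
        Map<String, Integer> perRack = new HashMap<>();
        for (String rack : replicaToRack.values()) {
            perRack.merge(rack, 1, Integer::sum);
        }
        // Remove each rack in turn and compare the survivors to the floor.
        int total = replicaToRack.size();
        for (int lost : perRack.values()) {
            if (total - lost < minInsyncReplicas) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> spread = Map.of("B1", "az-a", "B2", "az-b", "B3", "az-c");
        Map<String, String> skewed = Map.of("B1", "az-a", "B2", "az-a", "B3", "az-b");
        System.out.println("spread: " + survivesSingleRackOutage(spread, 2));
        System.out.println("skewed: " + survivesSingleRackOutage(skewed, 2));
    }
}
```

With three replicas in three zones, any single zone outage leaves two replicas and the floor of 2 holds. With two replicas sharing a zone, losing that zone leaves one replica and acks=all writes would be rejected.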