Skip to content
Apache Kafka kf reliability 5 min read

Durability & Replication Tuning

Durability in Kafka is not a single switch — it emerges from the interaction of four settings that live in different places: the topic’s replication factor, the broker/topic min.insync.replicas, the producer’s acks, and whether unclean leader election is allowed. Set any one of them wrong and the other three can quietly lull you into a false sense of safety. This page explains how these knobs combine, walks through what happens when brokers fail, and gives you the battle-tested durable configuration to copy.

The four knobs that define durability

Every committed record in Kafka lives on a partition that is replicated across a set of brokers. One replica is the leader; the rest are followers that fetch from it. The subset of replicas that are fully caught up is the in-sync replica set (ISR). Durability is decided by how many copies must exist before an acknowledgement is returned, and what Kafka is allowed to do when those copies disappear.

SettingWhere it livesWhat it controls
replication.factorTopicTotal number of copies of each partition (the ceiling on durability)
min.insync.replicasBroker default / topic overrideMinimum ISR members required for an acks=all write to be accepted
acksProducerHow many acknowledgements the producer waits for (0, 1, or all)
unclean.leader.election.enableBroker / topicWhether an out-of-sync replica may become leader (trades data for availability)

The crucial insight: min.insync.replicas is only enforced when the producer uses acks=all. With acks=1 the leader acknowledges as soon as it writes to its own log, so a single broker failure right after the ack can lose the record. With acks=0 the producer does not wait at all.

Setting replication.factor=3 alone does not make you durable. If producers still send with acks=1, you can lose acknowledged data the instant the leader crashes before followers replicate. Durability requires the producer and the topic to agree.

How acks and min.insync.replicas work together

min.insync.replicas is a floor, not a target. It says: “refuse acks=all writes unless at least N replicas are currently in sync.” When the ISR shrinks below that floor, the leader rejects produce requests with NotEnoughReplicasException, forcing the producer to retry rather than accept a write that could be lost.

The standard durable triple is RF=3, min.insync.replicas=2, acks=all. This tolerates the loss of any single broker: with one broker down the ISR still has 2 members, which satisfies the floor, so writes continue. Every acknowledged record exists on at least 2 brokers, so a subsequent single failure cannot lose committed data.

Why not min.insync.replicas=3? With RF=3 and a floor of 3, losing any single broker immediately blocks all writes — you have traded availability for no extra durability. The floor should always be RF minus your tolerated simultaneous failures, typically RF - 1.

Failure scenarios walked through

Assume a partition with replicas on brokers B1 (leader), B2, B3, and the durable triple above.

ISR = {B1, B2, B3}   acks=all requires >= 2 in sync
  • One broker fails (B3 down): ISR shrinks to {B1, B2}. That still meets min.insync.replicas=2, so producers keep writing. Reads and writes continue uninterrupted.
  • Leader fails (B1 down): A clean election promotes B2 or B3 (whichever is in the ISR) to leader. No data loss, because the new leader was fully caught up.
  • Two brokers fail (B2, B3 down): ISR shrinks to {B1}, below the floor of 2. The leader now rejects acks=all writes with NotEnoughReplicasException. The partition becomes read-only until a replica catches up. This is the system protecting your data — better a paused producer than a silent loss.
  • All ISR members lost, then a stale replica returns: With unclean.leader.election.enable=false, Kafka waits for an in-sync replica to come back. With it set to true, a lagging out-of-sync replica can be elected, discarding any records it never replicated. Keep it disabled unless availability genuinely outranks correctness.

The standard durable configuration

Set the durability floor at the broker level so new topics inherit it, and disable unclean election globally.

# server.properties (broker default)
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

Create or alter the topic with an explicit replication factor and override:

kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2

kafka-configs.sh --alter \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --add-config min.insync.replicas=2,unclean.leader.election.enable=false

On the producer side, request full acknowledgement and enable idempotence (which also forces acks=all, bounded in-flight requests, and infinite retries, eliminating duplicates from retries).

# application.yml — Spring for Apache Kafka
spring:
  kafka:
    producer:
      acks: all
      retries: 2147483647
      properties:
        enable.idempotence: true
        max.in.flight.requests.per.connection: 5
        delivery.timeout.ms: 120000

The equivalent with the plain client:

var props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

try (var producer = new KafkaProducer<String, String>(props)) {
    var record = new ProducerRecord<>("orders", "order-42", "{\"id\":42}");
    var metadata = producer.send(record).get();
    System.out.printf("acked partition=%d offset=%d%n",
        metadata.partition(), metadata.offset());
}

Output:

acked partition=7 offset=10584

If the ISR drops below the floor while sending, the call surfaces the broker’s rejection:

Output:

org.apache.kafka.common.errors.NotEnoughReplicasException:
  Messages are rejected since there are fewer in-sync replicas than required.

Handle this by retrying — the producer’s default retry behavior already does so until delivery.timeout.ms elapses, by which point a replica has usually rejoined the ISR.

Best Practices

  • Use RF=3, min.insync.replicas=2, acks=all as the default for any topic carrying data you cannot regenerate.
  • Set min.insync.replicas to RF - 1, never equal to RF — that only sacrifices availability without adding durability.
  • Keep unclean.leader.election.enable=false in production; enable it only for topics where stale data beats unavailability.
  • Enable producer idempotence so retries triggered by transient ISR shrinkage never create duplicates.
  • Spread replicas across racks/availability zones with broker.rack so a single zone outage cannot drop the ISR below the floor.
  • Treat NotEnoughReplicasException as backpressure, not a fatal error — let the producer retry while the cluster heals.
  • Monitor UnderMinIsrPartitionCount and UnderReplicatedPartitions; alert before the ISR reaches the floor, not after writes start failing.
Last updated June 1, 2026
Was this helpful?