min.insync.replicas Explained

min.insync.replicas is the single most important durability knob in Kafka, yet it does nothing on its own; it only takes effect when producers write with acks=all. Together they form a contract: a write is acknowledged only once enough in-sync replicas have appended it to their logs (by default Kafka does not wait for an fsync to disk). Get this pairing wrong and you either silently lose data on broker failure or block all writes the moment one broker hiccups. This page explains exactly how the setting behaves and how to tune it for production.

What min.insync.replicas actually does

Every partition has a set of replicas. The subset that is caught up with the leader, the leader included, is called the in-sync replica set (ISR). min.insync.replicas defines the minimum size the ISR must have for a write to be accepted, so a value of 2 means the leader plus at least one follower.

The rule fires only for acks=all producers:

With acks=all, the leader waits for every replica currently in the ISR to acknowledge the record. Before appending, it checks the ISR size. If |ISR| < min.insync.replicas, the leader rejects the write immediately and the producer receives a NotEnoughReplicasException. If the ISR shrinks below the minimum after the leader has already appended the record, the producer receives a NotEnoughReplicasAfterAppendException instead.
With acks=1 or acks=0, min.insync.replicas is ignored entirely; the leader acknowledges without consulting the ISR. This is the most common misconfiguration: setting min.insync.replicas=2 while producers run with acks=1 (the client default before Kafka 3.0, and still a frequent explicit override) and assuming you are safe.

So durability comes from the combination acks=all plus min.insync.replicas >= 2. Neither half is sufficient alone.
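The rule above can be modelled in a few lines. This is a simplified sketch of the decision only, not broker code; the class, enum, and method names are hypothetical and chosen for illustration.

```java
public class MinIsrRule {

    /** Producer acknowledgement modes; ALL corresponds to acks=all. */
    public enum Acks { NONE, LEADER, ALL }

    /**
     * Simplified model of the leader-side check.
     * Returns true if the write would be accepted, false if it would be
     * rejected with a NotEnoughReplicas error.
     */
    public static boolean acceptsWrite(Acks acks, int isrSize, int minInsyncReplicas) {
        if (acks != Acks.ALL) {
            return true; // acks=0 and acks=1 never consult min.insync.replicas
        }
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(acceptsWrite(Acks.ALL, 3, 2));    // true
        System.out.println(acceptsWrite(Acks.ALL, 1, 2));    // false
        System.out.println(acceptsWrite(Acks.LEADER, 1, 2)); // true
    }
}
```

The last call shows the misconfiguration: with acks=1 the write is accepted even though only one replica is in sync.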

The availability vs. durability trade-off

The setting is a direct lever between two competing goals.

min.insync.replicas (RF=3)	Replicas a write is guaranteed on	Broker losses tolerated while still accepting writes	Risk
1	1	2	Acked data can vanish if the sole leader dies before replicating
2	2	1	Balanced — recommended default
3	3	0	One broker down or restarting blocks all writes to the partition

Set it too high (e.g. equal to the replication factor) and any single broker being down, restarting for a rolling upgrade, or lagging behind drops the ISR below the threshold, and producers stall with NotEnoughReplicas. Set it too low (1) and Kafka will happily acknowledge a write that exists on only the leader; if that leader fails before the followers catch up, the acknowledged record is lost.

The sweet spot for most clusters is replication factor 3 with min.insync.replicas=2. That tolerates the loss of any one broker while still guaranteeing every acknowledged write lives on at least two machines.

Tip: Always keep min.insync.replicas strictly less than the replication factor. With RF=3 and min.insync.replicas=3 you have zero headroom, so routine maintenance becomes an outage. The replication factor must give you at least one spare replica beyond the minimum.
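The arithmetic behind the table and the tip is simply replication factor minus min.insync.replicas. A minimal sketch, assuming hypothetical helper names and treating a minimum above the replication factor as invalid input (a choice made for this example):

```java
public class IsrHeadroom {

    /** Brokers hosting a partition that can be lost while acks=all writes are still accepted. */
    public static int toleratedLosses(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException(
                "need 1 <= min.insync.replicas <= replication factor");
        }
        return replicationFactor - minInsyncReplicas;
    }

    /** True when the pairing leaves at least one spare replica for maintenance. */
    public static boolean hasMaintenanceHeadroom(int replicationFactor, int minInsyncReplicas) {
        return toleratedLosses(replicationFactor, minInsyncReplicas) >= 1;
    }

    public static void main(String[] args) {
        // Reproduces the three RF=3 rows of the trade-off table.
        for (int minIsr = 1; minIsr <= 3; minIsr++) {
            System.out.printf("RF=3 min.insync.replicas=%d tolerates %d loss(es), headroom=%b%n",
                minIsr, toleratedLosses(3, minIsr), hasMaintenanceHeadroom(3, minIsr));
        }
    }
}
```

A check like this could run in a deployment pipeline to flag topics whose minimum equals their replication factor.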

Configuring it

min.insync.replicas is a broker/topic-level setting, while acks is a producer-level setting. Both ends must be configured.

Set it as a broker default:

# server.properties — cluster-wide default
min.insync.replicas=2
default.replication.factor=3

Override on an individual topic with the admin CLI:

kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic payments \
  --partitions 6 \
  --replication-factor 3 \
  --config min.insync.replicas=2

# Change it on an existing topic
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name payments \
  --alter --add-config min.insync.replicas=2

On the producer side you must opt into acks=all:

spring:
  kafka:
    producer:
      acks: all
      retries: 2147483647
      properties:
        enable.idempotence: true
        max.in.flight.requests.per.connection: 5
        delivery.timeout.ms: 120000
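For services that use the plain Java client rather than Spring, the same settings can be expressed as producer properties. The sketch below uses only java.util.Properties so it runs without Kafka on the classpath; the class name, the method name, and the String serializers are assumptions for illustration.

```java
import java.util.Properties;

public class DurableProducerConfig {

    /** Builds producer settings equivalent to the Spring YAML above. */
    public static Properties durableProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("acks", "all");
        props.put("enable.idempotence", "true");
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        props.put("max.in.flight.requests.per.connection", "5");
        props.put("delivery.timeout.ms", "120000");
        // Serializers are an assumption; pick the ones that match your key/value types.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = durableProducerProps("localhost:9092");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
        // With kafka-clients on the classpath: new KafkaProducer<String, String>(props)
    }
}
```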

Handling NotEnoughReplicas in application code

When the ISR shrinks, a strong-durability producer should fail loudly rather than silently downgrade. With the Spring KafkaTemplate (Spring for Apache Kafka 3.0 and later), the failure surfaces through the returned CompletableFuture:

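import java.util.concurrent.CompletableFuture;

import org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;
import org.springframework.stereotype.Service;

@Service
public class PaymentPublisher {

    private final KafkaTemplate<String, PaymentEvent> kafkaTemplate;

    public PaymentPublisher(KafkaTemplate<String, PaymentEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Return the future: an exception thrown inside a discarded callback
    // would never reach the caller.
    public CompletableFuture<SendResult<String, PaymentEvent>> publish(PaymentEvent event) {
        return kafkaTemplate.send("payments", event.id(), event)
            .exceptionally(ex -> {
                // Spring wraps the client error in KafkaProducerException, so check the cause.
                Throwable cause = ex.getCause() != null ? ex.getCause() : ex;
                if (cause instanceof NotEnoughReplicasException
                        || cause instanceof NotEnoughReplicasAfterAppendException) {
                    // ISR is below min.insync.replicas: do NOT treat as written.
                    throw new DurabilityException("Write rejected: insufficient in-sync replicas", ex);
                }
                throw new DurabilityException("Publish failed", ex);
            });
    }
}

public record PaymentEvent(String id, long amountCents, String currency) {}

// Application-defined unchecked exception (each top-level type goes in its own file).
public class DurabilityException extends RuntimeException {
    public DurabilityException(String message, Throwable cause) {
        super(message, cause);
    }
}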

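A transient NotEnoughReplicasException is retriable: the producer retries internally (per the settings above), and once a follower rejoins the ISR the write succeeds without application involvement. A failure reaches application code only when retries are exhausted or delivery.timeout.ms expires. The danger is treating the error as “probably fine” and dropping the message.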
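When checking for this error in a callback, keep in mind that it may arrive wrapped, for example in a CompletionException or in Spring's KafkaProducerException. A small helper that walks the cause chain makes the check robust to wrapping. The helper below is hypothetical and uses only the JDK; IllegalStateException stands in for NotEnoughReplicasException so the sketch runs without Kafka on the classpath.

```java
import java.util.Optional;
import java.util.concurrent.CompletionException;

public class CauseChain {

    /** Walks the cause chain and returns the first throwable of the given type, if any. */
    public static <T extends Throwable> Optional<T> findCause(Throwable error, Class<T> type) {
        int depth = 0;
        // The depth cap guards against a cyclic cause chain.
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            if (type.isInstance(t)) {
                return Optional.of(type.cast(t));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // IllegalStateException stands in for NotEnoughReplicasException here.
        Throwable wrapped = new CompletionException(
            new RuntimeException("Failed to send", new IllegalStateException("ISR too small")));
        System.out.println(findCause(wrapped, IllegalStateException.class).isPresent());    // true
        System.out.println(findCause(wrapped, IllegalArgumentException.class).isPresent()); // false
    }
}
```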

You can inspect the current ISR to confirm headroom before traffic ramps up:

kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic payments

Output:

Topic: payments  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
Topic: payments  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3

Here partition 1 has only two replicas in sync. With min.insync.replicas=2 it still accepts writes, but a second broker loss would block it.
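The same headroom check can be scripted. The sketch below parses describe lines in the format shown above and reports how many in-sync replicas each partition has beyond the minimum. The class and method names are hypothetical, and the parser assumes the Isr field is a non-empty comma-separated list of broker ids.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DescribeIsrParser {

    private static final Pattern ISR = Pattern.compile("Isr:\\s*([\\d,]+)");

    /** Counts the broker ids listed in the Isr field of one describe line. */
    public static int isrSize(String describeLine) {
        Matcher m = ISR.matcher(describeLine);
        if (!m.find()) {
            throw new IllegalArgumentException("no Isr field: " + describeLine);
        }
        return m.group(1).split(",").length;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Topic: payments  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3",
            "Topic: payments  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3"
        };
        int minIsr = 2; // the topic's min.insync.replicas
        for (String line : lines) {
            // Prints 1 for partition 0 and 0 for partition 1.
            System.out.println("spare in-sync replicas: " + (isrSize(line) - minIsr));
        }
    }
}
```

A spare count of 0 means the partition is still writable but one more replica dropping out of the ISR would block acks=all producers.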

Best Practices

Use RF=3, min.insync.replicas=2, acks=all as your default for any topic carrying data you cannot lose.
Never set min.insync.replicas equal to the replication factor — leave at least one replica of slack for rolling restarts and upgrades.
Remember that min.insync.replicas is inert without acks=all; audit producer configs, not just topic configs.
Enable producer idempotence (enable.idempotence=true) so retries after a transient NotEnoughReplicas do not create duplicates.
Set retries high (effectively unlimited) with a bounded delivery.timeout.ms so brief ISR dips self-heal instead of failing the request.
Treat NotEnoughReplicasException as a hard failure in code — surface it, alert on it, and never acknowledge the write to upstream callers.
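Monitor UnderReplicatedPartitions and IsrShrinksPerSec so you are warned before partitions become unwritable, and alert on UnderMinIsrPartitionCount, which counts partitions whose ISR has already fallen below min.insync.replicas.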