Leader Election

Every Kafka partition has exactly one leader replica that handles all reads and writes; the remaining replicas are followers that passively copy the leader’s log. When a broker fails, the partitions it led must be reassigned to surviving in-sync replicas, and that handoff is leader election. Understanding how Kafka picks the next leader — and the dangerous shortcut of unclean election — is essential because it directly governs your cluster’s availability and whether you can silently lose committed data during an outage.

Why leadership matters

By default, producers and consumers talk only to the leader of a partition. Followers exist for redundancy: they fetch records from the leader and try to stay caught up. The replicas that are sufficiently caught up form the in-sync replica (ISR) set. A record is considered committed only once every member of the ISR has it; the high watermark marks the end of the committed log, and consumers can read only records below it. Leader election is therefore the mechanism that keeps a partition writable when its current leader disappears, while trying to preserve those committed records.
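
To make the commit rule concrete, here is a minimal sketch that models the high watermark as the smallest log end offset among ISR members. The class and method names are hypothetical and the model is deliberately simplified; it is not Kafka's internal implementation.

```java
import java.util.List;
import java.util.Map;

/** Illustrative model only: the high watermark as the minimum log end offset across the ISR. */
final class HighWatermarkSketch {

    /**
     * @param logEndOffsets next offset each replica would write, keyed by broker id
     * @param isr           broker ids currently in the in-sync replica set
     * @return the first offset that is not yet present on every ISR member
     */
    static long highWatermark(Map<Integer, Long> logEndOffsets, List<Integer> isr) {
        if (isr.isEmpty()) {
            throw new IllegalArgumentException("ISR must contain at least the leader");
        }
        long hw = Long.MAX_VALUE;
        for (int brokerId : isr) {
            hw = Math.min(hw, logEndOffsets.get(brokerId));
        }
        return hw;
    }

    public static void main(String[] args) {
        // Broker 1 leads with 10 records, broker 2 has 8, broker 3 lags and is outside the ISR.
        Map<Integer, Long> logEndOffsets = Map.of(1, 10L, 2, 8L, 3, 4L);
        System.out.println(highWatermark(logEndOffsets, List.of(1, 2))); // prints 8
    }
}
```

Note that the lagging broker 3 does not hold the high watermark back, because it is not in the ISR. That is also why a replica outside the ISR is not a safe leadership candidate.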

What triggers an election

Elections happen automatically in several situations:

Broker failure or shutdown: the leader broker crashes, is killed, or is stopped for maintenance.
The leader becomes unreachable without shutting down: typically its disk or network stalls, it can no longer serve its followers or keep its session with the controller, and the controller treats it as failed.
Preferred leader rebalancing: Kafka periodically moves leadership back to the “preferred” replica to keep load even (see below).
Partition reassignment: leadership can move when you relocate replicas between brokers with the reassignment tool.

In all cases the active controller, a single elected node that owns cluster metadata, drives the decision.

How the controller elects a leader

The controller maintains the authoritative view of every partition’s replica list and ISR. When it detects that a partition has no usable leader, it scans that partition’s replica assignment list in order and selects the first replica that is currently in the ISR. It then updates the partition metadata, increments the leader epoch, and propagates the new leader to all brokers and clients via metadata updates.
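
The selection rule described above can be sketched in a few lines. This is a simplified model, not the controller's actual code: the class name, method name, and the explicit set of live brokers are assumptions made for illustration.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Illustrative model only: clean election picks the first assigned replica that is in the ISR. */
final class CleanElectionSketch {

    /**
     * @param assignment  replica list for the partition, in assignment order
     * @param isr         broker ids currently in the in-sync replica set
     * @param liveBrokers broker ids the controller currently considers alive (assumed input)
     * @return the new leader, or empty if no in-sync replica is available
     */
    static Optional<Integer> electLeader(List<Integer> assignment,
                                         Set<Integer> isr,
                                         Set<Integer> liveBrokers) {
        for (int brokerId : assignment) {
            if (isr.contains(brokerId) && liveBrokers.contains(brokerId)) {
                return Optional.of(brokerId);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Integer> assignment = List.of(1, 2, 3);
        // Broker 1 (the old leader) has died; brokers 2 and 3 are alive and in sync.
        System.out.println(electLeader(assignment, Set.of(2, 3), Set.of(2, 3))); // prints Optional[2]
        // Only broker 3 is alive, but it is not in the ISR: no clean candidate exists.
        System.out.println(electLeader(assignment, Set.of(2), Set.of(3)));       // prints Optional.empty
    }
}
```

The empty result in the second call corresponds to an offline partition: with clean election only, the partition waits for an in-sync replica to return.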

In KRaft mode (the modern, ZooKeeper-free architecture), this metadata lives in the internal __cluster_metadata log and the controller quorum agrees on every change through Raft. The election logic is the same in spirit as the legacy ZooKeeper-based controller, but the source of truth is the replicated metadata log rather than ZooKeeper znodes.

The leader epoch is a monotonically increasing number attached to each new leadership term. Followers and clients use it to reject stale leaders and detect log divergence, which prevents a recovering old leader from corrupting the log.
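
As a rough illustration of epoch-based fencing, the sketch below shows a replica comparing the epoch on an incoming request with the newest epoch it has seen. The class, its methods, and the verdict names are hypothetical; Kafka's real protocol carries more state than this.

```java
/** Illustrative model only: a replica tracks the newest leader epoch it has seen. */
final class LeaderEpochSketch {

    /** Hypothetical verdicts for this sketch; these are not Kafka's actual error codes. */
    enum Verdict { ACCEPT, STALE_EPOCH, UNKNOWN_NEWER_EPOCH }

    private int currentEpoch;

    LeaderEpochSketch(int initialEpoch) {
        this.currentEpoch = initialEpoch;
    }

    /** Called when a metadata update announces a new leadership term; the epoch never moves backward. */
    void onLeaderChange(int newEpoch) {
        if (newEpoch > currentEpoch) {
            currentEpoch = newEpoch;
        }
    }

    /** Compares the epoch carried by a request with the newest epoch this replica knows. */
    Verdict check(int requestEpoch) {
        if (requestEpoch < currentEpoch) {
            return Verdict.STALE_EPOCH;          // sender still believes in an old leadership term
        }
        if (requestEpoch > currentEpoch) {
            return Verdict.UNKNOWN_NEWER_EPOCH;  // this replica has not yet seen the newer term
        }
        return Verdict.ACCEPT;
    }

    public static void main(String[] args) {
        LeaderEpochSketch replica = new LeaderEpochSketch(4);
        replica.onLeaderChange(5);            // a new leader was elected
        System.out.println(replica.check(4)); // prints STALE_EPOCH
        System.out.println(replica.check(5)); // prints ACCEPT
    }
}
```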

Preferred leaders and auto rebalancing

The preferred leader is simply the first broker listed in a partition’s replica assignment. When you create a topic, Kafka spreads these preferred leaders evenly across brokers so write traffic is balanced. After a failover, leadership often piles up on a few brokers; rebalancing returns it to the preferred replicas once they are back in the ISR.

This is controlled at the broker level:

# Allow the controller to automatically move leadership
# back to preferred replicas
auto.leader.rebalance.enable=true

# How often (seconds) the controller checks for imbalance
leader.imbalance.check.interval.seconds=300

# Trigger a rebalance when more than this percent of the partitions
# a broker is preferred leader for are led by another replica
leader.imbalance.per.broker.percentage=10
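
To see how the percentage threshold plays out, the following sketch computes a per-broker imbalance figure. It assumes the ratio is taken over the partitions for which a broker is the preferred leader; the class, method, and map layout are hypothetical and only illustrate the arithmetic.

```java
import java.util.Map;

/** Illustrative model only: per-broker leader imbalance as a percentage. */
final class LeaderImbalanceSketch {

    /**
     * @param brokerId         broker to evaluate
     * @param preferredLeaders partition name -> preferred leader broker id
     * @param currentLeaders   partition name -> current leader broker id
     * @return percent of the broker's preferred partitions that it does not currently lead
     */
    static double imbalancePercent(int brokerId,
                                   Map<String, Integer> preferredLeaders,
                                   Map<String, Integer> currentLeaders) {
        int preferredCount = 0;
        int notLeading = 0;
        for (Map.Entry<String, Integer> entry : preferredLeaders.entrySet()) {
            if (entry.getValue() == brokerId) {
                preferredCount++;
                Integer leader = currentLeaders.get(entry.getKey());
                if (leader == null || leader != brokerId) {
                    notLeading++;
                }
            }
        }
        return preferredCount == 0 ? 0.0 : 100.0 * notLeading / preferredCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> preferred = Map.of(
                "orders-0", 1, "orders-1", 1, "payments-0", 1, "payments-1", 1);
        // After a failover, broker 2 still leads payments-1.
        Map<String, Integer> current = Map.of(
                "orders-0", 1, "orders-1", 1, "payments-0", 1, "payments-1", 2);
        System.out.println(imbalancePercent(1, preferred, current)); // prints 25.0
    }
}
```

With a threshold of 10, the 25 percent imbalance in this example would qualify broker 1 for a preferred-leader rebalance at the next check.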

You can also force a rebalance manually:

kafka-leader-election.sh \
  --bootstrap-server localhost:9092 \
  --election-type PREFERRED \
  --all-topic-partitions

Output:

Successfully completed leader election (PREFERRED) for partitions orders-0, orders-1, payments-0

Unclean leader election: availability vs durability

The most consequential setting in this area is unclean.leader.election.enable. Normally Kafka only elects a leader from the ISR. But what if every in-sync replica is down and only a stale, out-of-sync replica remains? You face a choice:

Setting	Behavior when no ISR is available	Trade-off
`unclean.leader.election.enable=false` (default)	Partition stays offline until an in-sync replica returns	Durability first — no committed data is lost, but the partition is unavailable
`unclean.leader.election.enable=true`	An out-of-sync replica is promoted to leader	Availability first: the partition recovers immediately, but any committed records the promoted replica never copied are permanently lost, and other replicas truncate their logs to match the new leader

The default is false, and for most production data you should keep it that way:

unclean.leader.election.enable=false

Enabling unclean election can cause silent data loss and offset truncation, which can confuse consumers that had already advanced past those offsets. Only enable it for topics where availability strictly outranks correctness — for example, ephemeral metrics or logs you can afford to drop.
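
The data loss is easy to picture with a toy model. The sketch below assumes the stale replica's log is a strict prefix of what the old leader had committed; the class and method names are hypothetical and the logs are plain lists rather than real partitions.

```java
import java.util.List;

/** Illustrative model only: which committed records vanish after an unclean election. */
final class UncleanElectionSketch {

    /**
     * @param oldLeaderCommitted records the failed leader had committed, in offset order
     * @param staleReplicaLog    records held by the out-of-sync replica being promoted
     *                           (assumed to be a prefix of oldLeaderCommitted)
     * @return the committed records that no longer exist once the stale replica leads
     */
    static List<String> lostRecords(List<String> oldLeaderCommitted, List<String> staleReplicaLog) {
        int kept = Math.min(staleReplicaLog.size(), oldLeaderCommitted.size());
        return List.copyOf(oldLeaderCommitted.subList(kept, oldLeaderCommitted.size()));
    }

    public static void main(String[] args) {
        List<String> committed = List.of("r0", "r1", "r2", "r3", "r4");
        List<String> stale = List.of("r0", "r1", "r2");
        // Offsets 3 and 4 were acknowledged to producers but are gone after promotion.
        System.out.println(lostRecords(committed, stale)); // prints [r3, r4]
    }
}
```

A consumer that had already read up to offset 5 now holds a position beyond the end of the new leader's log, which is the confusion described above.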

You can override it per topic for fine-grained control:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name metrics-firehose \
  --alter --add-config unclean.leader.election.enable=true

Observing elections from a client

Clients react to leadership changes transparently by refreshing metadata, but you can tune how aggressively a producer retries during the brief window when a partition has no leader:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
// Wait for all in-sync replicas to acknowledge; required
// if you want acks to mean "committed" across elections
props.put(ProducerConfig.ACKS_CONFIG, "all");
// Retry transparently while a new leader is being elected;
// delivery.timeout.ms bounds the total time spent retrying
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);

// close() at the end of try-with-resources waits for the pending send
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("orders", "order-42", "created"));
}

With acks=all and a healthy ISR, a clean election never loses an acknowledged record — the new leader already holds everything up to the high watermark.

Best Practices

Keep unclean.leader.election.enable=false for any topic carrying data you cannot afford to lose; only relax it deliberately per topic.
Set a replication factor of at least 3 and min.insync.replicas=2 so a single broker failure never leaves a partition without a safe leadership candidate.
Always produce with acks=all for durable topics so committed means “replicated to the full ISR” before and after an election.
Leave auto.leader.rebalance.enable=true so leadership returns to preferred replicas and load stays balanced after failovers.
Run preferred-leader rebalancing during low-traffic windows if your imbalance threshold is aggressive, since rebalancing briefly disrupts affected partitions.
Monitor UnderReplicatedPartitions, OfflinePartitionsCount, and the active controller count so you catch elections and shrinking ISRs before they become outages.
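
The replication factor and min.insync.replicas guidance above reduces to simple arithmetic, sketched here under the assumption that producers use acks=all. The class and method names are hypothetical.

```java
/** Illustrative arithmetic only: how many broker failures a partition absorbs with acks=all. */
final class ReplicationMathSketch {

    /** Number of replicas that can be lost while acks=all writes are still accepted. */
    static int writeFailureTolerance(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException("need 1 <= min.insync.replicas <= replication factor");
        }
        return replicationFactor - minInsyncReplicas;
    }

    /** True if the current ISR is large enough for an acks=all write to be accepted. */
    static boolean acceptsWrites(int isrSize, int minInsyncReplicas) {
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(writeFailureTolerance(3, 2)); // prints 1
        System.out.println(acceptsWrites(2, 2));         // prints true
        System.out.println(acceptsWrites(1, 2));         // prints false
    }
}
```

With a replication factor of 3 and min.insync.replicas=2, one broker can fail and the partition still has two in-sync replicas, so there is both a safe leader candidate and enough replicas to keep accepting writes.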