The Controller & Cluster Metadata

Every Kafka cluster needs a single source of truth that tracks which brokers are alive, which broker leads each partition, and where replicas live. That job belongs to the controller — the brain of the cluster. When a broker dies, the controller is what notices, elects new partition leaders, and propagates the updated metadata so producers and consumers keep working. Understanding the controller is essential for reasoning about availability, failover latency, and why modern Kafka abandoned ZooKeeper in favor of KRaft.

What the controller does

The controller is a specially designated role within the cluster responsible for cluster-wide coordination. At any moment exactly one node holds the active controller role. Its core responsibilities are:

Broker membership — tracking which brokers are registered, alive, and fenced. Each broker sends periodic heartbeats; when they stop arriving, the controller fences that broker and treats it as failed.
Leader election — for every partition, choosing which replica is the leader. When a leader’s broker goes down, the controller picks a new leader from the in-sync replica set (ISR).
Partition reassignment — driving the movement of replicas between brokers when you rebalance load or add/remove brokers.
ISR management — recording shrink and expand events for each partition’s in-sync replica set.
Metadata propagation — distributing the authoritative cluster state (topics, partitions, leaders, ISRs) to every broker so they can answer client metadata requests.
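The leader-election responsibility above can be illustrated with a small sketch. This is not Kafka's controller code: it assumes a simplified rule (take the first replica in assignment order that is both in the ISR and on a live broker), and the names elect_leader, replicas, isr, and live_brokers are hypothetical.

```python
from typing import Optional, Sequence, Set


def elect_leader(replicas: Sequence[int], isr: Set[int], live_brokers: Set[int]) -> Optional[int]:
    """Return the first replica (in assignment order) that is in the ISR and alive.

    Returns None when no replica qualifies, meaning the partition has no leader.
    """
    for broker_id in replicas:
        if broker_id in isr and broker_id in live_brokers:
            return broker_id
    return None


# A partition replicated on brokers 1, 2, 3; broker 3 has fallen out of the ISR.
replicas = [1, 2, 3]
isr = {1, 2}

print(elect_leader(replicas, isr, live_brokers={1, 2, 3}))  # 1
print(elect_leader(replicas, isr, live_brokers={2, 3}))     # 2
print(elect_leader(replicas, isr, live_brokers={3}))        # None
```

In the last call broker 3 is alive but not in sync, so under this simplified rule the partition stays without a leader rather than risk losing data.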

Clients never talk to the controller directly. They fetch metadata from any broker, and that broker answers from its locally cached copy of the controller-maintained state.

How metadata flows to clients

                 +-------------------+
                 | Active Controller |
                 +---------+---------+
                           | metadata updates
            +--------------+--------------+
            v              v              v
       +--------+     +--------+     +--------+
       |Broker 1|     |Broker 2|     |Broker 3|
       +---+----+     +--------+     +--------+
           ^ metadata request (leaders, ISR)
           |
       +---+----+
       | Client |  produce/consume to partition leaders
       +--------+

A producer that wants to write to orders-3 first asks a broker, “who leads partition 3 of orders?” The broker replies from its cached metadata, and the producer connects directly to that leader. If leadership has moved since the producer last fetched metadata, the broker it contacts rejects the request with NOT_LEADER_OR_FOLLOWER; the client then refreshes its metadata and retries against the new leader.
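The refresh-and-retry behaviour can be sketched with an in-memory simulation. Everything here is hypothetical (FakeCluster, FakeProducer, and the NotLeaderOrFollower exception are stand-ins invented for illustration, not the Kafka client API); the point is only to show a stale cached leader being corrected after one failed attempt.

```python
class NotLeaderOrFollower(Exception):
    """Stand-in for the NOT_LEADER_OR_FOLLOWER error code."""


class FakeCluster:
    """Hypothetical cluster that knows the true leader of each partition."""

    def __init__(self, leaders):
        self.leaders = dict(leaders)

    def fetch_metadata(self):
        # Any broker could answer this from its cached metadata.
        return dict(self.leaders)

    def produce(self, broker_id, partition, record):
        if self.leaders[partition] != broker_id:
            raise NotLeaderOrFollower(partition)
        return f"{record!r} appended to {partition} on broker {broker_id}"


class FakeProducer:
    def __init__(self, cluster):
        self.cluster = cluster
        self.metadata = cluster.fetch_metadata()
        self.refreshes = 0

    def send(self, partition, record, max_retries=3):
        for _ in range(max_retries + 1):
            leader = self.metadata[partition]
            try:
                return self.cluster.produce(leader, partition, record)
            except NotLeaderOrFollower:
                # The cached leader is stale: refresh metadata, then retry.
                self.metadata = self.cluster.fetch_metadata()
                self.refreshes += 1
        raise RuntimeError("leader still unknown after retries")


cluster = FakeCluster({"orders-3": 1})
producer = FakeProducer(cluster)
first_result = producer.send("orders-3", "a")

cluster.leaders["orders-3"] = 2  # the controller moves leadership to broker 2
second_result = producer.send("orders-3", "b")  # one failure, one refresh, then success

print(first_result)
print(second_result)
print(producer.refreshes)  # 1
```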

The legacy ZooKeeper controller

Historically Kafka stored all cluster metadata in ZooKeeper, an external coordination service. One broker was elected controller by racing to create an ephemeral /controller znode in ZooKeeper. Whichever broker won held the role; if it crashed, the znode disappeared and the remaining brokers raced again.
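As a rough illustration of that race, the sketch below models ephemeral znodes with an in-memory dictionary. FakeZooKeeper and run_election are hypothetical names, no real ZooKeeper client is involved, and the "race" is reduced to the order in which brokers attempt the create.

```python
class FakeZooKeeper:
    """Hypothetical in-memory stand-in for ZooKeeper's ephemeral znodes."""

    def __init__(self):
        self.znodes = {}  # path -> id of the broker session that owns it

    def create_ephemeral(self, path, owner):
        """Create path for owner; return False if the path already exists."""
        if path in self.znodes:
            return False
        self.znodes[path] = owner
        return True

    def expire_session(self, owner):
        """Remove every ephemeral znode owned by a dead session."""
        self.znodes = {p: o for p, o in self.znodes.items() if o != owner}


def run_election(zk, brokers):
    """Each broker tries to create /controller; only the first attempt succeeds."""
    for broker_id in brokers:
        zk.create_ephemeral("/controller", broker_id)
    return zk.znodes["/controller"]


zk = FakeZooKeeper()
first = run_election(zk, [2, 1, 3])  # broker 2 gets there first
zk.expire_session(first)             # broker 2 crashes, its znode disappears
second = run_election(zk, [1, 3])    # the survivors race again

print(first, second)  # 2 1
```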

This design worked but had real limitations:

Metadata changes were written to ZooKeeper and then pushed to brokers one RPC at a time, so large clusters suffered slow controller failover and slow startup.
The number of partitions a cluster could support was effectively bounded by how fast the controller could reload state from ZooKeeper.
Operators had to run, secure, and tune a second distributed system alongside Kafka.

Note: ZooKeeper mode was deprecated during the 3.x releases, and Kafka 4.0 removed it entirely. From 4.0 onward, every cluster runs in KRaft mode. Treat the ZooKeeper controller as background context, not something to deploy today.

The KRaft controller quorum

KRaft (Kafka Raft) replaces ZooKeeper with a built-in Raft-based consensus protocol. Instead of one elected broker plus an external service, you run a small set of dedicated controller nodes that form a quorum and replicate metadata among themselves as a log, stored in the internal topic __cluster_metadata.

Aspect	ZooKeeper controller	KRaft controller quorum
Metadata store	External ZooKeeper ensemble	Internal `__cluster_metadata` log
Controller count	1 active broker	Odd quorum (typically 3 or 5)
Election	Race for ephemeral znode	Raft leader election
Failover	Reload full state from ZK	Replay tail of replicated log
Operational footprint	Kafka + ZooKeeper	Kafka only

In KRaft, metadata is itself an event log. The active controller is the Raft leader; followers replicate its log. Because every controller already has the full metadata log on disk, failover is fast — a new leader resumes from the log tail rather than reloading everything. Brokers consume this metadata log just like any other Kafka log, applying changes incrementally.
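The idea of metadata as an event log that is applied incrementally can be sketched as follows. The record shapes and the MetadataImage class are assumptions made for illustration; the real __cluster_metadata records are richer and are not modelled here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

# Hypothetical record shape: (kind, partition, value).
Record = Tuple[str, str, Any]


@dataclass
class MetadataImage:
    """In-memory view built by applying metadata records in offset order."""

    leaders: Dict[str, int] = field(default_factory=dict)
    isrs: Dict[str, List[int]] = field(default_factory=dict)
    next_offset: int = 0  # offset of the next record to apply

    def catch_up(self, log: List[Record]) -> int:
        """Apply every record not yet seen and return how many were applied."""
        applied = 0
        for kind, partition, value in log[self.next_offset:]:
            if kind == "leader":
                self.leaders[partition] = value
            elif kind == "isr":
                self.isrs[partition] = list(value)
            applied += 1
        self.next_offset += applied
        return applied


log: List[Record] = [
    ("leader", "orders-3", 1),
    ("isr", "orders-3", [1, 2, 3]),
]
view = MetadataImage()
first_batch = view.catch_up(log)   # replays the whole log once

# Broker 1 fails: the controller appends two new records.
log.append(("isr", "orders-3", [2, 3]))
log.append(("leader", "orders-3", 2))
second_batch = view.catch_up(log)  # applies only the new tail

print(first_batch, second_batch, view.leaders, view.isrs)
```

The second call touches only the two new records, which mirrors why a node that already holds the log can resume from the tail instead of reloading everything.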

A minimal dedicated controller is configured like this:

# controller.properties
process.roles=controller
node.id=1
controller.quorum.voters=1@host1:9093,2@host2:9093,3@host3:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://:9093
metadata.log.dir=/var/lib/kafka/metadata

You can inspect the live quorum at any time by querying a broker (the command assumes one is listening on host1:9092):

kafka-metadata-quorum.sh --bootstrap-server host1:9092 describe --status

Output:

ClusterId:              4L6g3nShT-eMCtK--X86sw
LeaderId:               1
LeaderEpoch:            14
HighWatermark:          1029
CurrentVoters:          [1,2,3]
CurrentObservers:       [4,5,6]

Here node 1 is the active controller (the Raft leader), nodes 2 and 3 are voter followers, and the brokers (4, 5, 6) are observers that consume metadata without voting.
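If you want to check this output from a script, a small parser is enough. The sketch below assumes the "Key: value" layout of the sample above; real output may contain additional fields and can differ between Kafka versions, and parse_quorum_status is a hypothetical helper, not part of Kafka.

```python
def parse_quorum_status(text):
    """Parse 'Key: value' lines into a dict; bracketed lists become lists of ints."""
    status = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            inner = value[1:-1].strip()
            status[key.strip()] = [int(x) for x in inner.split(",")] if inner else []
        else:
            status[key.strip()] = value
    return status


sample = """\
ClusterId:              4L6g3nShT-eMCtK--X86sw
LeaderId:               1
LeaderEpoch:            14
HighWatermark:          1029
CurrentVoters:          [1,2,3]
CurrentObservers:       [4,5,6]
"""

status = parse_quorum_status(sample)
leader_is_voter = int(status["LeaderId"]) in status["CurrentVoters"]

print(status["LeaderId"], status["CurrentVoters"], leader_is_voter)
```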

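Tip: Run an odd number of controllers. Three tolerate one failure and five tolerate two. An even count adds no fault tolerance over the odd count just below it while raising the majority needed to commit: four controllers still survive only one failure but need three votes instead of two.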
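The arithmetic behind this tip is short enough to check directly. The helper names majority and tolerated_failures are made up for this sketch; the only assumption is that a commit needs a strict majority of voters.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can fail while a majority is still reachable."""
    return voters - majority(voters)


for n in (3, 4, 5, 6):
    print(n, majority(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
# 6 4 2
```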

Best practices

Use dedicated controller nodes (process.roles=controller) in production rather than combined mode, so metadata coordination is isolated from data traffic.
Run 3 or 5 controllers, never an even number, and spread them across failure domains (racks/zones).
Put metadata.log.dir on fast, durable storage — controller throughput depends on how quickly the metadata log can be appended and replayed.
Monitor ActiveControllerCount (should be exactly 1 cluster-wide) and metadata commit latency to catch quorum problems early.
Size client metadata.max.age.ms sensibly so clients refresh leadership info promptly after a failover without hammering brokers.
If you still operate a ZooKeeper cluster, plan a migration to KRaft before upgrading past Kafka 3.x, since ZooKeeper support is removed in 4.0.