Navigation

Apache Kafka interview 6 min read

Advanced Kafka Interview Questions

Senior Kafka interviews move past “what is a topic” and into the mechanics that decide whether a pipeline loses, duplicates, or reorders data under failure. The questions below probe exactly-once semantics, the transaction protocol, KRaft internals, consumer group rebalancing, Kafka Streams state, and the tuning knobs that matter at scale. Each answer is short enough to say out loud but precise enough to survive follow-ups.

Exactly-once and transactions

What does idempotent producer actually guarantee?

With enable.idempotence=true, the broker deduplicates retries from a single producer session. Each producer gets a Producer ID (PID) and stamps every record batch with a monotonically increasing sequence number per partition. The broker rejects a batch whose sequence is a duplicate or out of order, so retries do not create duplicates. It guarantees exactly-once delivery to a partition for one producer session — not across producer restarts and not across multiple partitions atomically.

Idempotence alone does not give you exactly-once processing. It removes producer-retry duplicates; it does nothing about consumer reprocessing after a crash.

How do Kafka transactions achieve exactly-once across read-process-write?

Transactions tie the producer’s writes and the consumer’s offset commits into one atomic unit. The producer is configured with a stable transactional.id, calls initTransactions(), then wraps work in beginTransaction() / commitTransaction(). Crucially, offsets are committed via the producer using sendOffsetsToTransaction, so input progress and output records commit or abort together.

producer.initTransactions();
while (true) {
    var records = consumer.poll(Duration.ofMillis(200));
    producer.beginTransaction();
    try {
        for (var record : records) {
            producer.send(new ProducerRecord<>("out", record.key(), transform(record.value())));
        }
        var offsets = currentOffsets(records);
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();
    } catch (KafkaException e) {
        producer.abortTransaction();
    }
}

What is the role of the transaction coordinator and LSO?

Each transactional.id maps to a transaction coordinator (a broker) backed by the __transaction_state log. The coordinator writes commit/abort markers to every partition the transaction touched. Consumers with isolation.level=read_committed only read up to the Last Stable Offset (LSO) — the offset before the earliest open transaction — so they never see uncommitted or aborted records. With read_uncommitted, the consumer reads up to the high watermark and may see records that later abort.

What problem does transactional.id fencing solve?

If a producer hangs and a new instance starts with the same transactional.id, the new instance bumps the producer epoch during initTransactions(). The coordinator fences the old (zombie) producer: any write or commit from the stale epoch is rejected with ProducerFencedException. This prevents a partitioned-off zombie from corrupting an in-progress transaction.

KRaft and the broker internals

KRaft vs ZooKeeper — what changed?

KRaft replaces ZooKeeper with an internal Raft quorum that stores metadata in the __cluster_metadata log.

Aspect	ZooKeeper mode	KRaft mode
Metadata store	External ZK ensemble	Internal `__cluster_metadata` log
Controller	One elected broker	Quorum of controller nodes
Failover	ZK watches, slow reload	Raft, near-instant
Scale ceiling	~tens of thousands partitions	Millions of partitions
Status	Removed in 4.0	Default and only mode

The big wins are faster controller failover, far higher partition counts, and a single system to operate. As of Kafka 4.0, ZooKeeper is gone entirely.

What is the high watermark, and how does it differ from LEO?

The Log End Offset (LEO) is the offset of the next record to be appended on a replica. The high watermark (HW) is the highest offset replicated to all in-sync replicas (ISR); consumers can only read up to the HW. A record above the HW exists on the leader but is not yet “committed,” so it is invisible and can still be lost if the leader fails before replication completes.

How does `min.insync.replicas` interact with `acks=all`?

acks=all means the leader waits for all current ISR members to acknowledge. min.insync.replicas sets the floor on ISR size for a write to succeed. With replication factor 3 and min.insync.replicas=2, you can lose one replica and still write; if ISR drops to 1, the producer gets NotEnoughReplicasException rather than silently writing under-replicated data.

Rebalancing and consumer groups

Eager vs cooperative (incremental) rebalancing?

Eager rebalancing revokes all partitions from all members, then reassigns — a stop-the-world pause. Cooperative rebalancing (CooperativeStickyAssignor) revokes only the partitions that must move, so unaffected consumers keep processing. Use the cooperative assignor in production to slash rebalance downtime.

What is static membership and why use it?

Setting group.instance.id makes a consumer a static member. On a brief restart, the broker recognizes the same instance ID within session.timeout.ms and does not trigger a rebalance, avoiding churn during rolling deploys.

Kafka Streams

How does Kafka Streams store state and survive failures?

Stateful operators (aggregations, joins, windows) keep state in local RocksDB stores. Every update is also written to a compacted changelog topic in Kafka. If a task moves to another instance, the new instance rebuilds state by replaying the changelog. Standby replicas (num.standby.replicas) keep warm copies to shorten recovery.

How does exactly-once work in Kafka Streams?

Set processing.guarantee=exactly_once_v2. Streams uses transactions under the hood — atomically committing output records, state-store changelog updates, and consumer offsets in one transaction per commit interval. v2 shares a single producer per instance across tasks, drastically reducing the producer overhead of the original EOS implementation.

processing.guarantee=exactly_once_v2
commit.interval.ms=100
num.standby.replicas=1

Performance and tuning

Which producer settings trade throughput for latency?

linger.ms waits to batch records; batch.size caps batch bytes; compression.type (e.g. lz4, zstd) shrinks payloads. Raising linger.ms and batch.size increases throughput and efficiency at the cost of a little latency.

How do you size partitions for throughput?

Target throughput divided by per-partition throughput gives a partition count, bounded by the maximum of producer and consumer needs. More partitions raise parallelism but add open file handles, more leader elections, and longer rebalances. Measure first; over-partitioning is a common, hard-to-reverse mistake.

A consumer lag keeps growing despite low CPU — what do you check?

Look for a slow downstream call inside poll-loop processing, max.poll.records too high causing max.poll.interval.ms breaches and rebalances, uneven partition assignment (hot keys), or insufficient consumer instances relative to partitions. Confirm via kafka-consumer-groups --describe.

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group orders-service --describe

Output:

GROUP          TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID
orders-service orders  0          10342           10999           657   consumer-1
orders-service orders  1          9871            9890            19    consumer-2

Why can ordering break even with a single partition?

With max.in.flight.requests.per.connection > 1 and retries enabled but idempotence disabled, a retried earlier batch can land after a later one, reordering records. Idempotence preserves order for up to 5 in-flight requests; without it, set in-flight to 1 to keep strict order.

Best Practices

Prefer exactly_once_v2 over manual transactions in Streams; reserve hand-rolled transactions for custom read-process-write loops.
Always pair acks=all with min.insync.replicas >= 2 and replication factor 3 for durability without write stalls.
Use a stable, unique transactional.id per logical producer to enable zombie fencing.
Run the CooperativeStickyAssignor and static membership to make deploys and scaling nearly rebalance-free.
Enable idempotence by default; it is cheap insurance against retry duplicates and ordering bugs.
Standardize on KRaft and keep read_committed for any consumer downstream of transactional producers.
Load-test partition counts before committing; you cannot easily reduce partitions later.