Kafka Interview Questions

Apache Kafka interviews probe whether you understand the trade-offs behind a distributed log, not just the API surface. The strongest answers tie every claim back to concrete settings — acknowledgements, replication, and offset commits — because those knobs decide durability and throughput in production. This page groups the most commonly asked questions by theme so you can rehearse crisp, technically precise answers. Treat each answer as a starting point you can expand with a real war story.

Fundamentals

What is Apache Kafka and what problem does it solve?

Kafka is a distributed, append-only commit log used as a durable, high-throughput messaging and event-streaming platform. Producers write records to topics, brokers persist them on disk and replicate them, and consumers read them independently at their own pace. It decouples systems and lets many consumers replay the same data, which a traditional point-to-point queue cannot do cheaply.

What is a topic, partition, and offset?

A topic is a named stream of records. Each topic is split into partitions, which are the unit of parallelism and ordering. Within a partition every record gets a monotonically increasing offset, a long that uniquely identifies the record’s position. Consumers track which offset they have processed.

Does Kafka still require ZooKeeper?

No. Modern Kafka runs in KRaft mode, where a quorum of controllers manages cluster metadata using a Raft-based protocol. ZooKeeper mode was removed in Kafka 4.0; mention it only as legacy if asked about older clusters.

How does Kafka achieve durability?

Records are written to the partition leader and replicated to follower brokers. The topic's replication factor controls how many copies exist, and min.insync.replicas sets the minimum number of in-sync replicas (leader included) that must hold a write before a producer using acks=all receives an acknowledgement.

Partitions and ordering

What ordering guarantees does Kafka provide?

Kafka guarantees ordering only within a single partition, never across an entire topic. Records with the same key are routed to the same partition, so keying by entity (for example, orderId) preserves per-entity order.

How does the producer decide which partition a record goes to?

If you supply a key, the default partitioner hashes it (murmur2) modulo the partition count. With no key, the sticky partitioner batches records to one partition until the batch fills, then rotates, improving throughput.

What happens to ordering if you increase partition count?

Adding partitions changes the key-to-partition mapping for new records, so a key's new records may land on a different partition than its older ones, and per-key ordering across the change is no longer guaranteed. Existing data is not repartitioned. Plan partition counts up front for keyed streams.

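// Keying by orderId routes every event for one order to the same partition,
// which preserves per-order ordering.
ProducerRecord<String, OrderEvent> record =
    new ProducerRecord<>("orders", order.orderId(), order);
producer.send(record);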
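
The remapping described above can be seen with a small model of hash-modulo partitioning. This is only a sketch under stated assumptions: `partitionFor` and `movedKeys` are hypothetical helpers, and `String.hashCode()` stands in for Kafka's murmur2, so the partition numbers it prints will not match a real cluster. Only the modulo behavior carries over.

```java
import java.util.List;

final class KeyPartitioningSketch {
    // Stand-in for the default partitioner: hash the key, clear the sign bit, take the modulo.
    // Assumption: String.hashCode() replaces Kafka's murmur2, so real partition numbers differ.
    static int partitionFor(String key, int partitionCount) {
        return (key.hashCode() & 0x7fffffff) % partitionCount;
    }

    // Counts keys whose partition changes when the partition count goes from before to after.
    static long movedKeys(List<String> keys, int before, int after) {
        return keys.stream()
                .filter(k -> partitionFor(k, before) != partitionFor(k, after))
                .count();
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "c", "d"); // hypothetical keys
        for (String key : keys) {
            System.out.printf("%s: 3 partitions -> %d, 4 partitions -> %d%n",
                    key, partitionFor(key, 3), partitionFor(key, 4));
        }
        System.out.println("keys moved: " + movedKeys(keys, 3, 4));
    }
}
```

Any key that moves will have its older records on one partition and its newer records on another, which is why the partition count for a keyed stream is best fixed early.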

Delivery semantics

What are the three delivery semantics?

Semantic	Meaning	How to get it
At-most-once	May lose records, never duplicates	Commit offset before processing
At-least-once	Never loses, may duplicate	Commit offset after processing (default)
Exactly-once	No loss, no duplicates	Idempotent producer + transactions
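
The difference between the first two rows comes down to where the offset commit sits relative to processing. The simulation below is a rough sketch, not Kafka client code: `consume` and `runWithOneCrash` are hypothetical helpers that model a consumer crashing between the two steps for one record and then restarting from the committed offset.

```java
import java.util.ArrayList;
import java.util.List;

final class DeliverySemanticsSketch {
    // Consumes offsets [start, end) and appends each processed offset to processed.
    // If crashAt matches an offset, the consumer dies between its two steps for that record.
    // Returns the committed offset, i.e. where a restarted consumer resumes.
    static int consume(int start, int end, int crashAt,
                       boolean commitBeforeProcessing, List<Integer> processed) {
        int committed = start;
        for (int offset = start; offset < end; offset++) {
            if (commitBeforeProcessing) {
                committed = offset + 1;          // commit first
                if (offset == crashAt) break;    // crash: record is never processed
                processed.add(offset);
            } else {
                processed.add(offset);           // process first
                if (offset == crashAt) break;    // crash: offset is never committed
                committed = offset + 1;
            }
        }
        return committed;
    }

    // Runs once with a crash, then restarts from the committed offset without a crash.
    static List<Integer> runWithOneCrash(int end, int crashAt, boolean commitBeforeProcessing) {
        List<Integer> processed = new ArrayList<>();
        int resumeFrom = consume(0, end, crashAt, commitBeforeProcessing, processed);
        consume(resumeFrom, end, -1, commitBeforeProcessing, processed);
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("commit before processing: " + runWithOneCrash(5, 2, true));
        System.out.println("commit after processing:  " + runWithOneCrash(5, 2, false));
    }
}
```

With five records and a crash at offset 2, committing first skips offset 2 entirely (at-most-once), while committing afterwards processes offset 2 twice (at-least-once).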

How do you enable exactly-once processing?

Set enable.idempotence=true on the producer (default in modern clients), use transactions with a transactional.id, and have consumers read with isolation.level=read_committed. For read-process-write loops, use sendOffsetsToTransaction so offsets and output commit atomically.

enable.idempotence=true
acks=all
transactional.id=order-processor-1

What does the `acks` setting control?

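It controls how many acknowledgements the producer waits for before treating a send as successful: acks=0 (fire and forget), acks=1 (leader only), and acks=all (the leader waits until every in-sync replica has the record). For durability, use acks=all with min.insync.replicas=2 on a topic with a replication factor of 3, so one broker can fail without blocking writes or losing acknowledged data.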

Tip: acks=all alone is not enough. If min.insync.replicas=1, the ISR can shrink to the leader alone and the leader will still acknowledge writes; if that broker is then lost before a follower catches up, acknowledged data is lost. Always pair them.

Consumer groups

What is a consumer group?

A consumer group is a set of consumers that cooperatively read a topic. Each partition is assigned to exactly one consumer in the group, so the group as a whole scales horizontally up to the partition count. Extra consumers beyond the partition count sit idle.
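
The scaling limit can be illustrated with a simplified assignment. This is a minimal sketch, assuming a plain round-robin spread and distinct consumer names; `assign` is a hypothetical helper and does not reproduce any of Kafka's built-in assignors.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupAssignmentSketch {
    // Gives each partition to exactly one consumer, spreading partitions round-robin.
    // Assumption: consumer names are distinct.
    static Map<String, List<Integer>> assign(List<String> consumers, int partitionCount) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Four consumers, three partitions: the fourth consumer receives nothing.
        System.out.println(assign(List.of("c1", "c2", "c3", "c4"), 3));
    }
}
```

Whatever the assignor, a consumer with an empty list does no work, so adding consumers beyond the partition count adds no throughput.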

What triggers a rebalance, and why is it costly?

A rebalance happens when a consumer joins, leaves, crashes (detected when session.timeout.ms expires), fails to poll within max.poll.interval.ms, or when the partition count of a subscribed topic changes. During an eager rebalance all consumers give up their partitions and wait for reassignment, causing a stop-the-world pause. Cooperative-sticky assignment reduces this by moving only the partitions that need to move.

How are offsets committed?

Offsets are stored in the internal __consumer_offsets topic. You can auto-commit on an interval (enable.auto.commit=true) or commit manually for precise control. Manual commit after successful processing gives at-least-once delivery.

// Requires the listener container's AckMode to be MANUAL or MANUAL_IMMEDIATE.
@KafkaListener(topics = "orders", groupId = "billing")
public void handle(OrderEvent event, Acknowledgment ack) {
    process(event);
    ack.acknowledge(); // manual commit only after processing succeeds
}

What is consumer lag?

Lag is the difference between the log-end offset of a partition and the consumer’s committed offset — how far behind the consumer is. Sustained, growing lag means consumers cannot keep up with producers.

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group billing

Output (abridged to the offset columns):

GROUP   TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
billing orders  0          10542           10560           18
billing orders  1          9981            9981            0
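
The LAG column is plain subtraction, which the following sketch reproduces from the figures in the output above. `lag` is a hypothetical helper; CURRENT-OFFSET is taken to be the group's committed offset.

```java
final class ConsumerLagSketch {
    // Lag for one partition: log-end offset minus the group's committed offset.
    static long lag(long logEndOffset, long committedOffset) {
        return logEndOffset - committedOffset;
    }

    public static void main(String[] args) {
        long partition0 = lag(10560, 10542); // partition 0 from the output above
        long partition1 = lag(9981, 9981);   // partition 1 from the output above
        System.out.println("partition 0 lag = " + partition0);
        System.out.println("partition 1 lag = " + partition1);
        System.out.println("total lag = " + (partition0 + partition1));
    }
}
```

A single reading says little on its own; the trend across repeated readings is what shows whether consumers are falling behind.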

Operations

What is the difference between log retention and log compaction?

Time/size retention (cleanup.policy=delete) discards old segments after retention.ms or retention.bytes. Compaction (cleanup.policy=compact) keeps the latest value per key, making the topic a changelog suitable for state stores.
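
The effect of compaction on a log can be sketched as keeping only the last value seen for each key. This is an illustration under simplifying assumptions: `compact` is a hypothetical helper, records are modelled as two-element `{key, value}` arrays, and the whole log is compacted in one pass.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CompactionSketch {
    // Keeps the latest value per key, ordered by where each key's latest record sat in the log.
    // Each record is a two-element array: {key, value}.
    static Map<String, String> compact(List<String[]> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : log) {
            latest.remove(record[0]);          // drop the older value so the key moves to the end
            latest.put(record[0], record[1]);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> log = List.of(
                new String[] {"user-1", "email=a@example.com"},
                new String[] {"user-2", "email=b@example.com"},
                new String[] {"user-1", "email=c@example.com"});
        System.out.println(compact(log));
    }
}
```

Replaying the compacted result rebuilds the current state per key, which is what makes such a topic usable as a changelog.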

How do you size partition count?

Estimate target throughput divided by per-partition throughput, factor in the maximum consumer parallelism you need, and avoid going so high that per-partition overhead and rebalance cost hurt. More partitions mean more open files and longer leader-election times.
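
As a starting point, the estimate can be written down as arithmetic. This is a rough sketch: `suggestPartitions` and its parameter names are hypothetical, the throughput figures are made-up inputs rather than benchmarks, and the overhead side of the trade-off still needs a judgment call.

```java
final class PartitionSizingSketch {
    // Lower bound on partitions: enough for the target throughput and for the
    // maximum number of consumers that should work in parallel.
    static int suggestPartitions(double targetMBps, double perPartitionMBps, int maxConsumers) {
        int forThroughput = (int) Math.ceil(targetMBps / perPartitionMBps);
        return Math.max(forThroughput, maxConsumers);
    }

    public static void main(String[] args) {
        // Hypothetical inputs: 95 MB/s target, 10 MB/s measured per partition, 6 consumers.
        System.out.println(suggestPartitions(95, 10, 6));
    }
}
```

Per-partition throughput should come from measuring your own producers and consumers, since it depends on record size, batching, and hardware.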

What is an ISR and what happens when a replica falls out?

The in-sync replica set (ISR) is the set of replicas, leader included, that are caught up with the leader. A follower drops out if it has not caught up to the leader's log end within replica.lag.time.max.ms. If the ISR shrinks below min.insync.replicas, producers using acks=all get NotEnoughReplicasException and writes are rejected to protect durability.
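
The interaction between acks and min.insync.replicas can be captured in a small decision function. This is a simplified model, assuming the leader is alive and counted in the ISR; `writeAccepted` and the `Acks` enum are hypothetical names, not broker code.

```java
final class IsrWriteCheckSketch {
    enum Acks { NONE, LEADER, ALL }

    // Models whether the leader accepts a write. Assumes the leader is alive and
    // counted in isrSize. min.insync.replicas is only enforced for acks=all.
    static boolean writeAccepted(Acks acks, int isrSize, int minInsyncReplicas) {
        if (acks != Acks.ALL) {
            return true;
        }
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(writeAccepted(Acks.ALL, 1, 2));    // ISR below the minimum
        System.out.println(writeAccepted(Acks.ALL, 2, 2));    // ISR meets the minimum
        System.out.println(writeAccepted(Acks.LEADER, 1, 2)); // minimum not enforced
    }
}
```

The acks=1 case shows why the two settings only provide a guarantee together: the minimum is never consulted unless the producer asks for acks=all.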

How do you handle poison messages?

Route records that repeatedly fail processing to a dead-letter topic after a bounded number of retries, so a single bad record does not block the partition.
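
The retry-then-park pattern can be sketched without any Kafka dependencies. This is a minimal illustration, assuming in-memory lists stand in for the source partition and the dead-letter topic; `processAll` is a hypothetical helper, and a real consumer would publish failed records to a separate topic instead of collecting them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class DeadLetterSketch {
    // Tries each record up to maxAttempts times. Records that still fail are set
    // aside (the stand-in for a dead-letter topic) and processing moves on.
    static List<String> processAll(List<String> records, Consumer<String> handler, int maxAttempts) {
        List<String> deadLetters = new ArrayList<>();
        for (String record : records) {
            boolean succeeded = false;
            for (int attempt = 1; attempt <= maxAttempts && !succeeded; attempt++) {
                try {
                    handler.accept(record);
                    succeeded = true;
                } catch (RuntimeException e) {
                    // Swallow and retry until the attempt budget is spent.
                }
            }
            if (!succeeded) {
                deadLetters.add(record);
            }
        }
        return deadLetters;
    }

    public static void main(String[] args) {
        List<String> dead = processAll(List.of("ok-1", "bad", "ok-2"), record -> {
            if (record.equals("bad")) {
                throw new IllegalStateException("cannot parse " + record);
            }
        }, 3);
        System.out.println("dead letters: " + dead);
    }
}
```

Because the loop moves on after the retry budget is spent, records behind the failing one are still processed.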

Best practices

Always state your acks, replication.factor, and min.insync.replicas assumptions when discussing durability — they define the guarantee.
Key records by the entity whose order matters, and fix partition counts early for keyed streams.
Commit offsets after processing for at-least-once, and design consumers to be idempotent so duplicates are harmless.
Prefer cooperative-sticky assignment to minimize rebalance pauses.
Monitor consumer lag as a first-class SLO, not an afterthought.
Use a dead-letter topic for poison messages so one bad record never stalls a partition.

Interview tip: Whenever you answer a Kafka durability or delivery question, explicitly state your assumptions about acks, replication factor, min.insync.replicas, and where offsets are committed. Interviewers are listening for exactly that precision.