Consumer Tuning
Consumer tuning is about two things at once: moving records off the broker efficiently, and staying a healthy, stable member of the consumer group so you never trigger needless rebalances. In production the most common consumer pathology is not slow fetching — it is a consumer that pulls a big batch, takes too long to process it, misses the next poll(), and gets evicted from the group, which stalls every partition it owned. Getting the fetch sizing, the poll loop budget, and the group timeouts right is what keeps a pipeline both fast and steady.
How the poll loop drives everything
A Kafka consumer is single-threaded per instance: it calls poll(), gets a batch of up to max.poll.records, processes them, and calls poll() again. Between polls a background thread sends heartbeats to prove liveness, but the broker also expects the next poll() within max.poll.interval.ms. If your processing of one batch exceeds that interval, the broker assumes the consumer is dead and rebalances its partitions away.
poll() -> [up to max.poll.records] -> process batch -> poll() -> ...
^ must finish a full cycle within max.poll.interval.ms
The golden rule: max.poll.records × per-record processing time must stay comfortably under max.poll.interval.ms. Tune the batch size and the interval together.
Fetch sizing: bigger reads, fewer round trips
Fetch settings control how much data the broker accumulates before answering a fetch request. Larger fetches amortize network overhead and raise throughput; smaller, time-bounded fetches lower latency. fetch.min.bytes is the amount the broker waits to accumulate, and fetch.max.wait.ms caps that wait so a quiet topic does not starve.
| Property | Default | Tuned (throughput) | Effect |
|---|---|---|---|
fetch.min.bytes | 1 | 65536–1048576 (64 KB–1 MB) | Broker waits for more data per fetch |
fetch.max.wait.ms | 500 | 100–500 | Upper bound on the wait when data is thin |
max.partition.fetch.bytes | 1048576 (1 MB) | 2097152+ (2 MB+) | Max bytes returned per partition per fetch |
fetch.max.bytes | 52428800 (50 MB) | 52428800+ | Max bytes across all partitions per fetch |
max.poll.records | 500 | 1000–2000 | Records returned per poll() call |
Gotcha:
max.partition.fetch.bytesmust be at least as large as the broker’smax.message.bytes, or a single oversized record can wedge the consumer — it cannot fetch a message it has no room for, and progress on that partition stops.
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "events-processor");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
// Throughput-oriented fetch sizing
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024 * 1024); // wait for 1 MB
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 250); // but no longer than 250 ms
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 4 * 1024 * 1024);
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // 5 min processing budget
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
consumer.subscribe(List.of("events"));
while (true) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, String> record : records) {
process(record.value());
}
consumer.commitSync();
}
}
Group membership: heartbeats and timeouts
Group stability is governed by three timeouts. heartbeat.interval.ms is how often the background thread pings the coordinator; session.timeout.ms is how long the coordinator waits without a heartbeat before evicting the member; and max.poll.interval.ms (above) covers slow processing specifically. The heartbeat interval should be roughly one-third of the session timeout so a transient hiccup does not cause eviction.
| Property | Default | Guidance |
|---|---|---|
session.timeout.ms | 45000 | Eviction window for missed heartbeats; raise on flaky networks |
heartbeat.interval.ms | 3000 | Keep at ~1/3 of session.timeout.ms |
max.poll.interval.ms | 300000 | Raise it (or lower max.poll.records) if processing is slow |
Warning: Do not “fix” rebalances by inflating
max.poll.interval.msto an hour. A consumer that legitimately died will then hold its partitions for that whole window, blocking progress. Lowermax.poll.recordsso each batch finishes quickly, and only widen the interval as a last resort.
Concurrency in Spring Boot
A single consumer instance reads its assigned partitions sequentially. To process partitions in parallel within one application, set concurrency on the listener container — Spring runs that many consumer threads, each owning a slice of the partitions. Never set it higher than the partition count; extra threads just idle.
@KafkaListener(topics = "events", groupId = "events-processor", concurrency = "6")
public void consume(List<ConsumerRecord<String, String>> batch) {
// batch listener amortizes per-record overhead across the whole poll
for (ConsumerRecord<String, String> record : batch) {
process(record.value());
}
}
The same tuning keys live under spring.kafka.consumer:
spring:
kafka:
bootstrap-servers: broker1:9092,broker2:9092
consumer:
group-id: events-processor
max-poll-records: 1000
fetch-min-size: 1048576 # 1 MB
fetch-max-wait: 250ms
properties:
max.partition.fetch.bytes: 4194304 # 4 MB
max.poll.interval.ms: 300000
session.timeout.ms: 45000
heartbeat.interval.ms: 15000
listener:
type: batch
Throughput vs latency profiles
There is no universal setting — pick a profile and benchmark it with real payloads.
| Setting | Throughput profile | Latency profile |
|---|---|---|
fetch.min.bytes | 1 MB | 1 (default) |
fetch.max.wait.ms | 250–500 | 10–50 |
max.partition.fetch.bytes | 4 MB+ | 1 MB |
max.poll.records | 1000–2000 | 100–250 |
| Listener type | batch | record |
The throughput profile waits to fill big fetches and processes large batches; the latency profile responds the instant any data arrives and keeps batches small so end-to-end delay stays low.
Best Practices
- Keep
max.poll.records× per-record processing time well undermax.poll.interval.ms— this is the number-one cause of avoidable rebalances. - Raise
fetch.min.byteswith a boundedfetch.max.wait.msfor throughput; leave both near default for low latency. - Ensure
max.partition.fetch.bytes≥ the broker’smax.message.bytesso a large record can never stall a partition. - Set
heartbeat.interval.msto about one-third ofsession.timeout.ms, and tune slow-processing timeouts viamax.poll.interval.ms, not the session timeout. - Match listener
concurrencyto partition count; more threads than partitions just sit idle. - Use batch listeners with a higher
max.poll.recordsto amortize per-record cost, and move slow or blocking work off the poll thread. - Watch consumer lag and rebalance frequency as your primary signals, and change one variable at a time so each gain is attributable.