Consumer Tuning

Consumer tuning is about two things at once: moving records off the broker efficiently, and staying a healthy, stable member of the consumer group so you never trigger needless rebalances. In production the most common consumer pathology is not slow fetching — it is a consumer that pulls a big batch, takes too long to process it, misses the next poll(), and gets evicted from the group, which stalls every partition it owned. Getting the fetch sizing, the poll loop budget, and the group timeouts right is what keeps a pipeline both fast and steady.

How the poll loop drives everything

A Kafka consumer is single-threaded per instance: it calls poll(), gets a batch of up to max.poll.records, processes them, and calls poll() again. Between polls a background thread sends heartbeats to prove liveness, but the broker also expects the next poll() within max.poll.interval.ms. If your processing of one batch exceeds that interval, the broker assumes the consumer is dead and rebalances its partitions away.

poll() -> [up to max.poll.records] -> process batch -> poll() -> ...
              ^ must finish a full cycle within max.poll.interval.ms

The golden rule: max.poll.records × per-record processing time must stay comfortably under max.poll.interval.ms. Tune the batch size and the interval together.

Fetch sizing: bigger reads, fewer round trips

Fetch settings control how much data the broker accumulates before answering a fetch request. Larger fetches amortize network overhead and raise throughput; smaller, time-bounded fetches lower latency. fetch.min.bytes is the amount the broker waits to accumulate, and fetch.max.wait.ms caps that wait so a quiet topic does not starve.

Property	Default	Tuned (throughput)	Effect
`fetch.min.bytes`	1	65536–1048576 (64 KB–1 MB)	Broker waits for more data per fetch
`fetch.max.wait.ms`	500	100–500	Upper bound on the wait when data is thin
`max.partition.fetch.bytes`	1048576 (1 MB)	2097152+ (2 MB+)	Max bytes returned per partition per fetch
`fetch.max.bytes`	52428800 (50 MB)	52428800+	Max bytes across all partitions per fetch
`max.poll.records`	500	1000–2000	Records returned per `poll()` call

Gotcha: max.partition.fetch.bytes must be at least as large as the broker’s max.message.bytes, or a single oversized record can wedge the consumer — it cannot fetch a message it has no room for, and progress on that partition stops.

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "events-processor");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

// Throughput-oriented fetch sizing
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024 * 1024);          // wait for 1 MB
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 250);                // but no longer than 250 ms
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 4 * 1024 * 1024);
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);         // 5 min processing budget

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("events"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record.value());
        }
        consumer.commitSync();
    }
}

Group membership: heartbeats and timeouts

Group stability is governed by three timeouts. heartbeat.interval.ms is how often the background thread pings the coordinator; session.timeout.ms is how long the coordinator waits without a heartbeat before evicting the member; and max.poll.interval.ms (above) covers slow processing specifically. The heartbeat interval should be roughly one-third of the session timeout so a transient hiccup does not cause eviction.

Property	Default	Guidance
`session.timeout.ms`	45000	Eviction window for missed heartbeats; raise on flaky networks
`heartbeat.interval.ms`	3000	Keep at ~1/3 of `session.timeout.ms`
`max.poll.interval.ms`	300000	Raise it (or lower `max.poll.records`) if processing is slow

Warning: Do not “fix” rebalances by inflating max.poll.interval.ms to an hour. A consumer that legitimately died will then hold its partitions for that whole window, blocking progress. Lower max.poll.records so each batch finishes quickly, and only widen the interval as a last resort.

Concurrency in Spring Boot

A single consumer instance reads its assigned partitions sequentially. To process partitions in parallel within one application, set concurrency on the listener container — Spring runs that many consumer threads, each owning a slice of the partitions. Never set it higher than the partition count; extra threads just idle.

@KafkaListener(topics = "events", groupId = "events-processor", concurrency = "6")
public void consume(List<ConsumerRecord<String, String>> batch) {
    // batch listener amortizes per-record overhead across the whole poll
    for (ConsumerRecord<String, String> record : batch) {
        process(record.value());
    }
}

The same tuning keys live under spring.kafka.consumer:

spring:
  kafka:
    bootstrap-servers: broker1:9092,broker2:9092
    consumer:
      group-id: events-processor
      max-poll-records: 1000
      fetch-min-size: 1048576        # 1 MB
      fetch-max-wait: 250ms
      properties:
        max.partition.fetch.bytes: 4194304   # 4 MB
        max.poll.interval.ms: 300000
        session.timeout.ms: 45000
        heartbeat.interval.ms: 15000
    listener:
      type: batch

Throughput vs latency profiles

There is no universal setting — pick a profile and benchmark it with real payloads.

Setting	Throughput profile	Latency profile
`fetch.min.bytes`	1 MB	1 (default)
`fetch.max.wait.ms`	250–500	10–50
`max.partition.fetch.bytes`	4 MB+	1 MB
`max.poll.records`	1000–2000	100–250
Listener type	batch	record

The throughput profile waits to fill big fetches and processes large batches; the latency profile responds the instant any data arrives and keeps batches small so end-to-end delay stays low.

Best Practices

Keep max.poll.records × per-record processing time well under max.poll.interval.ms — this is the number-one cause of avoidable rebalances.
Raise fetch.min.bytes with a bounded fetch.max.wait.ms for throughput; leave both near default for low latency.
Ensure max.partition.fetch.bytes ≥ the broker’s max.message.bytes so a large record can never stall a partition.
Set heartbeat.interval.ms to about one-third of session.timeout.ms, and tune slow-processing timeouts via max.poll.interval.ms, not the session timeout.
Match listener concurrency to partition count; more threads than partitions just sit idle.
Use batch listeners with a higher max.poll.records to amortize per-record cost, and move slow or blocking work off the poll thread.
Watch consumer lag and rebalance frequency as your primary signals, and change one variable at a time so each gain is attributable.