Consumer Lag & Monitoring
Consumer lag is the single most important health signal for any Kafka consumer. It tells you how far behind your application is from the records producers are writing — in other words, how stale your data is. A small, steady lag is normal and healthy; a lag that grows without bound means your consumers cannot keep up, latency is climbing, and downstream systems are working with old information. Monitoring lag well is the difference between catching a backlog early and discovering it after an SLA breach.
What consumer lag is
Every partition has a log-end offset (LEO) — the offset of the next record to be appended, effectively the position of the newest produced record. Each consumer group tracks a committed offset per partition — the position up to which it has successfully processed and acknowledged records. Lag is simply the difference:
lag(partition) = log-end-offset(partition) - committed-offset(partition)
Total group lag is the sum across all assigned partitions. A lag of 0 means the consumer has caught up to the tail of the log. A lag of 50000 means there are 50,000 unprocessed records waiting on that partition.
Lag is measured in records, not time. A lag of 1,000 might mean one second behind on a busy topic or one hour behind on a slow one. For SLA reasoning, combine record lag with throughput, or use a tool that estimates time lag.
Measuring lag with kafka-consumer-groups
The bundled kafka-consumer-groups.sh CLI is the canonical way to inspect lag. The --describe command prints per-partition offsets and lag for a group.
kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--describe \
--group order-processor
Output:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
order-processor orders 0 152340 152350 10 consumer-1-a1b2c3 /10.0.0.11 consumer-1
order-processor orders 1 148902 151400 2498 consumer-2-d4e5f6 /10.0.0.12 consumer-2
order-processor orders 2 150011 150011 0 consumer-3-g7h8i9 /10.0.0.13 consumer-3
Read this output carefully:
CURRENT-OFFSETis the committed offset,LOG-END-OFFSETis the LEO, andLAGis their difference.- Partition
1is falling behind (lag 2498) while the others are healthy — this points to a hot partition or an uneven workload, not a global capacity problem. - A
CONSUMER-IDof-(none) means the partition has no live owner; its lag will keep growing until a consumer is assigned.
To list every group and find unhealthy ones quickly:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --all-groups --describe
Monitoring lag continuously
The CLI is great for ad-hoc checks but you need automated, alertable monitoring in production. There are three common approaches.
| Source | What it gives you | Best for |
|---|---|---|
kafka-consumer-groups.sh | On-demand per-partition lag | Debugging, scripts, cron checks |
Consumer client metrics (records-lag-max) | Lag as seen by the consumer JVM | App-level dashboards, autoscaling |
| Burrow / Kafka Lag Exporter | Centralized lag + status + time lag | Fleet-wide alerting in Prometheus/Grafana |
The Kafka consumer exposes lag directly via JMX. The key metric is records-lag-max (the maximum lag across the consumer’s assigned partitions). In a Spring Boot app with Micrometer, these are auto-published to your registry:
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;
@Component
public class OrderListener {
private final MeterRegistry registry;
public OrderListener(MeterRegistry registry) {
this.registry = registry;
}
@KafkaListener(topics = "orders", groupId = "order-processor")
public void onOrder(OrderEvent event) {
// process; kafka.consumer.records.lag.max is exported automatically
process(event);
}
private void process(OrderEvent event) {
// real processing: persist, transform, forward downstream
}
}
record OrderEvent(String orderId, String status, long amountCents) {}
Enable the metrics binding and a scrape endpoint in application.yml:
spring:
kafka:
consumer:
group-id: order-processor
properties:
# expose consumer client JMX/Micrometer metrics
metric.reporters: ""
management:
endpoints:
web:
exposure:
include: prometheus,metrics
For dedicated, group-agnostic monitoring, Burrow (LinkedIn) and the Kafka Lag Exporter track committed offsets against LEO for every group and expose status (OK, WARN, STOP) plus an estimated time lag derived from offset commit history. Wire either into Prometheus and alert on time lag rather than raw record counts.
Why lag grows and how to fix it
Lag accumulates whenever the consume rate is lower than the produce rate. The cause almost always falls into one of these buckets:
| Cause | Symptom | Remedy |
|---|---|---|
| Slow per-record processing | Lag rises evenly across all partitions | Optimize the handler; batch DB writes; offload blocking I/O |
| Too few consumer instances | All partitions lag; consumers at high CPU | Add consumers, up to the partition count |
| Too few partitions | Adding consumers does not help | Increase partitions so the group can scale wider |
| Uneven keys / hot partition | One partition lags, others at zero | Improve key distribution or repartition |
| Long rebalances / stop-the-world | Lag spikes then recovers | Use cooperative rebalancing; tune max.poll.interval.ms |
The hard ceiling on parallelism is the partition count: a group can never have more active consumers than partitions, so extra consumers sit idle. If you have 6 partitions and 6 consumers but still lag, the fix is more partitions plus more consumers, or faster processing — not more consumers alone.
Increasing partitions on an existing topic changes key-to-partition mapping for hash-partitioned producers, so ordering guarantees for a given key only hold after the change. Plan partition increases carefully and prefer over-provisioning partitions up front.
Best Practices
- Alert on lag trend (sustained growth) rather than a single high reading — momentary spikes after deploys or rebalances are normal.
- Track
records-lag-maxper consumer and an external time lag estimate together; records alone hide how stale data really is. - Size partitions for your peak target throughput so you can scale consumers horizontally without repartitioning under pressure.
- Keep listener processing fast; move slow or blocking work off the poll thread and commit only after successful handling.
- Watch for partitions with no owning consumer (
CONSUMER-ID = -) — they signal a crashed instance or a stuck rebalance, not just a slow one. - Use cooperative rebalancing to avoid stop-the-world pauses that cause lag to spike across the whole group during scaling events.