Consumer Lag & Monitoring

Consumer lag is the single most important health signal for any Kafka consumer. It tells you how far behind your application is from the records producers are writing — in other words, how stale your data is. A small, steady lag is normal and healthy; a lag that grows without bound means your consumers cannot keep up, latency is climbing, and downstream systems are working with old information. Monitoring lag well is the difference between catching a backlog early and discovering it after an SLA breach.

What consumer lag is

Every partition has a log-end offset (LEO) — the offset of the next record to be appended, effectively the position of the newest produced record. Each consumer group tracks a committed offset per partition — the position up to which it has successfully processed and acknowledged records. Lag is simply the difference:

lag(partition) = log-end-offset(partition) - committed-offset(partition)

Total group lag is the sum across all assigned partitions. A lag of 0 means the consumer has caught up to the tail of the log. A lag of 50000 means there are 50,000 unprocessed records waiting on that partition.

Lag is measured in records, not time. A lag of 1,000 might mean one second behind on a busy topic or one hour behind on a slow one. For SLA reasoning, combine record lag with throughput, or use a tool that estimates time lag.

Measuring lag with kafka-consumer-groups

The bundled kafka-consumer-groups.sh CLI is the canonical way to inspect lag. The --describe command prints per-partition offsets and lag for a group.

kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group order-processor

Output:

GROUP           TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID                  HOST           CLIENT-ID
order-processor orders   0          152340          152350          10    consumer-1-a1b2c3            /10.0.0.11     consumer-1
order-processor orders   1          148902          151400          2498  consumer-2-d4e5f6            /10.0.0.12     consumer-2
order-processor orders   2          150011          150011          0     consumer-3-g7h8i9            /10.0.0.13     consumer-3

Read this output carefully:

CURRENT-OFFSET is the committed offset, LOG-END-OFFSET is the LEO, and LAG is their difference.
Partition 1 is falling behind (lag 2498) while the others are healthy — this points to a hot partition or an uneven workload, not a global capacity problem.
A CONSUMER-ID of - (none) means the partition has no live owner; its lag will keep growing until a consumer is assigned.

To list every group and find unhealthy ones quickly:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --all-groups --describe

Monitoring lag continuously

The CLI is great for ad-hoc checks but you need automated, alertable monitoring in production. There are three common approaches.

Source	What it gives you	Best for
`kafka-consumer-groups.sh`	On-demand per-partition lag	Debugging, scripts, cron checks
Consumer client metrics (`records-lag-max`)	Lag as seen by the consumer JVM	App-level dashboards, autoscaling
Burrow / Kafka Lag Exporter	Centralized lag + status + time lag	Fleet-wide alerting in Prometheus/Grafana

The Kafka consumer exposes lag directly via JMX. The key metric is records-lag-max (the maximum lag across the consumer’s assigned partitions). In a Spring Boot app with Micrometer, these are auto-published to your registry:

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    private final MeterRegistry registry;

    public OrderListener(MeterRegistry registry) {
        this.registry = registry;
    }

    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void onOrder(OrderEvent event) {
        // process; kafka.consumer.records.lag.max is exported automatically
        process(event);
    }

    private void process(OrderEvent event) {
        // real processing: persist, transform, forward downstream
    }
}

record OrderEvent(String orderId, String status, long amountCents) {}

Enable the metrics binding and a scrape endpoint in application.yml:

spring:
  kafka:
    consumer:
      group-id: order-processor
      properties:
        # expose consumer client JMX/Micrometer metrics
        metric.reporters: ""
management:
  endpoints:
    web:
      exposure:
        include: prometheus,metrics

For dedicated, group-agnostic monitoring, Burrow (LinkedIn) and the Kafka Lag Exporter track committed offsets against LEO for every group and expose status (OK, WARN, STOP) plus an estimated time lag derived from offset commit history. Wire either into Prometheus and alert on time lag rather than raw record counts.

Why lag grows and how to fix it

Lag accumulates whenever the consume rate is lower than the produce rate. The cause almost always falls into one of these buckets:

Cause	Symptom	Remedy
Slow per-record processing	Lag rises evenly across all partitions	Optimize the handler; batch DB writes; offload blocking I/O
Too few consumer instances	All partitions lag; consumers at high CPU	Add consumers, up to the partition count
Too few partitions	Adding consumers does not help	Increase partitions so the group can scale wider
Uneven keys / hot partition	One partition lags, others at zero	Improve key distribution or repartition
Long rebalances / stop-the-world	Lag spikes then recovers	Use cooperative rebalancing; tune `max.poll.interval.ms`

The hard ceiling on parallelism is the partition count: a group can never have more active consumers than partitions, so extra consumers sit idle. If you have 6 partitions and 6 consumers but still lag, the fix is more partitions plus more consumers, or faster processing — not more consumers alone.

Increasing partitions on an existing topic changes key-to-partition mapping for hash-partitioned producers, so ordering guarantees for a given key only hold after the change. Plan partition increases carefully and prefer over-provisioning partitions up front.

Best Practices

Alert on lag trend (sustained growth) rather than a single high reading — momentary spikes after deploys or rebalances are normal.
Track records-lag-max per consumer and an external time lag estimate together; records alone hide how stale data really is.
Size partitions for your peak target throughput so you can scale consumers horizontally without repartitioning under pressure.
Keep listener processing fast; move slow or blocking work off the poll thread and commit only after successful handling.
Watch for partitions with no owning consumer (CONSUMER-ID = -) — they signal a crashed instance or a stuck rebalance, not just a slow one.
Use cooperative rebalancing to avoid stop-the-world pauses that cause lag to spike across the whole group during scaling events.