Observability & Metrics

In production you cannot operate Kafka blind. You need to know whether consumers are keeping up, how fast producers are sending, and where a message spent its time as it flowed from an HTTP request through a topic to a downstream service. Spring for Apache Kafka integrates with Micrometer to publish client-level metrics and, since version 3.0 (the line that pairs with Spring Boot 3.x), to emit distributed-tracing spans for produce and consume operations. This page shows how to wire up metrics through Actuator and Prometheus, which signals to watch, and how to enable produce/consume tracing.

Dependencies

The Kafka client does not ship with Micrometer; it reports its metrics through its own registry, which is exposed over JMX out of the box. To bridge those metrics into Micrometer, expose them over HTTP, and add tracing, you need Actuator, a Micrometer registry, and a tracing bridge.

<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
  <scope>runtime</scope>
</dependency>
<!-- Tracing: Micrometer bridge + an exporter such as OTLP or Zipkin -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

Client metrics with no extra code

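The Kafka producer and consumer expose dozens of metrics through their metrics() map. Spring Boot’s auto-configuration attaches Micrometer listeners to the auto-configured producer and consumer factories, so the metrics of every client those factories create are bound to the MeterRegistry as long as a registry bean exists. If you declare your own factory beans, add MicrometerProducerListener and MicrometerConsumerListener to them yourself. Expose the metrics via Actuator: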

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    prometheus:
      access: read_only
  prometheus:
    metrics:
      export:
        enabled: true

Scrape GET /actuator/prometheus and you will find Kafka client gauges and counters already populated.

**Output:**

# HELP kafka_consumer_fetch_manager_records_lag The latest lag of the partition
# TYPE kafka_consumer_fetch_manager_records_lag gauge
kafka_consumer_fetch_manager_records_lag{client_id="orders-0",topic="orders",partition="2"} 0.0
# HELP kafka_consumer_fetch_manager_records_consumed_total
# TYPE kafka_consumer_fetch_manager_records_consumed_total counter
kafka_consumer_fetch_manager_records_consumed_total{client_id="orders-0",topic="orders"} 18421.0
# HELP kafka_producer_record_send_total
# TYPE kafka_producer_record_send_total counter
kafka_producer_record_send_total{client_id="producer-1"} 20114.0
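
The `_total` series in this output are counters that only grow, so a throughput figure comes from the difference between two scrapes divided by the time between them. The following is a minimal sketch of that arithmetic, not how Prometheus implements it: the class and method names are hypothetical, and it assumes that a counter which goes backwards has reset to zero after a client restart. In practice a query function such as Prometheus's `rate()` does this for you.

```java
import java.time.Duration;

/** Hypothetical helper: per-second rate from two samples of an ever-increasing counter. */
final class CounterRate {

    private CounterRate() {}

    /**
     * @param previous counter value at the earlier scrape
     * @param current  counter value at the later scrape
     * @param elapsed  time between the two scrapes; must be positive
     */
    static double perSecond(double previous, double current, Duration elapsed) {
        if (elapsed.isZero() || elapsed.isNegative()) {
            throw new IllegalArgumentException("elapsed must be positive");
        }
        // Assumption: a counter that went backwards was reset to zero (client restart),
        // so everything counted since the reset is treated as the increase.
        double increase = current >= previous ? current - previous : current;
        return increase / (elapsed.toMillis() / 1000.0);
    }
}
```

For example, two scrapes 15 seconds apart that read 18421 and 19021 records consumed work out to 40 records per second.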

Key metrics to watch

These are the signals that tell you whether your pipeline is healthy. All names below are the Micrometer/Prometheus form (dots become underscores).
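
To make the naming rule concrete, here is a rough sketch of how a hyphenated Kafka client metric could end up as one of the Prometheus names used on this page. It is an illustration under assumptions, not Micrometer's actual code: the class and method names are hypothetical, and the Kafka-side group and metric names in the usage note (`consumer-fetch-manager-metrics`, `records-lag`) are taken from the Kafka client's own naming rather than from this page.

```java
/** Hypothetical sketch of how a Kafka client metric name becomes a Prometheus series name. */
final class KafkaMetricNames {

    private KafkaMetricNames() {}

    /** Kafka group + metric name (hyphenated) to a dotted, Micrometer-style name. */
    static String micrometerName(String group, String name) {
        String raw = "kafka." + group + "." + name;
        // Assumption: the "-metrics" suffix of the group is dropped and hyphens become dots.
        return raw.replace("-metrics", "").replace('-', '.');
    }

    /** Dotted name to Prometheus-style name: dots become underscores. */
    static String prometheusName(String dottedName) {
        return dottedName.replace('.', '_');
    }
}
```

With these assumptions, `micrometerName("consumer-fetch-manager-metrics", "records-lag")` yields `kafka.consumer.fetch.manager.records.lag`, and `prometheusName` turns that into `kafka_consumer_fetch_manager_records_lag`.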

Metric	Type	What it tells you
`kafka_consumer_fetch_manager_records_lag`	gauge	Per-partition lag — how far behind the consumer is. The single most important consumer SLO.
`kafka_consumer_fetch_manager_records_lag_max`	gauge	Worst lag across assigned partitions.
`kafka_consumer_fetch_manager_records_consumed_total`	counter	Throughput; rate gives records/sec consumed.
`kafka_producer_record_send_total`	counter	Records sent; rate gives the send rate.
`kafka_producer_record_error_total`	counter	Failed sends — should stay flat at zero.
`kafka_producer_request_latency_avg`	gauge	Average broker acknowledgement latency.
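`spring_kafka_template` / `spring_kafka_listener`	timer	Spring-level timers for template sends and listener invocations, recorded through the Observation API when observation is enabled (series end in `_seconds_count`, `_seconds_sum`, `_seconds_max`).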

Tip: Lag from the client metric only reflects partitions this instance is assigned. For a cluster-wide view, also scrape consumer-group lag from the broker (e.g. via Kafka Exporter) so you do not miss revoked or rebalancing partitions.
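
The two lag gauges in the table can be read as simple arithmetic: per-partition lag is the distance between the log end offset and the consumer's position, and the max gauge is the largest of those values. The sketch below illustrates this with hypothetical names and plain maps in place of real client state. It also shows why an instance-local view is incomplete: partitions the instance does not own never appear in its result.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: per-partition lag and the worst lag across the partitions an instance owns. */
final class LagSketch {

    private LagSketch() {}

    /**
     * lag = log end offset - consumer position, never negative.
     * Only partitions present in both maps are reported.
     */
    static Map<Integer, Long> lagByPartition(Map<Integer, Long> endOffsets, Map<Integer, Long> positions) {
        Map<Integer, Long> lag = new TreeMap<>();
        for (Map.Entry<Integer, Long> e : positions.entrySet()) {
            Long end = endOffsets.get(e.getKey());
            if (end != null) {
                lag.put(e.getKey(), Math.max(0L, end - e.getValue()));
            }
        }
        return lag;
    }

    /** The largest per-partition lag, or 0 when nothing is assigned. */
    static long maxLag(Map<Integer, Long> lagByPartition) {
        return lagByPartition.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    }
}
```

If the topic has end offsets `{0=100, 1=250, 2=40}` and this instance holds positions `{0=100, 1=200}`, the result is `{0=0, 1=50}` with a max of 50. Partition 2 is invisible here, which is the gap that broker-side group lag fills.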

Enabling Spring-level observation and tracing

Client metrics describe the transport. To get a KafkaTemplate.send timer and, more importantly, distributed-tracing spans that propagate the trace context through record headers, turn on Spring’s observation support. Set observationEnabled on both the template and the listener container properties; with Spring Boot, two properties do this for the auto-configured beans.

spring:
  kafka:
    template:
      observation-enabled: true
    listener:
      observation-enabled: true
management:
  tracing:
    sampling:
      probability: 1.0   # sample everything in dev; lower in prod
  otlp:
    tracing:
      endpoint: http://localhost:4318/v1/traces

If you build the beans yourself rather than via properties, enable it directly:

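import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;

@Configuration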
public class KafkaObservabilityConfig {

    @Bean
    public KafkaTemplate<String, OrderEvent> kafkaTemplate(
            ProducerFactory<String, OrderEvent> pf) {
        KafkaTemplate<String, OrderEvent> template = new KafkaTemplate<>(pf);
        template.setObservationEnabled(true);
        return template;
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, OrderEvent> kafkaListenerContainerFactory(
            ConsumerFactory<String, OrderEvent> cf) {
        var factory = new ConcurrentKafkaListenerContainerFactory<String, OrderEvent>();
        factory.setConsumerFactory(cf);
        factory.getContainerProperties().setObservationEnabled(true);
        return factory;
    }
}

With observation on, a producer creates a span when it sends and injects the W3C traceparent header into the record. The consumer extracts that header and starts a child span, so a trace started by an inbound HTTP request continues seamlessly across the topic.

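import java.math.BigDecimal;

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

public record OrderEvent(String orderId, BigDecimal amount) {}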

@Component
class OrderProducer {

    private final KafkaTemplate<String, OrderEvent> template;

    OrderProducer(KafkaTemplate<String, OrderEvent> template) {
        this.template = template;
    }

    // Runs inside the caller's trace; produces an "orders send" span.
    void publish(OrderEvent event) {
        template.send("orders", event.orderId(), event);
    }
}

@Component
class OrderConsumer {

    // Receives the propagated trace context; produces an "orders receive" span.
    @KafkaListener(topics = "orders", groupId = "billing")
    void onOrder(OrderEvent event) {
        // business logic shares the same trace id as the producer
    }
}

A resulting trace shows the linked spans:

**Output:**

Trace 8f3c... (1.42s)
└─ POST /orders                 [http server]      120ms
   └─ orders send               [kafka producer]     8ms
      └─ orders receive         [kafka consumer]    14ms
         └─ chargeCard          [internal]         900ms
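
What links these spans is the traceparent header carried on the record. Its value follows the W3C Trace Context layout `version-traceid-parentid-flags`, all in lowercase hex. The sketch below parses such a value and builds the value a child span would pass on. It is only an aid for reading headers when you inspect records, since the tracing bridge does the real work. The type and method names are hypothetical, and it handles version `00` only without rejecting all-zero ids.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a W3C traceparent value: version-traceid-parentid-flags. */
record TraceParent(String traceId, String parentSpanId, boolean sampled) {

    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    /** Parses a version-00 header value; throws if the value is malformed. */
    static TraceParent parse(String headerValue) {
        Matcher m = FORMAT.matcher(headerValue);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a version-00 traceparent: " + headerValue);
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return new TraceParent(m.group(1), m.group(2), sampled);
    }

    /** Value a child span would send on: same trace id, its own span id as the new parent id. */
    String childHeader(String childSpanId) {
        return "00-" + traceId + "-" + childSpanId + "-" + (sampled ? "01" : "00");
    }
}
```

Parsing `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` gives a sampled trace with trace id `4bf92f3577b34da6a3ce929d0e0e4736`. The trace id stays constant from the HTTP span through the consumer span, while the parent id changes at every hop.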

Warning: observation-enabled adds a span per record. For very high-throughput topics, lower management.tracing.sampling.probability (e.g. to 0.05) so tracing overhead stays bounded. Metrics are unaffected because they are not sampled.
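
A quick estimate helps when choosing the probability. The sketch below is back-of-the-envelope arithmetic with hypothetical names, assuming the sampling decision is made once per trace and that every span in a sampled trace is exported. Real volume also depends on batching and exporter behaviour, which it ignores.

```java
/** Hypothetical back-of-the-envelope estimate of exported span volume under probability sampling. */
final class SamplingEstimate {

    private SamplingEstimate() {}

    /**
     * @param tracesPerSecond     traces started per second (for example, one per inbound request)
     * @param spansPerTrace       average number of spans in one trace
     * @param samplingProbability value of management.tracing.sampling.probability, 0.0 to 1.0
     * @return expected number of spans exported per second
     */
    static double exportedSpansPerSecond(double tracesPerSecond, int spansPerTrace, double samplingProbability) {
        if (samplingProbability < 0.0 || samplingProbability > 1.0) {
            throw new IllegalArgumentException("probability must be within [0.0, 1.0]");
        }
        return tracesPerSecond * spansPerTrace * samplingProbability;
    }
}
```

At 20,000 traces per second with four spans each, a probability of 1.0 exports about 80,000 spans per second, while 0.05 brings that down to about 4,000.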

Adding custom tags

To slice metrics by environment or region, register a MeterRegistryCustomizer that adds common tags to every meter, including the Kafka ones.

@Bean
MeterRegistryCustomizer<MeterRegistry> commonTags(
        @Value("${app.region}") String region) {
    return registry -> registry.config()
        .commonTags("application", "billing-service", "region", region);
}

Best Practices

Use records_lag_max as the primary indicator behind your consumer SLO and alert on sustained growth, not single spikes.
Scrape broker-side consumer-group lag in addition to client metrics so rebalances and idle instances do not hide lag.
Keep tracing sampling probability at 1.0 only in lower environments; sample a small fraction in production to control cost.
Set observationEnabled on both the template and the listener container properties; enabling only one side breaks span linkage.
Export to a long-term store (Prometheus + Grafana for metrics, an OTLP-compatible backend for traces) rather than relying on in-memory registries.
Add commonTags for application, environment, and region so dashboards stay readable across many services.
Watch record_error_total and request_latency_avg together; rising latency before errors is your early warning of broker pressure.
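
The advice to alert on sustained growth rather than single spikes can be sketched as a small rule over recent lag readings. This is one possible interpretation, not a prescribed alerting algorithm: the class name, the window, and the minimum increase are hypothetical parameters you would tune, and most teams would express the same idea in their alerting system's query language instead.

```java
import java.util.List;

/** Hypothetical alert rule: fire only when lag has grown on every step of the recent window. */
final class SustainedLagAlert {

    private SustainedLagAlert() {}

    /**
     * @param lagSamples  lag readings in scrape order, oldest first
     * @param window      how many of the most recent readings to inspect (at least 2)
     * @param minIncrease total growth across the window needed to count as meaningful
     */
    static boolean shouldAlert(List<Long> lagSamples, int window, long minIncrease) {
        if (window < 2 || lagSamples.size() < window) {
            return false; // not enough data to call anything "sustained"
        }
        List<Long> recent = lagSamples.subList(lagSamples.size() - window, lagSamples.size());
        for (int i = 1; i < recent.size(); i++) {
            if (recent.get(i) <= recent.get(i - 1)) {
                return false; // lag fell or held steady at least once: a spike, not a trend
            }
        }
        return recent.get(recent.size() - 1) - recent.get(0) >= minIncrease;
    }
}
```

With a window of 3, the readings 10, 5000, 12, 9 do not alert because lag recovered after the spike, whereas 100, 400, 900, 1600 do alert once the required increase is 500 or less.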