Observability & Metrics
In production you cannot operate Kafka blind. You need to know whether consumers are keeping up, how fast producers are sending, and where a message spent its time as it flowed from an HTTP request through a topic to a downstream service. Spring for Apache Kafka integrates with Micrometer to publish client-level metrics and, since Spring for Apache Kafka 3.0, to emit distributed-tracing spans for every produce and consume operation. This page shows how to wire up metrics through Actuator and Prometheus, which signals to watch, and how to enable produce/consume tracing.
Dependencies
The Kafka client exposes its own metrics out of the box (they are also visible over JMX), and Micrometer can bind them to a registry. To expose them over HTTP and to add tracing you need Actuator, a Micrometer registry, and a tracing bridge.
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
  <scope>runtime</scope>
</dependency>
<!-- Tracing: Micrometer bridge + an exporter such as OTLP or Zipkin -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
Client metrics with no extra code
The Kafka producer and consumer expose dozens of metrics through their metrics() map. Spring Boot’s auto-configuration registers Micrometer listeners on the auto-configured producer and consumer factories, so these metrics are bound to the MeterRegistry as each client is created, as long as a registry bean exists. Factories you construct yourself need those listeners added manually. Expose the metrics via Actuator:
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    prometheus:
      access: read_only
  prometheus:
    metrics:
      export:
        enabled: true
Scrape GET /actuator/prometheus and you will find Kafka client gauges and counters already populated.
**Output:**
# HELP kafka_consumer_fetch_manager_records_lag The latest lag of the partition
# TYPE kafka_consumer_fetch_manager_records_lag gauge
kafka_consumer_fetch_manager_records_lag{client_id="orders-0",topic="orders",partition="2"} 0.0
# HELP kafka_consumer_fetch_manager_records_consumed_total
# TYPE kafka_consumer_fetch_manager_records_consumed_total counter
kafka_consumer_fetch_manager_records_consumed_total{client_id="orders-0",topic="orders"} 18421.0
# HELP kafka_producer_record_send_total
# TYPE kafka_producer_record_send_total counter
kafka_producer_record_send_total{client_id="producer-1"} 20114.0
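If you want to inspect a scrape programmatically, for example in a smoke test, the text format above is simple to read line by line. The following is a minimal sketch under stated assumptions, not a full Prometheus parser: the class name `LagScrapeParser` is hypothetical, and it assumes label values do not contain a `}` character.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal sketch: extract per-partition lag samples from Prometheus text output. */
final class LagScrapeParser {

    static final String LAG_METRIC = "kafka_consumer_fetch_manager_records_lag";

    /** Returns label set -> value for every sample line of the lag gauge. */
    static Map<String, Double> parseLag(String scrapeBody) {
        Map<String, Double> lagByLabels = new LinkedHashMap<>();
        for (String line : scrapeBody.split("\n")) {
            String trimmed = line.trim();
            // Skip blanks, "# HELP" / "# TYPE" comments, and other metrics.
            // Requiring "{" right after the name also excludes records_lag_max.
            if (trimmed.isEmpty() || trimmed.startsWith("#")
                    || !trimmed.startsWith(LAG_METRIC + "{")) {
                continue;
            }
            int labelsEnd = trimmed.lastIndexOf('}');
            String labels = trimmed.substring(LAG_METRIC.length(), labelsEnd + 1);
            // The value is the first token after the labels; an optional
            // timestamp may follow it in the exposition format.
            String value = trimmed.substring(labelsEnd + 1).trim().split("\\s+")[0];
            lagByLabels.put(labels, Double.parseDouble(value));
        }
        return lagByLabels;
    }
}
```

Passing the sample scrape shown above to `parseLag` yields a single entry keyed by the label set of partition 2.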
Key metrics to watch
These are the signals that tell you whether your pipeline is healthy. All names below are the Micrometer/Prometheus form (dots become underscores).
| Metric | Type | What it tells you |
|---|---|---|
| kafka_consumer_fetch_manager_records_lag | gauge | Per-partition lag: how far behind the consumer is. The single most important consumer SLO. |
| kafka_consumer_fetch_manager_records_lag_max | gauge | Worst lag across assigned partitions. |
| kafka_consumer_fetch_manager_records_consumed_total | counter | Throughput; rate gives records/sec consumed. |
| kafka_producer_record_send_total | counter | Records sent; rate gives the send rate. |
| kafka_producer_record_error_total | counter | Failed sends; should stay flat at zero. |
| kafka_producer_request_latency_avg | gauge | Average broker acknowledgement latency. |
| spring_kafka_template / spring_kafka_listener | timer | Spring-level timers added when observation is enabled (count, sum, max). |
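To map a name from the Kafka client documentation onto the names in this table, it helps to see the renaming as code. The sketch below approximates the convention and is not Micrometer's actual implementation. `KafkaMeterNames` is a hypothetical helper, and it ignores any unit or counter suffix the Prometheus registry may append.

```java
/** Sketch of how a Kafka client metric name maps to the Prometheus form. */
final class KafkaMeterNames {

    /**
     * Approximates the Micrometer meter name for a Kafka metric:
     * prefix with "kafka.", drop the "-metrics" suffix of the group,
     * and turn hyphens into dots.
     */
    static String micrometerName(String group, String name) {
        return ("kafka." + group + "." + name)
                .replace("-metrics", "")
                .replace('-', '.');
    }

    /** Applies the Prometheus convention: dots become underscores. */
    static String prometheusName(String micrometerName) {
        return micrometerName.replace('.', '_');
    }
}
```

For example, the client metric `records-lag` in group `consumer-fetch-manager-metrics` becomes `kafka.consumer.fetch.manager.records.lag` and then `kafka_consumer_fetch_manager_records_lag`.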
Tip: Lag from the client metric only reflects partitions this instance is assigned. For a cluster-wide view, also scrape consumer-group lag from the broker (e.g. via Kafka Exporter) so you do not miss revoked or rebalancing partitions.
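Broker-side lag is conceptually just the log end offset minus the group's committed offset, per partition. The sketch below shows that arithmetic on plain maps. `GroupLag` is a hypothetical helper; in a real tool the two maps would come from the Kafka Admin API (`listOffsets` and `listConsumerGroupOffsets`). Treating a partition with no committed offset as starting from offset 0 is an assumption made for illustration, since the real starting point depends on the log start offset and the consumer's reset policy.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: consumer-group lag computed from end offsets and committed offsets. */
final class GroupLag {

    /** Per-partition lag = end offset - committed offset, floored at zero. */
    static Map<Integer, Long> lagByPartition(Map<Integer, Long> endOffsets,
                                             Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> lag = new TreeMap<>();
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            // Assumption: no committed offset means nothing has been consumed yet.
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            lag.put(entry.getKey(), Math.max(0L, entry.getValue() - committed));
        }
        return lag;
    }

    /** Total lag across all partitions of the topic. */
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0L;
        for (long partitionLag : lagByPartition(endOffsets, committedOffsets).values()) {
            total += partitionLag;
        }
        return total;
    }
}
```

Because the calculation covers every partition of the topic rather than only the ones a given instance owns, it still reports lag for partitions that are unassigned during a rebalance.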
Enabling Spring-level observation and tracing
Client metrics describe the transport. To get a KafkaTemplate.send timer and, more importantly, distributed-tracing spans that propagate the trace context through record headers, turn on Spring’s observation support. Set observationEnabled on both the template and the listener container factory.
spring:
  kafka:
    template:
      observation-enabled: true
    listener:
      observation-enabled: true
management:
  tracing:
    sampling:
      probability: 1.0 # sample everything in dev; lower in prod
  otlp:
    tracing:
      endpoint: http://localhost:4318/v1/traces
If you build the beans yourself rather than via properties, enable it directly:
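import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;

@Configuration
public class KafkaObservabilityConfig {

    // Producer side: times each send and starts a producer span.
    @Bean
    public KafkaTemplate<String, OrderEvent> kafkaTemplate(
            ProducerFactory<String, OrderEvent> pf) {
        KafkaTemplate<String, OrderEvent> template = new KafkaTemplate<>(pf);
        template.setObservationEnabled(true);
        return template;
    }

    // Consumer side: the flag lives on the container properties of the factory.
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, OrderEvent> kafkaListenerContainerFactory(
            ConsumerFactory<String, OrderEvent> cf) {
        var factory = new ConcurrentKafkaListenerContainerFactory<String, OrderEvent>();
        factory.setConsumerFactory(cf);
        factory.getContainerProperties().setObservationEnabled(true);
        return factory;
    }
}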
With observation on, a producer creates a span when it sends and injects the W3C traceparent header into the record. The consumer extracts that header and starts a child span, so a trace started by an inbound HTTP request continues seamlessly across the topic.
import java.math.BigDecimal;

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

public record OrderEvent(String orderId, BigDecimal amount) {}

@Component
class OrderProducer {

    private final KafkaTemplate<String, OrderEvent> template;

    OrderProducer(KafkaTemplate<String, OrderEvent> template) {
        this.template = template;
    }

    // Runs inside the caller's trace; produces an "orders send" span.
    void publish(OrderEvent event) {
        template.send("orders", event.orderId(), event);
    }
}

@Component
class OrderConsumer {

    // Receives the propagated trace context; produces an "orders receive" span.
    @KafkaListener(topics = "orders", groupId = "billing")
    void onOrder(OrderEvent event) {
        // business logic shares the same trace id as the producer
    }
}
A resulting trace shows the linked spans:
**Output:**
Trace 8f3c... (1.42s)
└─ POST /orders [http server] 120ms
   └─ orders send [kafka producer] 8ms
      └─ orders receive [kafka consumer] 14ms
         └─ chargeCard [internal] 900ms
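To make the header propagation concrete, here is a minimal sketch of what the traceparent value carries and how a consumer-side span relates to it. It follows the W3C Trace Context version `00` layout (`00-<trace id>-<parent span id>-<flags>`). The `TraceParent` record is purely illustrative and is not the API Micrometer Tracing or OpenTelemetry uses internally, and it skips some checks the specification requires, such as rejecting all-zero ids.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative model of a version-00 W3C traceparent header value. */
record TraceParent(String traceId, String spanId, boolean sampled) {

    private static final Pattern FORMAT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    /** Parses the header value the producer wrote into the record headers. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a version-00 traceparent: " + header);
        }
        int flags = Integer.parseInt(m.group(3), 16);
        // The lowest flag bit marks the trace as sampled.
        return new TraceParent(m.group(1), m.group(2), (flags & 0x01) != 0);
    }

    /** A child span keeps the trace id and sampling flag but gets its own span id. */
    TraceParent child(String newSpanId) {
        return new TraceParent(traceId, newSpanId, sampled);
    }

    /** Serializes back to the header form for the next hop. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }
}
```

The key property is that `traceId` never changes across the topic, which is why the HTTP span, the send span, and the receive span all appear in one trace.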
Warning: observation-enabled adds a span per record. For very high-throughput topics, drop management.tracing.sampling.probability (e.g. to 0.05) so tracing overhead stays bounded. Metrics remain unaffected because they are not sampled.
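A quick back-of-the-envelope estimate helps when choosing a probability. The sketch below assumes observation is enabled on both sides, so each sampled record yields one send span and one receive span, and that the sampling decision made at the start of the trace is inherited by both. `TracingBudget` and its method are hypothetical names used only for this estimate.

```java
/** Sketch: estimate exported span volume for a given sampling probability. */
final class TracingBudget {

    /** Spans per record when both template and listener observation are on (assumption). */
    static final int SPANS_PER_RECORD = 2;

    static double expectedSpansPerSecond(double recordsPerSecond, double samplingProbability) {
        if (samplingProbability < 0.0 || samplingProbability > 1.0) {
            throw new IllegalArgumentException("probability must be between 0 and 1");
        }
        return recordsPerSecond * SPANS_PER_RECORD * samplingProbability;
    }
}
```

At 20,000 records per second, a probability of 1.0 produces about 40,000 Kafka spans per second, while 0.05 brings that down to roughly 2,000.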
Adding custom tags
To slice metrics by environment or region, register a MeterRegistryCustomizer that adds common tags to every meter, including the Kafka ones.
@Bean
MeterRegistryCustomizer<MeterRegistry> commonTags(
        @Value("${app.region}") String region) {
    // Common tags are attached to every meter in the registry.
    return registry -> registry.config()
            .commonTags("application", "billing-service", "region", region);
}
Best Practices
- Treat records_lag_max as your primary consumer SLO and alert on sustained growth, not single spikes.
- Scrape broker-side consumer-group lag in addition to client metrics so rebalances and idle instances do not hide lag.
- Keep tracing sampling probability at 1.0 only in lower environments; sample a small fraction in production to control cost.
- Enable observationEnabled on both the template and the container factory; enabling only one side breaks span linkage.
- Export to a long-term store (Prometheus + Grafana for metrics, an OTLP-compatible backend for traces) rather than relying on in-memory registries.
- Add commonTags for application, environment, and region so dashboards stay readable across many services.
- Watch record_error_total and request_latency_avg together; rising latency before errors is your early warning of broker pressure.
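The "sustained growth, not single spikes" rule from the first item can be sketched as a small sliding-window check. This is only an illustration of the idea, with a hypothetical class name and window size. In practice the same condition would more likely be expressed as an alerting rule in your monitoring system than in application code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: flags lag only when it has risen on every step of a full window. */
final class SustainedLagDetector {

    private final int window;
    private final Deque<Long> samples = new ArrayDeque<>();

    SustainedLagDetector(int window) {
        if (window < 2) {
            throw new IllegalArgumentException("window must be at least 2");
        }
        this.window = window;
    }

    /** Records one records_lag_max sample; returns true on sustained growth. */
    boolean record(long lagMax) {
        samples.addLast(lagMax);
        if (samples.size() > window) {
            samples.removeFirst();
        }
        if (samples.size() < window) {
            return false; // not enough history yet
        }
        Long previous = null;
        for (long sample : samples) {
            if (previous != null && sample <= previous) {
                return false; // lag dropped or held steady somewhere in the window
            }
            previous = sample;
        }
        return true;
    }
}
```

With a window of 3, a single spike such as 10, 500, 20 never triggers, while three consecutive increases such as 20, 30, 40 do.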