JMX Metrics Reference
Kafka exposes its internal state through hundreds of JMX MBeans covering brokers, producers, and consumers. In production you rarely need all of them — a focused subset reliably tells you whether the cluster is healthy, whether clients are keeping up, and where a bottleneck is forming. This page is a reference of the highest-signal metrics, what each one indicates, and how to read them in an alerting strategy.
Enabling JMX
Kafka registers its MBeans inside the broker JVM on startup. Setting JMX_PORT before launching the broker opens a remote JMX port so external tools can read them. Most teams scrape these with the Prometheus JMX Exporter (run as a Java agent) and visualize them in Grafana, but you can also browse them ad hoc with jconsole or kafka-run-class.sh kafka.tools.JmxTool.
# Start the broker with JMX enabled
export JMX_PORT=9999
bin/kafka-server-start.sh config/kraft/server.properties
JMX object names follow a consistent pattern: a domain (such as kafka.server), a type key, a name key, and optional extra keys such as request or client-id. Each MBean then exposes one or more attributes, and you query a specific one such as Value, OneMinuteRate, or Count.
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
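The naming pattern above can be taken apart with the JDK's own `javax.management.ObjectName` class. The following is a minimal sketch; the class name `ObjectNameParts` and its `describe` helper are hypothetical and exist only for illustration.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Sketch: splitting a Kafka MBean name into its domain and key properties. */
final class ObjectNameParts {

    /** Returns "domain / type / name" for a JMX object name string. */
    static String describe(String rawName) {
        try {
            ObjectName objectName = new ObjectName(rawName);
            return objectName.getDomain()
                    + " / " + objectName.getKeyProperty("type")
                    + " / " + objectName.getKeyProperty("name");
        } catch (MalformedObjectNameException e) {
            // Surface bad input as an unchecked exception to keep callers simple.
            throw new IllegalArgumentException("Not a valid object name: " + rawName, e);
        }
    }

    public static void main(String[] args) {
        // prints "kafka.server / ReplicaManager / UnderReplicatedPartitions"
        System.out.println(describe(
                "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"));
    }
}
```

Attributes such as Value are not part of the object name; they are read from the MBean once it has been located.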
Tip: Prefer rate attributes (OneMinuteRate, MeanRate) for throughput metrics and Value for gauges. A raw Count is a monotonic counter; alert on its derivative, not its absolute value.
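As a rough illustration of alerting on a counter's derivative, the sketch below converts two samples of a Count attribute into a per-second rate. The class name `CounterRate`, the sample values, and the choice to return 0.0 when the counter goes backwards (for instance after a restart) are assumptions made for this example.

```java
/** Sketch: turning two samples of a monotonic JMX Count into a per-second rate. */
final class CounterRate {

    /**
     * Returns events per second between two samples. Returns 0.0 when no time
     * has elapsed or when the counter went backwards (assumed to mean a reset).
     */
    static double perSecond(long previousCount, long previousMillis,
                            long currentCount, long currentMillis) {
        long elapsedMillis = currentMillis - previousMillis;
        long delta = currentCount - previousCount;
        if (elapsedMillis <= 0 || delta < 0) {
            return 0.0;
        }
        return delta * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Hypothetical samples taken 15 seconds apart: 1,500 new events.
        System.out.println(perSecond(120_000L, 0L, 121_500L, 15_000L)); // prints 100.0
    }
}
```

A monitoring backend such as Prometheus normally performs this calculation for you; the sketch only shows what "derivative" means here.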
Broker metrics
Broker metrics are your first line of defense. The cluster-health gauges below should sit at zero in a healthy cluster, and a sustained non-zero value almost always warrants a page.
| MBean | Attribute | What it indicates |
|---|---|---|
| kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | Value | Partitions whose ISR is smaller than the replication factor. Non-zero means replicas are lagging or a broker is down, which risks data loss on failover. |
| kafka.controller:type=KafkaController,name=OfflinePartitionsCount | Value | Partitions with no active leader. Any value above zero means those partitions are unavailable for reads and writes. |
| kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | OneMinuteRate | Fraction of time request-handler threads are idle (0.0–1.0). Trending toward zero means the broker is CPU/IO saturated. |
| kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | OneMinuteRate | Inbound byte throughput from producers. Use for capacity planning and detecting traffic spikes. |
| kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | OneMinuteRate | Outbound byte throughput to consumers. Replication traffic to followers is reported separately as ReplicationBytesOutPerSec. A sudden drop can signal stalled consumers. |
| kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce | Mean, 99thPercentile | End-to-end produce request latency. Rising tail latency points to disk or replication pressure. |
A quick way to read a single value from the command line:
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --one-time true
Output:
"time","kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:Value"
1717200000000,0
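The same read can be done programmatically with the standard JMX remote API that ships with the JDK. This is a sketch rather than a recommended client: the class name `JmxGaugeReader` and its `readAttribute` helper are hypothetical, and the JMX service URL is expected as the first command-line argument (for example the URL passed to JmxTool above).

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/** Sketch: reading one MBean attribute over a JMX connection. */
final class JmxGaugeReader {

    /** Reads a single attribute, wrapping any JMX failure in an unchecked exception. */
    static Object readAttribute(MBeanServerConnection connection,
                                String objectName, String attribute) {
        try {
            return connection.getAttribute(new ObjectName(objectName), attribute);
        } catch (Exception e) {
            throw new IllegalStateException(
                    "Could not read " + objectName + ":" + attribute, e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: JmxGaugeReader <jmx-service-url>");
            return;
        }
        JMXServiceURL url = new JMXServiceURL(args[0]);
        // JMXConnector is Closeable, so the connection is released automatically.
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            Object value = readAttribute(
                    connector.getMBeanServerConnection(),
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
                    "Value");
            System.out.println("UnderReplicatedPartitions = " + value);
        }
    }
}
```

Because `readAttribute` accepts any `MBeanServerConnection`, it can be exercised locally against the JVM's platform MBean server without a running broker.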
Producer metrics
Client metrics live under the kafka.producer domain, keyed by client-id. They reveal whether your application is producing efficiently and how long brokers take to acknowledge writes.
| MBean | Attribute | What it indicates |
|---|---|---|
| kafka.producer:type=producer-metrics,client-id=* | record-send-rate | Records sent per second. A drop signals an application-side stall or backpressure. |
| kafka.producer:type=producer-metrics,client-id=* | request-latency-avg | Average time (ms) for the broker to acknowledge a produce request. Rising values indicate broker or network pressure. |
| kafka.producer:type=producer-metrics,client-id=* | record-error-rate | Failed sends per second. Should be zero; non-zero means serialization or delivery failures. |
| kafka.producer:type=producer-metrics,client-id=* | buffer-available-bytes | Free space in the producer’s send buffer. Approaching zero means send() will block for up to max.block.ms and then fail with a timeout. |
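The `client-id=*` in the table is a JMX property-value pattern: it matches one MBean per producer instance. A small sketch of how that matching behaves, using only the JDK; the class `ClientIdPattern` and the client id `orders-service` are made up for this example.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Sketch: matching concrete producer MBean names against a client-id wildcard. */
final class ClientIdPattern {

    /** True when the concrete name is matched by the (possibly wildcarded) pattern. */
    static boolean matches(String pattern, String concreteName) {
        try {
            return new ObjectName(pattern).apply(new ObjectName(concreteName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid object name", e);
        }
    }

    /** Extracts the client-id key from a concrete MBean name. */
    static String clientIdOf(String concreteName) {
        try {
            return new ObjectName(concreteName).getKeyProperty("client-id");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid object name", e);
        }
    }

    public static void main(String[] args) {
        String pattern = "kafka.producer:type=producer-metrics,client-id=*";
        // Hypothetical MBean registered by a producer with client.id=orders-service.
        String concrete = "kafka.producer:type=producer-metrics,client-id=orders-service";
        System.out.println(matches(pattern, concrete)); // prints true
        System.out.println(clientIdOf(concrete));       // prints orders-service
    }
}
```

Against a live JVM, the same pattern can be passed to `MBeanServerConnection.queryNames(pattern, null)` to enumerate every producer's MBean.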
In Spring Boot you can surface these same client metrics through Micrometer, which bridges Kafka’s metric registry into Actuator and your monitoring backend.
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.kafka.core.MicrometerProducerListener;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.stereotype.Component;

@Component
public class ProducerMetricsConfig {
    public ProducerMetricsConfig(ProducerFactory<String, String> producerFactory,
                                 MeterRegistry registry) {
        // MicrometerProducerListener is a ProducerFactory.Listener, so it is
        // registered on the producer factory rather than on the KafkaTemplate.
        producerFactory.addListener(new MicrometerProducerListener<>(registry));
    }
}
Consumer metrics
Consumer health is dominated by one number: lag. If a consumer group cannot keep pace with the partitions it owns, lag grows without bound and downstream data goes stale.
| MBean | Attribute | What it indicates |
|---|---|---|
| kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* | records-lag-max | Maximum lag (in records) across assigned partitions. The single most important consumer SLO. Alert when it grows steadily. |
| kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* | records-consumed-rate | Records consumed per second. Compare against the producer rate to spot a widening gap. |
| kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* | fetch-latency-avg | Average time (ms) to complete a fetch request. |
| kafka.consumer:type=consumer-coordinator-metrics,client-id=* | rebalance-rate-per-hour | How often the group rebalances. Frequent rebalances stall consumption and usually point to slow processing or session timeouts. |
Warning: records-lag-max is reported only for partitions the consumer is currently fetching. For an authoritative, group-wide view that includes idle or stuck members, also track committed-offset lag externally (for example with kafka-consumer-groups.sh --describe).
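Committed-offset lag is simple arithmetic once the offsets are known: for each partition, subtract the group's committed offset from the log end offset. The sketch below shows only that calculation. The class `GroupLag`, the string partition keys, and the sample offsets are hypothetical, and skipping partitions that have no committed offset is an assumption of this example, not a rule.

```java
import java.util.Map;

/** Sketch: summing committed-offset lag across the partitions of one group. */
final class GroupLag {

    /**
     * Sums max(0, logEndOffset - committedOffset) over all partitions.
     * Partitions with no committed offset are skipped (an assumption of this sketch).
     */
    static long totalLag(Map<String, Long> logEndOffsets,
                         Map<String, Long> committedOffsets) {
        long total = 0L;
        for (Map.Entry<String, Long> entry : logEndOffsets.entrySet()) {
            Long committed = committedOffsets.get(entry.getKey());
            if (committed == null) {
                continue;
            }
            total += Math.max(0L, entry.getValue() - committed);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical offsets keyed by "topic-partition".
        Map<String, Long> logEnd = Map.of("orders-0", 1_500L, "orders-1", 900L);
        Map<String, Long> committed = Map.of("orders-0", 1_200L, "orders-1", 900L);
        System.out.println(totalLag(logEnd, committed)); // prints 300
    }
}
```

In practice the two maps would be filled from whatever external source you already use for group offsets.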
Best practices
- Alert on the broker gauges (UnderReplicatedPartitions, OfflinePartitionsCount) being non-zero for more than a minute. These are unambiguous health signals.
- Watch RequestHandlerAvgIdlePercent as an early saturation indicator; act before it reaches zero, not after.
- Track consumer records-lag-max per group and alert on sustained growth rather than absolute thresholds, which vary by workload.
- Always scrape rate-derived attributes (OneMinuteRate) over raw counters so dashboards reflect current behavior, not lifetime totals.
- Tag client metrics with a stable, meaningful client-id so producer and consumer dashboards are attributable to specific services.
- Standardize on the Prometheus JMX Exporter with a curated rules file instead of exporting every MBean. Fewer, well-chosen series keep cardinality and cost under control.
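Two of the rules above, "non-zero for more than a minute" and "sustained growth", can be expressed as small checks over recent samples. This is a deliberately simple sketch that assumes samples arrive at a fixed scrape interval, oldest first. The class `AlertRules`, its method names, and the sample data are hypothetical, and a real alerting system would normally express these conditions in its own rule language.

```java
/** Sketch: two alert conditions evaluated over evenly spaced samples (oldest first). */
final class AlertRules {

    /** True when the most recent samples are all non-zero and span at least the window. */
    static boolean nonZeroFor(double[] values, int scrapeIntervalSeconds, int windowSeconds) {
        // Number of samples needed to cover the window, counting both endpoints.
        int needed = (windowSeconds + scrapeIntervalSeconds - 1) / scrapeIntervalSeconds + 1;
        if (values.length < needed) {
            return false; // not enough history to decide
        }
        for (int i = values.length - needed; i < values.length; i++) {
            if (values[i] == 0.0) {
                return false;
            }
        }
        return true;
    }

    /** True when the last `points` samples are strictly increasing. */
    static boolean growingFor(double[] values, int points) {
        if (points < 2 || values.length < points) {
            return false;
        }
        for (int i = values.length - points + 1; i < values.length; i++) {
            if (values[i] <= values[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical UnderReplicatedPartitions samples, scraped every 15 seconds.
        double[] underReplicated = {0, 2, 2, 2, 2, 2};
        System.out.println(nonZeroFor(underReplicated, 15, 60)); // prints true

        // Hypothetical records-lag-max samples for one consumer group.
        double[] lag = {10, 40, 90, 160};
        System.out.println(growingFor(lag, 4)); // prints true
    }
}
```

Requiring strictly increasing samples is a crude stand-in for "sustained growth"; a slope over a longer window would be less sensitive to noise.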