JMX Metrics Reference

Kafka exposes its internal state through hundreds of JMX MBeans covering brokers, producers, and consumers. In production you rarely need all of them — a focused subset reliably tells you whether the cluster is healthy, whether clients are keeping up, and where a bottleneck is forming. This page is a reference of the highest-signal metrics, what each one indicates, and how to read them in an alerting strategy.

Enabling JMX

Kafka's start scripts expose a remote JMX endpoint whenever the JMX_PORT environment variable is set on the broker process. Most teams scrape these metrics with the Prometheus JMX Exporter (run as a Java agent) and visualize them in Grafana, but you can also browse them ad hoc with jconsole or kafka-run-class.sh kafka.tools.JmxTool (packaged as org.apache.kafka.tools.JmxTool in newer releases).

# Start the broker with JMX enabled
export JMX_PORT=9999
bin/kafka-server-start.sh config/kraft/server.properties

The JMX object names follow a consistent pattern: a domain (such as kafka.server), a type, a name, and optional extra key properties (such as request=Produce or a client-id). On each MBean you then query a specific attribute like Value, OneMinuteRate, or Count.

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
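
To make the naming pattern concrete, here is a minimal sketch that splits an object name into its parts with the JDK's own javax.management.ObjectName class. The class name ObjectNameParts and the describe helper are illustrative, not part of Kafka.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

final class ObjectNameParts {

    /** Returns "domain / type / name" for a JMX object name string. */
    static String describe(String objectName) {
        try {
            ObjectName parsed = new ObjectName(objectName);
            // The domain is everything before the colon; type and name are key properties.
            return parsed.getDomain()
                    + " / " + parsed.getKeyProperty("type")
                    + " / " + parsed.getKeyProperty("name");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Not a valid JMX object name: " + objectName, e);
        }
    }

    public static void main(String[] args) {
        // Prints: kafka.server / ReplicaManager / UnderReplicatedPartitions
        System.out.println(describe(
                "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"));
    }
}
```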

Tip: Prefer rate attributes (OneMinuteRate, MeanRate) for throughput metrics and Value for gauges. A raw Count is a monotonic counter — alert on its derivative, not its absolute value.
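
As a rough illustration of alerting on a counter's derivative, the sketch below turns two samples of a Count attribute into a per-second rate. The CounterRate class and the choice to treat a decreasing count as a restart are assumptions made for this example, not Kafka behavior.

```java
final class CounterRate {

    /** Per-second rate between two samples of a monotonic counter. */
    static double perSecond(long prevCount, long prevMillis, long currCount, long currMillis) {
        long elapsedMs = currMillis - prevMillis;
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("samples must be in increasing time order");
        }
        long delta = currCount - prevCount;
        // Assumption: a smaller count means the process restarted and the counter
        // began again from zero, so the current count is the whole delta.
        if (delta < 0) {
            delta = currCount;
        }
        return delta * 1000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        // 3000 new events over 60 seconds. Prints: 50.0
        System.out.println(perSecond(1_000, 0, 4_000, 60_000));
    }
}
```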

Broker metrics

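Broker metrics are your first line of defense. The two cluster-health gauges at the top of the table, UnderReplicatedPartitions and OfflinePartitionsCount, should sit at zero in a healthy cluster, and a sustained non-zero value almost always warrants a page. The remaining rows are saturation, throughput, and latency signals to read as trends.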

| MBean | Attribute | What it indicates |
| --- | --- | --- |
| kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | Value | Partitions whose ISR is smaller than the replication factor. Non-zero means replicas are lagging or a broker is down, which raises the risk of data loss on failover. |
| kafka.controller:type=KafkaController,name=OfflinePartitionsCount | Value | Partitions with no active leader. Any value above zero means those partitions are unavailable for reads and writes. |
| kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | OneMinuteRate | Fraction of time request-handler threads are idle (0.0–1.0). Trending toward zero means the broker is CPU/IO saturated. |
| kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | OneMinuteRate | Inbound byte throughput from producers. Use for capacity planning and detecting traffic spikes. |
| kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | OneMinuteRate | Outbound byte throughput to consumers. Replication traffic to other brokers is reported separately under ReplicationBytesOutPerSec. A sudden drop can signal stalled consumers. |
| kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce | Mean, 99thPercentile | End-to-end produce request latency. Rising tail latency points to disk or replication pressure. |

A quick way to read a single value from the command line:

bin/kafka-run-class.sh kafka.tools.JmxTool \
  --object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --one-time true

Output:

"time","kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:Value"
1717200000000,0
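
The same read can be done programmatically with the JDK's standard JMX remote API. The following is a minimal sketch that assumes the broker from the earlier example is listening on localhost:9999 without authentication or TLS; the JmxGaugeReader class and readAttribute helper are illustrative names.

```java
import java.io.IOException;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

final class JmxGaugeReader {

    /** Reads one attribute of one MBean over an existing connection. */
    static Object readAttribute(MBeanServerConnection connection,
                                String objectName, String attribute) {
        try {
            return connection.getAttribute(new ObjectName(objectName), attribute);
        } catch (JMException | IOException e) {
            throw new IllegalStateException(
                    "Could not read " + attribute + " from " + objectName, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: same host and port as the JmxTool command above.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            Object value = readAttribute(connector.getMBeanServerConnection(),
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
                    "Value");
            System.out.println("UnderReplicatedPartitions=" + value);
        }
    }
}
```

Keeping readAttribute separate from the connection setup means the same helper works against any MBeanServerConnection, including the local platform MBean server in a test.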

Producer metrics

Client metrics live under the kafka.producer domain, keyed by client-id. They reveal whether your application is producing efficiently and how long brokers take to acknowledge writes.

| MBean | Attribute | What it indicates |
| --- | --- | --- |
| kafka.producer:type=producer-metrics,client-id=* | record-send-rate | Records sent per second. A drop signals an application-side stall or backpressure. |
| kafka.producer:type=producer-metrics,client-id=* | request-latency-avg | Average time (ms) for the broker to acknowledge a produce request. Rising values indicate broker or network pressure. |
| kafka.producer:type=producer-metrics,client-id=* | record-error-rate | Failed sends per second. Should be zero; non-zero means serialization or delivery failures. |
| kafka.producer:type=producer-metrics,client-id=* | buffer-available-bytes | Free space in the producer’s send buffer. Approaching zero means send() will block (for up to max.block.ms) and then throw. |

In Spring Boot you can surface these same client metrics through Micrometer, which bridges Kafka’s metric registry into Actuator and your monitoring backend.

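import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.kafka.core.MicrometerProducerListener;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.stereotype.Component;

@Component
public class ProducerMetricsConfig {

    // MicrometerProducerListener is a ProducerFactory.Listener, so it is registered
    // on the producer factory rather than on the KafkaTemplate.
    public ProducerMetricsConfig(ProducerFactory<String, String> producerFactory,
                                 MeterRegistry registry) {
        // Binds each producer the factory creates to the Micrometer registry.
        producerFactory.addListener(new MicrometerProducerListener<>(registry));
    }
}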

Consumer metrics

Consumer health is dominated by one number: lag. If a consumer group cannot keep pace with the partitions it owns, lag grows without bound and downstream data goes stale.

| MBean | Attribute | What it indicates |
| --- | --- | --- |
| kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* | records-lag-max | Maximum lag (in records) across assigned partitions. The single most important consumer health signal; alert when it grows steadily. |
| kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* | records-consumed-rate | Records consumed per second. Compare against the producer rate to spot a widening gap. |
| kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* | fetch-latency-avg | Average time (ms) to complete a fetch request. |
| kafka.consumer:type=consumer-coordinator-metrics,client-id=* | rebalance-rate-per-hour | How often the group rebalances. Frequent rebalances stall consumption and usually point to slow processing or session timeouts. |

Warning: records-lag-max is reported only for partitions the consumer is currently fetching. For an authoritative, group-wide view that includes idle or stuck members, also track committed-offset lag externally (for example with kafka-consumer-groups.sh --describe).
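
Committed-offset lag is simply the log end offset minus the group's committed offset, per partition. The sketch below shows that arithmetic on plain maps keyed by partition number; the CommittedLag class is hypothetical, and how you obtain the two offset maps (for example by parsing kafka-consumer-groups.sh output) is left out. As a simplification, partitions with no committed offset are skipped.

```java
import java.util.Map;

final class CommittedLag {

    /** Sums (log end offset - committed offset) over partitions that have a commit. */
    static long totalLag(Map<Integer, Long> logEndOffsets,
                         Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            Long committed = committedOffsets.get(entry.getKey());
            if (committed == null) {
                continue; // Simplification: no commit yet, so no lag is counted.
            }
            // Clamp at zero in case the two snapshots were taken at slightly different times.
            total += Math.max(0L, entry.getValue() - committed);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Long> logEnd = Map.of(0, 120L, 1, 95L);
        Map<Integer, Long> committed = Map.of(0, 100L, 1, 95L);
        // Partition 0 is 20 records behind, partition 1 is caught up. Prints: 20
        System.out.println(totalLag(logEnd, committed));
    }
}
```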

Best practices

  • Alert on the broker gauges (UnderReplicatedPartitions, OfflinePartitionsCount) being non-zero for more than a minute — these are unambiguous health signals.
  • Watch RequestHandlerAvgIdlePercent as an early saturation indicator; act before it reaches zero, not after.
  • Track consumer records-lag-max per group and alert on sustained growth rather than absolute thresholds, which vary by workload.
  • Prefer rate-derived attributes (OneMinuteRate) to raw counters when scraping, so dashboards reflect current behavior, not lifetime totals.
  • Tag client metrics with a stable, meaningful client-id so producer and consumer dashboards are attributable to specific services.
  • Standardize on the Prometheus JMX Exporter with a curated rules file instead of exporting every MBean — fewer, well-chosen series keep cardinality and cost under control.
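
The "non-zero for more than a minute" rule from the first bullet can be sketched as a small check over recent gauge samples. This is one possible formulation under stated assumptions, not a prescribed implementation: the SustainedNonZero class is hypothetical, and each sample is assumed to be a two-element array of timestamp in milliseconds and gauge value, ordered oldest first.

```java
import java.util.List;

final class SustainedNonZero {

    /**
     * True when the gauge has been non-zero in every sample covering at least
     * the last minDurationMs. Each sample is {timestampMs, value}, oldest first.
     */
    static boolean shouldAlert(List<long[]> samples, long minDurationMs) {
        if (samples.isEmpty()) {
            return false;
        }
        long newestTimestamp = samples.get(samples.size() - 1)[0];
        // Walk backwards from the newest sample until the streak is long enough or broken.
        for (int i = samples.size() - 1; i >= 0; i--) {
            long[] sample = samples.get(i);
            if (sample[1] == 0) {
                return false; // The gauge recovered within the window.
            }
            if (newestTimestamp - sample[0] >= minDurationMs) {
                return true; // Non-zero for the whole window.
            }
        }
        return false; // Not enough history to cover the window yet.
    }

    public static void main(String[] args) {
        List<long[]> samples = List.of(
                new long[] {0, 2}, new long[] {30_000, 2}, new long[] {60_000, 1});
        // Non-zero across a full 60 seconds. Prints: true
        System.out.println(shouldAlert(samples, 60_000));
    }
}
```

In a Prometheus setup the equivalent logic normally lives in an alert rule with a hold duration, so a helper like this is mainly useful for custom tooling.
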
Last updated June 1, 2026