JMX Metrics Reference

Kafka exposes its internal state through hundreds of JMX MBeans covering brokers, producers, and consumers. In production you rarely need all of them: a focused subset reliably tells you whether the cluster is healthy, whether clients are keeping up, and where a bottleneck is forming. This page is a reference of the highest-signal metrics, what each one indicates, and how to use it in an alerting strategy.

Enabling JMX

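Kafka registers its MBeans inside the broker JVM unconditionally; setting JMX_PORT on the broker process opens a remote JMX port so that external tools can read them. Most teams scrape these metrics with the Prometheus JMX Exporter (run as a Java agent inside the broker JVM, so it does not depend on the remote port) and visualize them in Grafana, but you can also browse them ad hoc with jconsole or kafka-run-class.sh kafka.tools.JmxTool.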

# Start the broker with JMX enabled
export JMX_PORT=9999
bin/kafka-server-start.sh config/kraft/server.properties

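The JMX object names follow a consistent pattern: a domain (such as kafka.server), a type, a name, and optional extra key properties (such as request=Produce or a topic name) that narrow the MBean further. On the resulting MBean you query a specific attribute like Value, OneMinuteRate, or Count.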

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
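The naming pattern above can be explored with the JDK's own javax.management.ObjectName class. The following is a minimal sketch; the ObjectNameParts class and its describe method are illustrative names, not part of Kafka.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Illustrative helper (hypothetical, not part of Kafka): splits an MBean name into its parts.
class ObjectNameParts {

    // Returns "domain | type | name" for a Kafka-style object name.
    static String describe(String rawName) {
        try {
            ObjectName on = new ObjectName(rawName);
            return on.getDomain() + " | " + on.getKeyProperty("type") + " | " + on.getKeyProperty("name");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("not a valid JMX object name: " + rawName, e);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"));
        // Extra key properties such as request=Produce select one MBean within a family.
        ObjectName produce = new ObjectName("kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce");
        System.out.println(produce.getKeyProperty("request"));
    }
}
```

Parsing names this way is mostly useful when writing exporter rules or filters, where the domain, type, and name usually become the metric name and the remaining key properties become labels.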

Tip: Prefer OneMinuteRate for throughput metrics and Value for gauges. MeanRate averages over the whole lifetime of the process, so it reacts slowly to recent changes. A raw Count is a monotonic counter: alert on its derivative, not its absolute value.
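To make the "derivative of Count" idea concrete, here is a small sketch that turns two Count samples into a per-second rate. The CounterRate class and the sample values are hypothetical; monitoring backends normally do this for you (for example with a rate function), so treat it as an illustration of the arithmetic only.

```java
// Illustrative helper (hypothetical): derives a per-second rate from two samples of a monotonic Count.
class CounterRate {

    // countBefore/countAfter are Count readings; the timestamps are epoch milliseconds.
    static double perSecond(long countBefore, long timeBeforeMs, long countAfter, long timeAfterMs) {
        long elapsedMs = timeAfterMs - timeBeforeMs;
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("samples must be in chronological order");
        }
        long delta = countAfter - countBefore;
        // A negative delta means the counter was reset (for example by a process restart);
        // treat the new reading as the growth since the reset.
        if (delta < 0) {
            delta = countAfter;
        }
        return delta * 1000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        // Hypothetical samples taken 15 seconds apart: 30,000 new events in 15 s.
        System.out.println(perSecond(1_200_000L, 1_717_200_000_000L, 1_230_000L, 1_717_200_015_000L)); // 2000.0
    }
}
```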

Broker metrics

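Broker metrics are your first line of defense. The first two gauges below (UnderReplicatedPartitions and OfflinePartitionsCount) should sit at zero in a healthy cluster, and a sustained non-zero value almost always warrants a page. The remaining rows track saturation, throughput, and latency, and are read as trends rather than against zero.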

MBean	Attribute	What it indicates
`kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`	`Value`	Partitions whose ISR is smaller than the replication factor. Non-zero means replicas are lagging or a broker is down — risk of data loss on failover.
`kafka.controller:type=KafkaController,name=OfflinePartitionsCount`	`Value`	Partitions with no active leader. Any value above zero means those partitions are unavailable for reads and writes.
`kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent`	`OneMinuteRate`	Fraction of time request-handler threads are idle (0.0–1.0). Trending toward zero means the broker is CPU/IO saturated.
`kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec`	`OneMinuteRate`	Inbound byte throughput from producers. Use for capacity planning and detecting traffic spikes.
`kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec`	`OneMinuteRate`	Outbound byte throughput to consumers. Replication traffic to other brokers is reported separately as `ReplicationBytesOutPerSec`. A sudden drop can signal stalled consumers.
`kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce`	`Mean`, `99thPercentile`	End-to-end produce request latency. Rising tail latency points to disk or replication pressure.

A quick way to read a single value from the command line:

bin/kafka-run-class.sh kafka.tools.JmxTool \
  --object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --one-time true

Output:

"time","kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:Value"
1717200000000,0
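The same read can be done programmatically with the standard javax.management.remote API. This is a minimal sketch under stated assumptions: JmxAttributeReader is a hypothetical class name, the JMX URL is the one used in the command above, and no authentication or TLS is configured. When no URL is passed, it reads a built-in JVM MBean as a stand-in so the call shape can be tried without a broker.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Illustrative helper (hypothetical): reads a single MBean attribute over JMX.
class JmxAttributeReader {

    // Reads one attribute from an already-open MBean server connection.
    static Object read(MBeanServerConnection conn, String objectName, String attribute) {
        try {
            return conn.getAttribute(new ObjectName(objectName), attribute);
        } catch (Exception e) {
            throw new IllegalStateException("cannot read " + objectName + ":" + attribute, e);
        }
    }

    // Opens a remote connection, reads the attribute, and closes the connection again.
    static Object readRemote(String jmxUrl, String objectName, String attribute) throws Exception {
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(jmxUrl))) {
            return read(connector.getMBeanServerConnection(), objectName, attribute);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            // Example argument: service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi
            System.out.println(readRemote(args[0],
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions", "Value"));
        } else {
            // No broker URL given: read a JVM built-in MBean to demonstrate the same call.
            System.out.println(read(ManagementFactory.getPlatformMBeanServer(),
                    "java.lang:type=Runtime", "VmName"));
        }
    }
}
```

Splitting the connection handling from the attribute read keeps the read method usable against any MBeanServerConnection, including the in-process platform server.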

Producer metrics

Producer client metrics live under the kafka.producer domain, keyed by client-id, and are registered in the application's JVM rather than the broker's. They reveal whether your application is producing efficiently and how long brokers take to acknowledge writes.

MBean	Attribute	What it indicates
`kafka.producer:type=producer-metrics,client-id=*`	`record-send-rate`	Records sent per second. A drop signals an application-side stall or backpressure.
`kafka.producer:type=producer-metrics,client-id=*`	`request-latency-avg`	Average time (ms) for the broker to acknowledge a produce request. Rising values indicate broker or network pressure.
`kafka.producer:type=producer-metrics,client-id=*`	`record-error-rate`	Failed sends per second. Should be zero; non-zero means serialization or delivery failures.
`kafka.producer:type=producer-metrics,client-id=*`	`buffer-available-bytes`	Free space in the producer’s send buffer. Approaching zero means `send()` will block or throw.

In Spring Boot you can surface these same client metrics through Micrometer, which bridges Kafka’s metric registry into Actuator and your monitoring backend.

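import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.kafka.core.MicrometerProducerListener;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.stereotype.Component;

@Component
public class ProducerMetricsConfig {

    // MicrometerProducerListener is a ProducerFactory.Listener, so it is
    // registered on the producer factory rather than on KafkaTemplate
    // (KafkaTemplate.setProducerListener expects a send-result ProducerListener).
    public ProducerMetricsConfig(ProducerFactory<String, String> factory,
                                 MeterRegistry registry) {
        // Binds the metrics of every producer the factory creates to the registry.
        factory.addListener(new MicrometerProducerListener<>(registry));
    }
}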

Consumer metrics

Consumer health is dominated by one number: lag. If a consumer group cannot keep pace with the partitions it owns, lag grows without bound and downstream data goes stale.

MBean	Attribute	What it indicates
`kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*`	`records-lag-max`	Maximum lag (in records) across assigned partitions. The single most important consumer health signal; alert when it grows steadily.
`kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*`	`records-consumed-rate`	Records consumed per second. Compare against the producer rate to spot a widening gap.
`kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*`	`fetch-latency-avg`	Average time (ms) to complete a fetch request.
`kafka.consumer:type=consumer-coordinator-metrics,client-id=*`	`rebalance-rate-per-hour`	How often the group rebalances. Frequent rebalances stall consumption and usually point to slow processing or session timeouts.

Warning: records-lag-max is reported only for partitions the consumer is currently fetching. For an authoritative, group-wide view that includes idle or stuck members, also track committed-offset lag externally (for example with kafka-consumer-groups.sh --describe).
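Committed-offset lag is simple arithmetic: for each partition, subtract the group's committed offset from the log end offset. The sketch below shows that calculation on hypothetical numbers of the kind kafka-consumer-groups.sh --describe prints in its CURRENT-OFFSET and LOG-END-OFFSET columns; GroupLag and its inputs are illustrative, and counting uncommitted partitions from offset 0 is a simplifying assumption.

```java
import java.util.Map;

// Illustrative helper (hypothetical): computes group-wide committed-offset lag.
class GroupLag {

    // Total lag = sum over partitions of (log end offset - committed offset).
    // Partitions with no committed offset are counted from offset 0 (a simplification).
    static long totalLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(e.getKey(), 0L);
            // Clamp at zero in case the two snapshots were taken at slightly different times.
            total += Math.max(0L, e.getValue() - committed);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical offsets keyed by partition number.
        Map<Integer, Long> end = Map.of(0, 1_500L, 1, 2_000L, 2, 900L);
        Map<Integer, Long> committed = Map.of(0, 1_450L, 1, 1_200L); // partition 2 has no commit yet
        System.out.println(totalLag(end, committed)); // 50 + 800 + 900 = 1750
    }
}
```

Because this view depends only on offsets stored on the brokers, it still reports lag for partitions whose owning consumer is idle or stuck.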

Best practices

Alert on the broker gauges (UnderReplicatedPartitions, OfflinePartitionsCount) being non-zero for more than a minute — these are unambiguous health signals.
Watch RequestHandlerAvgIdlePercent as an early saturation indicator; act before it reaches zero, not after.
Track consumer records-lag-max per group and alert on sustained growth rather than absolute thresholds, which vary by workload.
On dashboards, chart rate-derived attributes (OneMinuteRate), or the derivative of Count, rather than raw counter values, so panels reflect current behavior, not lifetime totals.
Tag client metrics with a stable, meaningful client-id so producer and consumer dashboards are attributable to specific services.
Standardize on the Prometheus JMX Exporter with a curated rules file instead of exporting every MBean — fewer, well-chosen series keep cardinality and cost under control.
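The two alerting conditions recommended above ("non-zero for more than a minute" and "sustained growth") can be sketched as simple checks over recent samples. This is an illustration only: AlertConditions, the window sizes, and the sample values are assumptions, and in practice the same logic would usually live in your alerting system's rule language.

```java
// Illustrative helper (hypothetical): evaluates alert conditions over scraped samples, oldest first.
class AlertConditions {

    // True when the last `window` samples are all non-zero (e.g. UnderReplicatedPartitions).
    static boolean sustainedNonZero(long[] samples, int window) {
        if (window <= 0 || samples.length < window) return false;
        for (int i = samples.length - window; i < samples.length; i++) {
            if (samples[i] == 0) return false;
        }
        return true;
    }

    // True when each of the last `window` samples exceeds the one before it (e.g. records-lag-max).
    static boolean steadilyGrowing(long[] samples, int window) {
        if (window <= 1 || samples.length < window) return false;
        for (int i = samples.length - window + 1; i < samples.length; i++) {
            if (samples[i] <= samples[i - 1]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Assuming a 15-second scrape interval, 5 consecutive samples span one minute.
        long[] underReplicated = {0, 2, 2, 3, 2, 1};
        System.out.println(sustainedNonZero(underReplicated, 5)); // true

        long[] lagMax = {100, 90, 120, 180, 260, 400};
        System.out.println(steadilyGrowing(lagMax, 5)); // true
        System.out.println(steadilyGrowing(lagMax, 6)); // false: 90 is not above 100
    }
}
```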