Production Readiness Checklist

Standing up a Kafka cluster is easy; running one you can trust under load, during a 3 a.m. page, and through a datacenter failure is the hard part. The difference between a demo cluster and a production cluster is a set of deliberate choices about durability, security, observability, and operational process, most of which are cheap to make up front and painful to retrofit. This page is a pre-launch checklist to walk top to bottom before the first critical workload lands on the cluster, with the concrete settings, commands, and rationale behind each item.

Durability and replication

Durability is the foundation: if a broker dies, no acknowledged message may be lost. The canonical safe configuration is replication factor 3, min.insync.replicas=2, and producers writing with acks=all. Together these guarantee a write is acknowledged only once it lives on at least two brokers, so a single broker failure never costs you data, and with three replicas the partition keeps accepting writes while one broker is down. Setting acks=all without raising min.insync.replicas above 1 is a common trap: when the ISR shrinks to the leader alone, the write is still acknowledged with a single copy, silently weakening your guarantee.
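The interaction between acks and min.insync.replicas can be sketched as a small model. This is an illustrative simplification, not broker code; the function names are hypothetical, and it assumes every replica that is not offline is in sync.

```python
def guaranteed_copies(acks: str, min_insync_replicas: int) -> int:
    """Fewest replicas holding a record at the moment it is acknowledged."""
    if acks == "all":
        # The leader rejects the write if the ISR is smaller than this floor.
        return min_insync_replicas
    if acks == "1":
        return 1  # leader only
    return 0  # acks=0: nothing is confirmed


def accepts_writes(replication_factor: int, min_insync_replicas: int,
                   brokers_down: int) -> bool:
    """True if an acks=all producer can still write with replicas offline."""
    in_sync = replication_factor - brokers_down
    return in_sync >= min_insync_replicas


print(guaranteed_copies("all", 2))   # 2: the recommended setting
print(guaranteed_copies("all", 1))   # 1: the trap described above
print(accepts_writes(3, 2, 1))       # True: one broker down, still writable
print(accepts_writes(3, 2, 2))       # False: writes rejected rather than under-replicated
```

The last line is the intended trade-off: with two of three replicas gone, the partition refuses writes instead of accepting data it cannot protect.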

Pin these as broker and topic defaults so new topics inherit them rather than relying on every team to remember:

# Broker-level defaults (server.properties)
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
auto.create.topics.enable=false

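unclean.leader.election.enable=false forbids an out-of-sync replica from becoming leader, which would otherwise discard committed records to restore availability. auto.create.topics.enable=false keeps a mistyped topic name in a client from silently creating a new topic, so every topic is created deliberately. On the producer side, require full acknowledgement and idempotence: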

acks=all
enable.idempotence=true
retries=2147483647
max.in.flight.requests.per.connection=5
delivery.timeout.ms=120000
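One way to keep these settings from drifting is to audit producer property files in CI. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, and it deliberately treats a missing acks or enable.idempotence as a finding, because older client versions default to weaker values. The limits it checks (acks=all, at most 5 in-flight requests, retries above zero) are the conditions the idempotent producer requires.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def audit_producer(props: dict[str, str]) -> list[str]:
    """Return human-readable findings; an empty list means the config passes."""
    findings = []
    if props.get("acks") not in ("all", "-1"):
        findings.append("acks should be all")
    if props.get("enable.idempotence") != "true":
        findings.append("enable.idempotence should be true")
    if int(props.get("max.in.flight.requests.per.connection", "5")) > 5:
        findings.append("max.in.flight.requests.per.connection must be <= 5")
    if int(props.get("retries", "1")) < 1:
        findings.append("retries must be greater than 0")
    return findings


good = parse_properties("""
acks=all
enable.idempotence=true
retries=2147483647
max.in.flight.requests.per.connection=5
delivery.timeout.ms=120000
""")
print(audit_producer(good))  # []
```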

Verify durability empirically, not just in config: kill a broker in staging while a producer runs with acks=all and confirm zero records are lost and zero are duplicated. Config that looks correct but was never tested is not a guarantee.
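The comparison step of such a test can be as simple as the sketch below. It assumes the test harness records a unique ID for every send the producer saw acknowledged and every record a verifying consumer read back; how those IDs are collected is left out, and the function name is hypothetical.

```python
from collections import Counter


def compare_delivery(acked_ids, consumed_ids):
    """Return (lost, duplicated) IDs for a broker-kill test run."""
    consumed = Counter(consumed_ids)
    # Acknowledged by the broker but never read back: a durability failure.
    lost = sorted(set(acked_ids) - set(consumed))
    # Read back more than once: an idempotence failure.
    duplicated = sorted(i for i, n in consumed.items() if n > 1)
    return lost, duplicated


lost, duplicated = compare_delivery(acked_ids=range(5),
                                    consumed_ids=[0, 1, 1, 3, 4])
print(lost, duplicated)  # [2] [1]
```

A passing run returns two empty lists. Records that were sent but never acknowledged are excluded from acked_ids on purpose, since the guarantee only covers acknowledged writes.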

Security

A production cluster should never accept anonymous, plaintext traffic. Enforce TLS for encryption in transit, SASL for authentication, and ACLs for authorization so each client can only touch the topics it owns.

Concern	Mechanism	Notes
Encryption in transit	TLS on `SSL`/`SASL_SSL` listeners	Disable the plaintext listener entirely in prod.
Authentication	SASL (SCRAM-SHA-512 or OAUTHBEARER), or mutual TLS	Avoid `PLAIN` over a non-TLS listener.
Authorization	ACLs via `StandardAuthorizer` (KRaft)	Default-deny: `allow.everyone.if.no.acl.found=false`.
Secrets	External vault / K8s secrets	Never bake keystores or passwords into images.

listeners=SASL_SSL://:9093
security.inter.broker.protocol=SASL_SSL
sasl.enabled.mechanisms=SCRAM-SHA-512
# Defaults to GSSAPI if unset, which is not enabled above
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
allow.everyone.if.no.acl.found=false
ssl.keystore.location=/etc/kafka/secrets/broker.keystore.jks

Grant least-privilege ACLs per principal. For example, allow an order service to produce only to its own topic:

kafka-acls.sh --bootstrap-server kafka:9093 --command-config admin.properties \
  --add --allow-principal User:order-service \
  --operation Write --topic orders
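When there are many services, generating these commands from a reviewed grant list keeps ACLs auditable. The following is a rough sketch: the grant table and the payments-service principal are hypothetical, and it only prints commands built from the same kafka-acls.sh flags used above (plus --group for consumer group access) rather than running them.

```python
import shlex


def acl_command(principal, operation, resource_flag, resource_name,
                bootstrap="kafka:9093", command_config="admin.properties"):
    """Build the argv for one kafka-acls.sh --add call."""
    return [
        "kafka-acls.sh", "--bootstrap-server", bootstrap,
        "--command-config", command_config,
        "--add", "--allow-principal", f"User:{principal}",
        "--operation", operation, resource_flag, resource_name,
    ]


# Hypothetical grant list: one row per (principal, operation, resource).
grants = [
    ("order-service", "Write", "--topic", "orders"),
    ("payments-service", "Read", "--topic", "orders"),
    ("payments-service", "Read", "--group", "payments"),
]

commands = [shlex.join(acl_command(*grant)) for grant in grants]
for command in commands:
    print(command)
```

A consumer needs Read on both the topic and its consumer group, which is why the second principal gets two rows.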

Monitoring and alerting

You cannot operate what you cannot see. Export broker, producer, and consumer JMX metrics to your time-series stack (Prometheus via the JMX exporter is the common path) and define alerts on the signals that actually predict incidents. The non-negotiable set:

Metric	Why it matters	Alert when
`UnderReplicatedPartitions`	Replication is falling behind / broker down	`> 0` for several minutes
`OfflinePartitionsCount`	Partitions with no leader = unavailable	`> 0` (page immediately)
`ActiveControllerCount`	Must be exactly 1 across the cluster	sum across brokers `!= 1`
Consumer group lag	Consumers can’t keep up	sustained growth / SLA breach
Request handler idle ratio	Broker CPU saturation	`< 0.3`
Disk usage	Log dirs filling up	`> 80%`
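The thresholds in the table translate directly into rules. The sketch below evaluates a single metrics snapshot and is only an illustration: in practice these would be alert rules in the monitoring stack with a "for several minutes" duration where the table asks for one. The snapshot keys are assumptions; DiskUsedFraction in particular is a hypothetical name for a host-level disk metric, and ActiveControllerCount is assumed to be already summed across brokers.

```python
# (metric name, predicate that returns True when the alert should fire)
RULES = [
    ("OfflinePartitionsCount", lambda v: v > 0),
    ("UnderReplicatedPartitions", lambda v: v > 0),
    ("ActiveControllerCount", lambda v: v != 1),
    ("RequestHandlerAvgIdlePercent", lambda v: v < 0.3),
    ("DiskUsedFraction", lambda v: v > 0.80),
]


def firing_alerts(snapshot: dict) -> list[str]:
    """Return the names of metrics whose threshold is breached."""
    return [name for name, breached in RULES
            if name in snapshot and breached(snapshot[name])]


snapshot = {
    "OfflinePartitionsCount": 0,
    "UnderReplicatedPartitions": 4,
    "ActiveControllerCount": 1,
    "RequestHandlerAvgIdlePercent": 0.22,
    "DiskUsedFraction": 0.61,
}
print(firing_alerts(snapshot))
```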

# Quick lag check for a consumer group
kafka-consumer-groups.sh --bootstrap-server kafka:9093 \
  --command-config admin.properties --describe --group payments

Output (trailing consumer columns trimmed):

GROUP     TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
payments  orders    0          184523          184525          2
payments  orders    1          201118          201118          0
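For ad hoc checks or a quick script, the tabular output can be summed per group and topic. This is a minimal sketch that assumes whitespace-separated columns with a header row as shown above; the function name is hypothetical, and a production setup would normally scrape lag through an exporter instead of parsing CLI text.

```python
def total_lag(describe_output: str) -> dict[tuple[str, str], int]:
    """Sum the LAG column per (group, topic) from --describe output."""
    rows = [line.split() for line in describe_output.splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    group_col = header.index("GROUP")
    topic_col = header.index("TOPIC")
    lag_col = header.index("LAG")
    totals: dict[tuple[str, str], int] = {}
    for row in data:
        if row[lag_col] == "-":
            continue  # partition has no committed offset yet
        key = (row[group_col], row[topic_col])
        totals[key] = totals.get(key, 0) + int(row[lag_col])
    return totals


sample = """
GROUP     TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
payments  orders    0          184523          184525          2
payments  orders    1          201118          201118          0
"""
print(total_lag(sample))  # {('payments', 'orders'): 2}
```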

Capacity and retention

Size the cluster from the workload, not from a guess. Estimate peak write throughput, multiply by replication factor for actual disk and network load, and add headroom so a broker can fail without the survivors saturating. Set retention deliberately per topic — by time, by size, or both — and confirm log directories have room for the configured retention plus a safety margin.

# Per-topic retention example: 7 days by time, no size cap, 1 GiB segments
retention.ms=604800000
retention.bytes=-1
segment.bytes=1073741824

Leave at least 30-40% disk headroom: a full log directory takes a broker offline, and recovery is far harder than prevention.
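The sizing arithmetic can be written down so the inputs are explicit and reviewable. The numbers below are made up for illustration, the function name is hypothetical, and the model is deliberately simple: it assumes evenly balanced partitions, time-based retention only, decimal units (1 TB = 1,000,000 MB), and ignores compression and index overhead.

```python
def size_cluster(avg_write_mb_s: float, peak_write_mb_s: float,
                 replication_factor: int, retention_hours: float,
                 brokers: int, disk_headroom: float) -> dict[str, float]:
    """Estimate per-broker disk and worst-case ingest with one broker down."""
    # Every produced byte is stored once per replica for the retention window.
    retained_mb = avg_write_mb_s * replication_factor * retention_hours * 3600
    # Divide by (1 - headroom) so the retained data fills only part of the disk.
    disk_per_broker_tb = retained_mb / brokers / (1 - disk_headroom) / 1e6
    # N-1: the surviving brokers absorb the failed broker's share of writes.
    ingest_per_broker_n1 = peak_write_mb_s * replication_factor / (brokers - 1)
    return {
        "disk_per_broker_tb": disk_per_broker_tb,
        "ingest_mb_s_per_broker_n1": ingest_per_broker_n1,
    }


plan = size_cluster(avg_write_mb_s=20, peak_write_mb_s=50,
                    replication_factor=3, retention_hours=168,
                    brokers=6, disk_headroom=0.40)
print(plan)
```

With these inputs each broker needs roughly 10 TB of log storage and must sustain about 30 MB/s of replicated writes while one broker is offline.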

Client configuration

Brokers are only half the system. Audit every client for production-grade settings before launch: producers using acks=all and idempotence (above), consumers with explicit, intentional offset and concurrency settings, and sane timeouts everywhere.

# Consumer essentials
enable.auto.commit=false
isolation.level=read_committed
max.poll.records=500
session.timeout.ms=45000
auto.offset.reset=earliest

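Disabling auto-commit and committing only after successful processing is what gives you at-least-once delivery without silent message loss on rebalance. The trade-off is that a crash between processing and commit redelivers the uncommitted records, so handlers should tolerate duplicates.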
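The commit-after-processing pattern can be illustrated without a broker. In the sketch below, FakePartition is a hypothetical in-memory stand-in, not a Kafka client class: its poll always restarts from the last committed offset, which models what a consumer sees after a crash or rebalance.

```python
class FakePartition:
    """In-memory stand-in for one partition plus its committed offset."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0  # next offset to read after a restart

    def poll(self, max_records=500):
        start = self.committed
        end = min(start + max_records, len(self.records))
        return [(offset, self.records[offset]) for offset in range(start, end)]

    def commit(self, next_offset):
        self.committed = next_offset


def run_once(partition, handler):
    """Process one batch and commit only after every record succeeded."""
    batch = partition.poll()
    for _offset, value in batch:
        handler(value)  # an exception here leaves the batch uncommitted
    if batch:
        partition.commit(batch[-1][0] + 1)


processed = []
fail_once = {"b"}


def handler(value):
    if value in fail_once:
        fail_once.discard(value)
        raise RuntimeError(f"transient failure on {value}")
    processed.append(value)


partition = FakePartition(["a", "b", "c"])
try:
    run_once(partition, handler)
except RuntimeError:
    pass  # simulated crash: nothing was committed
run_once(partition, handler)  # restart resumes from the last commit
print(processed, partition.committed)  # ['a', 'a', 'b', 'c'] 3
```

Record "a" is processed twice and nothing is skipped, which is exactly the at-least-once behavior: duplicates are possible, loss is not.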

Disaster recovery and runbooks

Decide your recovery posture before you need it. Document the recovery point and time objectives (RPO/RTO), and back them with a tested mechanism — MirrorMaker 2 to a standby cluster for multi-region DR, plus periodic config and ACL backups. Critically, rehearse the failover; an untested DR plan is a hope, not a plan.

Finally, write runbooks for the predictable incidents (broker down, disk full, lag spike, expired certificate, leader imbalance) so the on-call engineer follows steps instead of improvising under pressure.

[ ] RF=3, min.insync.replicas=2, acks=all verified by a broker-kill test
[ ] unclean.leader.election.enable=false; auto.create.topics.enable=false
[ ] TLS on all listeners; SASL auth; default-deny ACLs per principal
[ ] Secrets externalized (vault / k8s), not in images
[ ] JMX metrics scraped; alerts on URP, offline partitions, lag, disk, controller count
[ ] Retention set per topic; >30% disk headroom on every log dir
[ ] Clients audited: idempotent producers, manual commits, sane timeouts
[ ] DR target chosen (MirrorMaker 2 / standby) and failover rehearsed
[ ] Runbooks written for broker-down, disk-full, lag-spike, cert-expiry
[ ] Capacity validated against peak load with N-1 broker headroom

Best Practices

Make durability the default at the broker level (default.replication.factor=3, min.insync.replicas=2) so teams cannot accidentally create unsafe topics.
Prove guarantees by injecting failures in staging — kill brokers, fill disks, expire certs — rather than trusting configuration on paper.
Adopt default-deny security: disable plaintext listeners, require SASL, and grant least-privilege ACLs per service principal.
Alert on leading indicators (under-replicated and offline partitions, lag growth, disk trend) so you act before an outage, not during one.
Keep at least 30-40% disk headroom and set explicit per-topic retention; a full log directory is one of the most common self-inflicted outages.
Choose and rehearse a disaster-recovery path with measured RPO/RTO; a DR plan that has never been tested will fail when it matters.
Write runbooks for the predictable incidents so on-call response is repeatable and fast under pressure.