Kafka Best Practices

Apache Kafka is forgiving in development and unforgiving in production: the defaults that make a local demo work will silently drop messages, stall consumers, or melt a broker under real load. This page consolidates the practices that matter most when you run Kafka for keeps — covering topic design, producer and consumer configuration, durability, schema governance, security, and operations. Treat it as a checklist you revisit before every service goes live.

Topic and partition design

Topics are the long-lived contract between teams, so design them deliberately. Choose partition counts based on your target throughput and the parallelism you need: a consumer group can have at most one active consumer per partition, so partitions set your ceiling for horizontal scaling. Over-partitioning wastes broker file handles and increases end-to-end latency and rebalance time; under-partitioning caps throughput. A practical starting point is to divide your target peak throughput by the throughput a single partition sustains in your own benchmarks (often 1–10 MB/s), round up, and leave headroom for growth, because adding partitions later changes which partition a key maps to and so breaks key-based ordering.
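The sizing arithmetic can be sketched as follows. This is a minimal illustration of a common rule of thumb (size for whichever of the producer or consumer side is slower per partition), not an official formula; the class name, method name, and headroom factor are hypothetical, and the per-partition rates are assumed to come from your own measurements.

```java
/** Rough partition-count estimate; every input is a measurement you supply. */
final class PartitionSizing {

    /**
     * @param targetMbPerSec       peak throughput the topic must sustain
     * @param producerMbPerSecEach measured produce rate for a single partition
     * @param consumerMbPerSecEach measured consume rate for a single partition
     * @param headroomFactor       growth multiplier, e.g. 1.5 for 50% headroom
     */
    static int estimatePartitions(double targetMbPerSec,
                                  double producerMbPerSecEach,
                                  double consumerMbPerSecEach,
                                  double headroomFactor) {
        if (targetMbPerSec <= 0 || producerMbPerSecEach <= 0
                || consumerMbPerSecEach <= 0 || headroomFactor < 1.0) {
            throw new IllegalArgumentException("rates must be positive and headroom >= 1");
        }
        double forProducers = targetMbPerSec / producerMbPerSecEach;
        double forConsumers = targetMbPerSec / consumerMbPerSecEach;
        // The slower side dictates how many partitions are needed.
        double needed = Math.max(forProducers, forConsumers) * headroomFactor;
        return (int) Math.ceil(needed);
    }

    public static void main(String[] args) {
        // 100 MB/s target, 10 MB/s per partition producing, 5 MB/s consuming, 50% headroom.
        System.out.println(estimatePartitions(100, 10, 5, 1.5)); // prints 30
    }
}
```

Treat the result as a starting point to validate under load, not as a guarantee.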

Adopt a consistent, hierarchical naming convention so topics are self-describing and easy to govern with ACLs and quotas.

<domain>.<entity>.<event-type>.<version>
orders.order.created.v1
payments.payment.settled.v1
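A convention is easier to keep when it is checked automatically, for example in a topic-provisioning pipeline. The validator below is a sketch under assumptions: the exact segment rules (lowercase letters, digits, hyphens, and a `v<number>` suffix) are one interpretation of the pattern above, and the `TopicNames` class is hypothetical.

```java
import java.util.regex.Pattern;

/** Checks topic names against <domain>.<entity>.<event-type>.<version>. */
final class TopicNames {
    // One lowercase segment: starts with a letter, then letters, digits, or hyphens.
    private static final String SEGMENT = "[a-z][a-z0-9-]*";

    // Three dot-separated segments followed by a version such as v1 or v12.
    private static final Pattern CONVENTION = Pattern.compile(
            SEGMENT + "\\." + SEGMENT + "\\." + SEGMENT + "\\.v[1-9][0-9]*");

    static boolean isValid(String topic) {
        return topic != null && CONVENTION.matcher(topic).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("orders.order.created.v1")); // prints true
        System.out.println(isValid("Orders.created"));          // prints false
    }
}
```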

Concern	Guidance
Partition count	Size for peak throughput plus headroom; avoid changing it after launch
Replication factor	3 in production; never 1
Naming	Lowercase, dot- or hyphen-delimited, include a version segment
Retention	Set explicit `retention.ms`/`retention.bytes`; use compaction for changelog/state topics
Keys	Use a stable business key to preserve per-entity ordering
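The link between keys, ordering, and partition count follows from how a keyed record is placed: the key's hash modulo the number of partitions. The sketch below illustrates that idea only. Kafka's Java client hashes the serialized key with murmur2; `String.hashCode()` is swapped in here to keep the example dependency-free, and `KeyPartitioning` is a hypothetical name.

```java
/** Illustrates why a key can move when the partition count changes. */
final class KeyPartitioning {

    /** Stand-in for the client's partitioner: non-negative hash modulo partition count. */
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        String key = "k";
        // Same key, same partition count: always the same partition, so order is kept.
        System.out.println(partitionFor(key, 6)); // prints 5
        // After growing the topic, the same key lands elsewhere.
        System.out.println(partitionFor(key, 8)); // prints 3
    }
}
```

Records written before and after the change sit in different partitions, so a consumer can no longer rely on seeing them in order for that key.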

Avoid auto.create.topics.enable=true in production. Auto-created topics inherit broker defaults (often RF=1) and bypass your naming and review process.

Producer best practices

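Do not rely on producer defaults for safety: clients before Kafka 3.0 default to acks=1 with idempotence off. For any data you cannot afford to lose, require acknowledgement from all in-sync replicas and turn on idempotence explicitly to prevent duplicates from retries. Idempotence (enable.idempotence=true) requires acks=all, at most five in-flight requests per connection, and retries enabled, and the client rejects explicit settings that conflict with it. With it on, retries (bounded by delivery.timeout.ms) no longer create duplicates or reorder records within a partition, which makes it the single most important producer setting for correctness.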

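import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// `order`, `payload`, and `log` are assumed to come from the surrounding service code.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// Durability: wait for all in-sync replicas and deduplicate retried sends.
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
// Throughput: wait up to 10 ms to fill a batch and compress each batch.
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);

// try-with-resources closes the producer, which waits for in-flight sends to complete.
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // The order id is the key, so all events for one order share a partition.
    var record = new ProducerRecord<>("orders.order.created.v1", order.id(), payload);
    producer.send(record, (metadata, ex) -> {
        if (ex != null) {
            // The callback is the only place an asynchronous send failure surfaces.
            log.error("Publish failed for key {}", order.id(), ex);
        }
    });
}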

Always supply a key when ordering or co-partitioning matters, and always handle the send callback — a fire-and-forget send() whose future you never inspect will swallow failures. Use linger.ms and compression (zstd or lz4) to batch effectively and cut network and storage cost.

Consumer best practices

Make consumers idempotent first, then tune throughput. Because Kafka guarantees at-least-once delivery in most configurations, your handler will occasionally see a message twice; design downstream writes to be safe on replay (upserts, dedup keys, or a processed-id table). Disable auto-commit and commit offsets only after the work is durably done, so a crash mid-processing reprocesses rather than skips.
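A processed-id check combined with an upsert can be sketched like this. The in-memory set and map are stand-ins for a database table and are assumptions made for illustration; in a real service the processed-id record and the business write would be committed in the same transaction. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Applies each event at most once, so redelivery after a crash is harmless. */
final class IdempotentOrderHandler {
    // Stand-in for a processed-id table.
    private final Set<String> processedEventIds = new HashSet<>();
    // Stand-in for the business table, keyed by order id.
    private final Map<String, String> orderStatusById = new HashMap<>();

    /** Returns true if the event was applied, false if it was a duplicate. */
    boolean handle(String eventId, String orderId, String status) {
        // Set.add returns false when the id is already present: a redelivery.
        if (!processedEventIds.add(eventId)) {
            return false;
        }
        // Upsert: writing the same value twice leaves the same state.
        orderStatusById.put(orderId, status);
        return true;
    }

    String statusOf(String orderId) {
        return orderStatusById.get(orderId);
    }
}
```

Only after `handle` returns should the offset be committed; if the process dies before the commit, the replayed event is recognized and skipped.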

@KafkaListener(topics = "orders.order.created.v1", groupId = "fulfilment")
public void onOrder(ConsumerRecord<String, OrderCreated> record, Acknowledgment ack) {
    orderService.process(record.key(), record.value()); // idempotent upsert
    ack.acknowledge(); // manual commit after success
}

spring:
  kafka:
    consumer:
      enable-auto-commit: false
      max-poll-records: 500
      fetch-min-size: 1048576   # 1 MiB - batch fetches
      properties:
        max.poll.interval.ms: 300000
    listener:
      ack-mode: manual

Keep per-record processing fast; if work is slow, lower max.poll.records or offload to a worker pool so you do not exceed max.poll.interval.ms and trigger a rebalance. Route poison messages to a dead-letter topic instead of blocking the partition forever.
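The dead-letter pattern amounts to a bounded retry loop with a fallback. The sketch below shows only that control flow: the list stands in for a dead-letter topic, and `DeadLetterRouter` is a hypothetical class, not a Kafka or Spring API. Spring Kafka users would more likely configure its `DefaultErrorHandler` with a `DeadLetterPublishingRecoverer` than hand-roll this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Retries a handler a fixed number of times, then parks the record. */
final class DeadLetterRouter<T> {
    private final int maxAttempts;
    private final Consumer<T> handler;
    // Stand-in for publishing to a dead-letter topic.
    private final List<T> deadLetters = new ArrayList<>();

    DeadLetterRouter(int maxAttempts, Consumer<T> handler) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
        this.handler = handler;
    }

    /** Returns true if the record was handled, false if it was dead-lettered. */
    boolean process(T record) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(record);
                return true;
            } catch (RuntimeException e) {
                // Swallow and retry until the attempt budget is spent.
            }
        }
        // Give up on this record so the partition can move on.
        deadLetters.add(record);
        return false;
    }

    List<T> deadLetters() {
        return List.copyOf(deadLetters);
    }
}
```

A real implementation would also record the failure cause and original offset alongside the parked record so it can be diagnosed and replayed.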

Durability and availability

Durability comes from three settings working together: the topic's replication factor, min.insync.replicas, and the producer's acks. A replication factor of 3 with min.insync.replicas=2 tolerates one broker loss while staying writable, and min.insync.replicas=2 ensures that acks=all writes are on at least two replicas before being acknowledged.

# broker / topic
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

Setting	Recommended	Why
`replication.factor`	3	Survive a broker failure and still satisfy `min.insync.replicas=2`
`min.insync.replicas`	2	Reject writes if data isn’t replicated enough
`acks` (producer)	all	Wait until all in-sync replicas have the record
`unclean.leader.election.enable`	false	Never elect an out-of-sync replica (prevents data loss)

RF=3 with min.insync.replicas=2 is the safe combination. Setting min.insync.replicas equal to RF means a single broker outage halts all writes — leave one replica of slack.
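The slack argument is simple arithmetic, sketched below. The model is deliberately simplified and rests on two assumptions: each replica lives on a different broker, and every surviving replica is in sync. `WriteAvailability` is a hypothetical helper, not part of Kafka.

```java
/** Simplified model of when a partition accepts acks=all writes. */
final class WriteAvailability {

    static boolean acceptsAcksAllWrites(int replicationFactor,
                                        int minInsyncReplicas,
                                        int brokersDown) {
        if (replicationFactor < 1 || minInsyncReplicas < 1 || brokersDown < 0) {
            throw new IllegalArgumentException("invalid replication settings");
        }
        // Assumption: every replica on a live broker is still in sync.
        int inSyncReplicas = Math.max(0, replicationFactor - brokersDown);
        // The broker rejects acks=all writes once the ISR falls below the minimum.
        return inSyncReplicas >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(acceptsAcksAllWrites(3, 2, 1)); // prints true
        System.out.println(acceptsAcksAllWrites(3, 3, 1)); // prints false
    }
}
```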

Schema governance and security

Treat event payloads as a versioned API. Use a Schema Registry (Avro, Protobuf, or JSON Schema) with backward/forward compatibility enabled so producers and consumers can evolve independently. Never make breaking changes in place — add optional fields, and cut a new topic version for incompatible changes.
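The "add optional fields" rule can be made concrete with a toy compatibility check. This sketch models only one Avro-style rule for backward compatibility (a reader on the new schema must be able to read old data, so every added field needs a default) and ignores type changes, aliases, and nesting. A real Schema Registry performs the full check; the map-based schema representation and class name here are assumptions for illustration.

```java
import java.util.Map;

/** Toy backward-compatibility check: field name mapped to "has a default". */
final class SchemaCompatibility {

    static boolean isBackwardCompatible(Map<String, Boolean> oldSchema,
                                        Map<String, Boolean> newSchema) {
        for (Map.Entry<String, Boolean> field : newSchema.entrySet()) {
            boolean addedInNewSchema = !oldSchema.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            // Old records lack the added field, so the reader needs a default to fill in.
            if (addedInNewSchema && !hasDefault) {
                return false;
            }
        }
        // Fields removed from the new schema are simply ignored when reading old data.
        return true;
    }
}
```

Under this model, adding `couponCode` with a default passes, while adding it as a required field fails and would call for a new topic version.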

For security, enable TLS for encryption in transit, authenticate clients with SASL (SCRAM or OAUTHBEARER) or with mutual TLS, and authorize with ACLs scoped per topic and consumer group. Grant least privilege: a service should only read or write the topics it owns.

security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
ssl.truststore.location=/etc/kafka/secrets/truststore.jks

Operations and monitoring

You cannot operate what you cannot see. Track consumer lag per group and partition as your primary health signal, alongside under-replicated partitions, ISR shrink/expand rates, request latency, and broker disk usage. Expose JMX metrics to Prometheus and alert on sustained lag growth and any non-zero under-replicated partition count.

kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --describe --group fulfilment

Output (abridged; the tool also prints CONSUMER-ID, HOST, and CLIENT-ID columns):

GROUP       TOPIC                       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
fulfilment  orders.order.created.v1     0          145820          145820          0
fulfilment  orders.order.created.v1     1          145210          145298          88
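Lag is the log-end offset minus the group's current offset, and "sustained growth" means that number keeps rising across consecutive samples. The sketch below shows both calculations; the strictly-increasing rule is one simple assumption about what to alert on, and `ConsumerLag` is a hypothetical helper rather than a Kafka API.

```java
import java.util.List;

/** Lag arithmetic and a naive sustained-growth check. */
final class ConsumerLag {

    /** Lag for one partition: how far the group is behind the end of the log. */
    static long lag(long currentOffset, long logEndOffset) {
        return Math.max(0, logEndOffset - currentOffset);
    }

    /** True when every sample in the window is larger than the one before it. */
    static boolean isGrowing(List<Long> lagSamples) {
        if (lagSamples.size() < 2) {
            return false;
        }
        for (int i = 1; i < lagSamples.size(); i++) {
            if (lagSamples.get(i) <= lagSamples.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Partition 1 from the sample output above.
        System.out.println(lag(145210, 145298)); // prints 88
    }
}
```

A brief spike that recovers is normal under bursty load; a window in which lag only ever increases usually means consumers cannot keep up.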

Test failure paths deliberately: kill brokers in staging, verify failover, and rehearse partition reassignment and rolling upgrades before you need them in an incident.

Best Practices

Set enable.idempotence=true and acks=all on every producer carrying important data, and always handle the send callback.
Run RF=3 with min.insync.replicas=2 and disable unclean leader election so acknowledged writes survive a broker failure without a single outage halting writes.
Disable consumer auto-commit, make handlers idempotent, and commit only after successful processing; route poison messages to a dead-letter topic.
Size partition counts for peak throughput plus headroom and avoid increasing them later, since adding partitions remaps keys and breaks key-based ordering.
Disable auto.create.topics.enable and enforce a versioned naming convention plus schema compatibility so topics evolve safely.
Secure clusters with TLS, SASL, and least-privilege ACLs, and monitor consumer lag and under-replicated partitions with alerts.