Why Use Kafka?
As systems grow, the hardest problem is rarely a single service — it is the web of integrations between them. Apache Kafka exists to solve that problem: it turns brittle point-to-point calls into a durable, replayable stream of events that any number of services can produce to and consume from independently. This page explains the specific pains Kafka removes, how it stacks up against other messaging systems, and — just as importantly — when you should reach for something else.
The problems Kafka solves
In a small system, services call each other directly over HTTP or gRPC. That works until you have a dozen services, each needing data from several others. The result is a tangle of synchronous dependencies where one slow or failed service cascades into outages everywhere.
Kafka addresses four recurring pains:
- Tight coupling. With direct calls, the producer must know every consumer and wait for them. With Kafka, a producer writes an event once and is done; consumers read at their own pace. Adding a new consumer requires zero changes to the producer.
- Integration sprawl. N services talking to M services is an N×M problem. Kafka collapses it to N+M: everyone connects to the broker, not to each other.
- Lost data. A crashed HTTP consumer drops the request. Kafka persists every event to disk for a configurable retention period, so consumers can fail, restart, and resume from their last committed offset, or rewind and replay from the beginning.
- Scale. A single Kafka cluster can sustain millions of messages per second by splitting each topic into partitions spread across brokers. The consumers in a group divide those partitions among themselves, so both sides scale horizontally as load grows.
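To make the integration-sprawl arithmetic concrete, here is a minimal sketch that counts links in both topologies. The class name and the service counts (12 producers, 8 consumers) are purely illustrative.

```java
// Hypothetical helper: counts integration links for the two topologies.
final class IntegrationCount {
    // Every producer is wired directly to every consumer.
    static long pointToPoint(int producers, int consumers) {
        return (long) producers * consumers;
    }

    // Every service holds one connection to the broker instead.
    static long viaBroker(int producers, int consumers) {
        return (long) producers + consumers;
    }

    public static void main(String[] args) {
        int producers = 12; // assumed
        int consumers = 8;  // assumed
        System.out.println("point-to-point links: " + pointToPoint(producers, consumers)); // 96
        System.out.println("links via broker:     " + viaBroker(producers, consumers));    // 20
    }
}
```

The gap widens quickly: doubling both sides quadruples the point-to-point count but only doubles the broker count.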
The key mental shift is that Kafka is a distributed commit log, not a queue. Messages are not deleted when read; each partition is an ordered, append-only log that stays in place until retention expires, and multiple independent consumer groups can read it at different offsets.
Treat Kafka topics as your durable source of truth for events. Because data is retained and replayable, you can spin up a brand-new service and have it rebuild its entire state from historical events — something a traditional queue cannot offer.
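The log-versus-queue distinction is easier to see in a toy model. The sketch below is an in-memory, single-partition stand-in written only for illustration; `ToyLog` and its methods are hypothetical and are not part of Kafka's API. It shows that reading does not remove records, that each group tracks its own offset, and that rewinding a group replays history.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy, single-partition, in-memory model of a log. Not Kafka's API.
final class ToyLog {
    private final List<String> records = new ArrayList<>();
    private final Map<String, Integer> offsetsByGroup = new HashMap<>();

    // Appends a record and returns the offset assigned to it.
    int append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Returns everything from the group's offset to the end, then advances
    // that group's offset. Records stay in the log, so other groups are unaffected.
    List<String> poll(String group) {
        int from = offsetsByGroup.getOrDefault(group, 0);
        List<String> batch = new ArrayList<>(records.subList(from, records.size()));
        offsetsByGroup.put(group, records.size());
        return batch;
    }

    // Rewinds a group so that its next poll replays the log from the start.
    void seekToBeginning(String group) {
        offsetsByGroup.put(group, 0);
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        log.append("OrderPlaced o-1001");
        log.append("OrderPlaced o-1002");

        System.out.println(log.poll("billing"));   // both records
        log.append("OrderPlaced o-1003");
        System.out.println(log.poll("billing"));   // only the new record
        System.out.println(log.poll("analytics")); // all three: this group has its own offset

        log.seekToBeginning("billing");
        System.out.println(log.poll("billing"));   // replay: all three again
    }
}
```

A traditional queue would correspond to removing each record from `records` once any reader had seen it, which is exactly what makes a late-arriving consumer impossible to backfill.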
Kafka vs RabbitMQ vs SQS/SNS vs Pulsar
There is no single “best” broker — each is tuned for different trade-offs. The table below compares the most common choices on the dimensions that matter in production.
| Dimension | Kafka | RabbitMQ | AWS SQS/SNS | Apache Pulsar |
|---|---|---|---|---|
| Model | Distributed log (pull) | Queue/broker (push) | Managed queue (SQS) + pub/sub (SNS) | Distributed log + queue |
| Message replay | Yes, by offset | No for classic queues (consumed = gone); stream queues can replay | No | Yes |
| Ordering | Per partition | Per queue | FIFO queues only | Per partition |
| Throughput | Very high (millions/sec) | High | High (auto-scaled) | Very high |
| Complex routing | Limited | Excellent (exchanges) | Basic (SNS filtering) | Good |
| Retention | Hours to forever | Until consumed | Up to 14 days | Tiered (incl. cold storage) |
| Ops burden | Self-managed cluster (or MSK/Confluent) | Self-managed (or CloudAMQP) | Fully managed | Self-managed cluster |
| Best for | High-volume event streaming, analytics | Task queues, RPC, intricate routing | Simple decoupling on AWS | Streaming + queuing with geo-replication |
A useful rule of thumb: choose Kafka when events are valuable beyond their immediate consumption (analytics, replay, multiple readers); choose RabbitMQ when you need flexible routing and per-message acknowledgement for task processing; choose SQS/SNS when you want zero operational overhead on AWS; and choose Pulsar when you need both streaming and queue semantics with strong multi-tenancy or geo-replication.
When Kafka is the right fit
Kafka shines in scenarios where data flows continuously and many parties care about it:
- Event-driven microservices: services publish domain events (`OrderPlaced`, `PaymentCaptured`) that others react to asynchronously.
- Stream processing and analytics: feeding real-time aggregations, fraud detection, or dashboards via Kafka Streams or Flink.
- Log and metrics aggregation: collecting high-volume telemetry from many sources into a central pipeline.
- Change data capture (CDC): streaming database changes downstream with tools like Debezium.
- Decoupling at scale: when throughput is high and you need consumers to scale independently of producers.
A minimal producer shows how little code it takes to publish an event with Spring for Apache Kafka:
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.stereotype.Service;

    // In a real project this record lives in its own file (OrderPlaced.java).
    public record OrderPlaced(String orderId, String customerId, long amountCents) {}

    @Service
    public class OrderEventPublisher {

        // Assumes the template is configured with a JSON value serializer.
        private final KafkaTemplate<String, OrderPlaced> kafkaTemplate;

        public OrderEventPublisher(KafkaTemplate<String, OrderPlaced> kafkaTemplate) {
            this.kafkaTemplate = kafkaTemplate;
        }

        public void publish(OrderPlaced event) {
            // The order id is the key: records with the same key go to the same
            // partition, which preserves per-order ordering.
            kafkaTemplate.send("orders", event.orderId(), event);
        }
    }
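Why a key preserves ordering can be sketched without any Kafka dependency. The example below is a simplified stand-in: Kafka's default partitioner hashes the serialized key bytes with murmur2, whereas this sketch uses `String.hashCode()` only to stay self-contained. The class name and the partition count of 6 are assumptions.

```java
// Simplified stand-in for key-based partitioning. Kafka's default partitioner
// uses a murmur2 hash of the serialized key; String.hashCode() is used here
// only to keep the sketch dependency-free.
final class KeyPartitioning {
    static int partitionFor(String key, int numPartitions) {
        // Math.floorMod keeps the result in [0, numPartitions) even for negative hashes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 6; // assumed partition count for the "orders" topic
        for (String orderId : new String[] {"o-1001", "o-1002", "o-1001"}) {
            // The two "o-1001" events always print the same partition number.
            System.out.println(orderId + " -> partition " + partitionFor(orderId, numPartitions));
        }
    }
}
```

Because the mapping depends on the partition count, adding partitions to an existing topic can send a known key to a different partition, so per-key ordering is only guaranteed while the count stays fixed.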
You can confirm messages are flowing using the bundled console consumer:
    kafka-console-consumer.sh \
      --bootstrap-server localhost:9092 \
      --topic orders \
      --from-beginning

Output:

    {"orderId":"o-1001","customerId":"c-42","amountCents":4999}
    {"orderId":"o-1002","customerId":"c-42","amountCents":1299}
    Processed a total of 2 messages
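An application consumer can get similar read-from-the-start behaviour through configuration. The sketch below builds the settings as plain `java.util.Properties` so that it runs without the Kafka client on the classpath. The configuration keys and deserializer class names are standard Kafka consumer settings; the class name `OrdersConsumerConfig` and the group id `orders-audit` are hypothetical.

```java
import java.util.Properties;

// Builds consumer settings as plain Properties (no Kafka classes required to run this).
final class OrdersConsumerConfig {
    static Properties build(String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", groupId);
        // Start from the oldest retained record, but only when this group
        // has no committed offset yet; afterwards the group resumes where it left off.
        props.setProperty("auto.offset.reset", "earliest");
        // Values are read as raw JSON strings in this sketch.
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("orders-audit"); // hypothetical group id
        System.out.println("group.id          = " + props.getProperty("group.id"));
        System.out.println("auto.offset.reset = " + props.getProperty("auto.offset.reset"));
    }
}
```

With the Kafka client library available, these properties could be passed to `new KafkaConsumer<String, String>(props)`. Using a fresh group id is what makes `earliest` take effect, which is how a new service can backfill from history.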
When Kafka is the wrong fit
Kafka is powerful, but it is not free. Reach for something simpler when:
- You need priority queues. Kafka has no concept of message priority — every message in a partition is processed in order. RabbitMQ handles this natively.
- You need complex routing. Topic exchanges, header-based routing, and dead-letter routing are RabbitMQ’s strength. Kafka pushes routing logic into consumers.
- The app is tiny and low-volume. A handful of messages per minute does not justify running and operating a Kafka cluster. A managed SQS queue or even a database table is simpler.
- You need per-message TTL or delayed delivery. Kafka retains by time/size at the topic level, not per message. Delayed jobs are awkward to model.
- You need request/reply RPC. Kafka can do it, but it is unnatural; a synchronous protocol or RabbitMQ RPC fits better.
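To illustrate what "routing logic in consumers" means in practice, here is a minimal, dependency-free sketch. The broker delivers every event on a topic, and the consumer decides which handler, if any, should see it. `ConsumerSideRouter` and the event type names are hypothetical; a real consumer would call something like `route` from inside its poll loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical consumer-side router: every event reaches the consumer,
// which then picks a handler by event type.
final class ConsumerSideRouter {
    private final Map<String, Consumer<String>> handlersByType;

    ConsumerSideRouter(Map<String, Consumer<String>> handlersByType) {
        this.handlersByType = handlersByType;
    }

    // Returns true if a handler was registered for the event type.
    boolean route(String eventType, String payload) {
        Consumer<String> handler = handlersByType.get(eventType);
        if (handler == null) {
            return false; // not relevant to this consumer; skip it
        }
        handler.accept(payload);
        return true;
    }

    public static void main(String[] args) {
        List<String> placedOrders = new ArrayList<>();
        ConsumerSideRouter router = new ConsumerSideRouter(Map.of(
                "OrderPlaced", placedOrders::add,
                "PaymentCaptured", payload -> System.out.println("payment: " + payload)));

        router.route("OrderPlaced", "o-1001");
        router.route("OrderShipped", "o-1001"); // no handler registered, ignored
        System.out.println(placedOrders); // [o-1001]
    }
}
```

The cost of this approach is that each consumer still receives and deserializes events it ends up discarding, which a broker with server-side routing would have filtered out earlier.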
Operating Kafka yourself means managing brokers, partitions, replication, and (in older deployments) ZooKeeper. Modern clusters run in KRaft mode without ZooKeeper, and managed offerings like Confluent Cloud or Amazon MSK remove most of this burden. Either way, factor the operational or subscription cost in before committing.
Best practices
- Choose Kafka when events have lasting value (replay, multiple consumers, analytics), not merely to move a single task between two services.
- Model topics around business domains and use a meaningful message key to preserve ordering where it matters.
- Set retention deliberately — long enough for replay and recovery, short enough to control storage costs.
- Prefer a managed offering (MSK, Confluent Cloud) early on so your team focuses on data flow rather than cluster operations.
- Do not force priority, complex routing, or delayed delivery onto Kafka; pair it with RabbitMQ or a scheduler when those patterns dominate.
- Benchmark with realistic volume before adopting — if traffic is low and stays low, a simpler queue is the better engineering choice.
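Setting retention deliberately usually starts with a back-of-the-envelope storage estimate. The sketch below multiplies ingest rate by retention window by replication factor; it ignores compression, index files, and headroom, so treat the result as a rough lower-level guide rather than a capacity plan. All input figures are assumptions, and `retention.ms` is the topic-level setting the computed window would feed into.

```java
import java.time.Duration;

// Rough retention sizing: ingest rate x retention window x replication factor.
final class RetentionSizing {
    // Ignores compression, index files, and free-space headroom.
    static long estimateBytes(long bytesPerSecond, Duration retention, int replicationFactor) {
        return bytesPerSecond * retention.getSeconds() * replicationFactor;
    }

    public static void main(String[] args) {
        long bytesPerSecond = 1_000_000L;        // assumed: 1 MB/s average ingest
        Duration retention = Duration.ofDays(7); // assumed: 7-day replay window
        int replicationFactor = 3;               // assumed

        long bytes = estimateBytes(bytesPerSecond, retention, replicationFactor);
        System.out.println("retention.ms = " + retention.toMillis());           // 604800000
        System.out.println("approx. disk = " + bytes / 1_000_000_000L + " GB"); // 1814 GB
    }
}
```

Rerunning the estimate with measured traffic from a benchmark shows whether a longer replay window is affordable or whether the storage cost argues for a shorter one.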