Kafka Streams Introduction
Kafka Streams is a Java library for building real-time, stateful stream-processing applications that read from and write to Kafka topics. Unlike most stream processors, it is not a separate framework you deploy to a cluster — it is a dependency you add to an ordinary JVM application. That means your processing logic runs inside your own service, scales the same way your service scales, and inherits Kafka’s durability and fault-tolerance guarantees without any extra infrastructure to operate.
Why a library and not a cluster
The defining trait of Kafka Streams is that there is no processing cluster. You write a normal main() method (or a Spring Boot app), pull in kafka-streams, define a topology, and run it. Behind the scenes the library uses Kafka’s consumer and producer protocols to read input, process records, manage state, and write output.
This has practical consequences in production:
- No new operational surface. You already run Kafka; you do not also run a Spark or Flink cluster, a resource manager, or a job scheduler.
- Standard deployment. Package it as a JAR, container, or Spring Boot service and deploy it like any other microservice — Kubernetes, ECS, a plain VM, anything that runs the JVM.
- Elastic scaling by instances. Run more copies of the application and Kafka’s consumer-group protocol automatically redistributes work across them.
Kafka Streams only talks to Kafka. Both its input and output are Kafka topics, and its state and fault tolerance are also backed by Kafka topics. It is purpose-built for Kafka, not a general-purpose compute engine.
Core abstractions: KStream, KTable, GlobalKTable
Kafka Streams gives you two primary views over a topic, plus a globally-replicated variant.
| Abstraction | Models | Semantics | Typical use |
|---|---|---|---|
KStream<K,V> | An unbounded stream of records | Every record is an independent event (insert/append) | Clicks, payments, page views, raw events |
KTable<K,V> | A changelog stream as a table | Latest value per key (upsert; null = delete) | Current account balance, latest user profile |
GlobalKTable<K,V> | A fully replicated table | Entire table on every instance, not partitioned | Small reference/lookup data joined to a stream |
A KStream treats each record as a fact that happened. A KTable interprets a topic as a changelog: records with the same key overwrite each other, so the table holds the most recent value per key. A GlobalKTable loads the whole topic into every application instance, which makes it ideal for enrichment lookups (for example, joining a stream of orders against a small countries table) without repartitioning the stream.
The topology: a processing DAG
You describe processing declaratively with the DSL, which builds a topology — a directed acyclic graph (DAG) of source nodes (topics), processor nodes (operations like filter, map, groupBy, aggregate, join), and sink nodes (output topics). When the application starts, this topology is instantiated and runs continuously.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import java.util.Properties;
public class WordLengthApp {
public static void main(String[] args) {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "word-length-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> words = builder.stream("words-input");
words.filter((key, value) -> value != null && !value.isBlank())
.mapValues(value -> String.valueOf(value.length()))
.to("word-lengths-output", Produced.with(Serdes.String(), Serdes.String()));
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
}
Send a few records to words-input and read the result topic:
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic words-input
> kafka
> streams
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
--topic word-lengths-output --from-beginning
Output:
5
7
Scaling with partitions
Parallelism in Kafka Streams is driven by topic partitions. The application.id becomes the consumer-group id, so every instance you launch joins the same group and is assigned a subset of partitions. A topology is split into tasks, one per input partition, and tasks are distributed across instances and threads (num.stream.threads).
topic with 6 partitions
|
+-- instance A: tasks 0,1,2
+-- instance B: tasks 3,4,5
If instance B dies, Kafka rebalances its tasks (and their state) onto instance A. The maximum useful parallelism is the number of input partitions — scaling past that leaves extra instances idle. State for each task is kept in a local store (RocksDB by default) and continuously backed up to a Kafka changelog topic, so a restarted or relocated task can restore its state exactly.
Exactly-once processing
Kafka Streams supports exactly-once semantics (EOS) end to end: consuming input, updating state, and producing output all commit atomically using Kafka transactions. Enable it with a single configuration key.
processing.guarantee=exactly_once_v2
With exactly_once_v2, a record cannot be processed twice and state updates cannot be partially applied, even across failures and rebalances. The default, at_least_once, is faster but may reprocess records after a crash. EOS adds modest latency from transactional commits, so reserve it for pipelines where duplicates are unacceptable (financial ledgers, billing, deduplicated counts).
How it compares to Spark and Flink
Spark Structured Streaming and Apache Flink are powerful, but they are cluster-based engines: you submit jobs to a managed runtime with its own scheduler, workers, and resource model.
| Aspect | Kafka Streams | Spark / Flink |
|---|---|---|
| Deployment | Library inside your app | Job submitted to a cluster |
| Infrastructure | Just a Kafka cluster | Engine cluster + resource manager |
| Sources/sinks | Kafka only | Many connectors (HDFS, JDBC, files, Kafka, …) |
| Scaling | Add app instances | Add/allocate cluster workers |
| Best for | Kafka-to-Kafka microservices | Mixed-source, heavy batch + streaming analytics |
Choose Kafka Streams when your data lives in Kafka and you want stream processing to be just another microservice. Choose Spark or Flink when you need a general engine spanning many systems or large-scale batch alongside streaming.
Best Practices
- Always set a stable, unique
application.id; it controls the consumer group, internal topic names, and state directories. - Size input topic partitions for your target parallelism up front — you cannot exceed partition count in parallelism, and changing partitions later is disruptive.
- Specify explicit
Serdesper operator rather than relying only on defaults, especially for keys after agroupByor repartition. - Register a JVM shutdown hook that calls
streams.close()so tasks leave the group cleanly and commit offsets. - Use
exactly_once_v2only where duplicates are truly unacceptable; otherwise the defaultat_least_oncegives better throughput. - Provision persistent, fast local disk for RocksDB state stores so state restoration from changelogs stays quick after rebalances.