Kafka Streams Introduction

Kafka Streams is a Java library for building real-time, stateful stream-processing applications that read from and write to Kafka topics. Unlike most stream processors, it is not a separate framework you deploy to a cluster — it is a dependency you add to an ordinary JVM application. That means your processing logic runs inside your own service, scales the same way your service scales, and inherits Kafka’s durability and fault-tolerance guarantees without any extra infrastructure to operate.

Why a library and not a cluster

The defining trait of Kafka Streams is that there is no processing cluster. You write a normal main() method (or a Spring Boot app), pull in kafka-streams, define a topology, and run it. Behind the scenes the library uses Kafka’s consumer and producer protocols to read input, process records, manage state, and write output.

This has practical consequences in production:

No new operational surface. You already run Kafka; you do not also run a Spark or Flink cluster, a resource manager, or a job scheduler.
Standard deployment. Package it as a JAR, container, or Spring Boot service and deploy it like any other microservice — Kubernetes, ECS, a plain VM, anything that runs the JVM.
Elastic scaling by instances. Run more copies of the application and Kafka’s consumer-group protocol automatically redistributes work across them.

Kafka Streams only talks to Kafka. Both its input and output are Kafka topics, and its state and fault tolerance are also backed by Kafka topics. It is purpose-built for Kafka, not a general-purpose compute engine.

Core abstractions: KStream, KTable, GlobalKTable

Kafka Streams gives you two primary views over a topic, plus a globally-replicated variant.

Abstraction	Models	Semantics	Typical use
`KStream<K,V>`	An unbounded stream of records	Every record is an independent event (insert/append)	Clicks, payments, page views, raw events
`KTable<K,V>`	A changelog stream as a table	Latest value per key (upsert; null = delete)	Current account balance, latest user profile
`GlobalKTable<K,V>`	A fully replicated table	Entire table on every instance, not partitioned	Small reference/lookup data joined to a stream

A KStream treats each record as a fact that happened. A KTable interprets a topic as a changelog: records with the same key overwrite each other, so the table holds the most recent value per key. A GlobalKTable loads the whole topic into every application instance, which makes it ideal for enrichment lookups (for example, joining a stream of orders against a small countries table) without repartitioning the stream.

The topology: a processing DAG

You describe processing declaratively with the DSL, which builds a topology — a directed acyclic graph (DAG) of source nodes (topics), processor nodes (operations like filter, map, groupBy, aggregate, join), and sink nodes (output topics). When the application starts, this topology is instantiated and runs continuously.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class WordLengthApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "word-length-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> words = builder.stream("words-input");

        words.filter((key, value) -> value != null && !value.isBlank())
             .mapValues(value -> String.valueOf(value.length()))
             .to("word-lengths-output", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Send a few records to words-input and read the result topic:

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic words-input
> kafka
> streams

kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic word-lengths-output --from-beginning

Output:

5
7

Scaling with partitions

Parallelism in Kafka Streams is driven by topic partitions. The application.id becomes the consumer-group id, so every instance you launch joins the same group and is assigned a subset of partitions. A topology is split into tasks, one per input partition, and tasks are distributed across instances and threads (num.stream.threads).

topic with 6 partitions
   |
   +-- instance A: tasks 0,1,2
   +-- instance B: tasks 3,4,5

If instance B dies, Kafka rebalances its tasks (and their state) onto instance A. The maximum useful parallelism is the number of input partitions — scaling past that leaves extra instances idle. State for each task is kept in a local store (RocksDB by default) and continuously backed up to a Kafka changelog topic, so a restarted or relocated task can restore its state exactly.

Exactly-once processing

Kafka Streams supports exactly-once semantics (EOS) end to end: consuming input, updating state, and producing output all commit atomically using Kafka transactions. Enable it with a single configuration key.

processing.guarantee=exactly_once_v2

With exactly_once_v2, a record cannot be processed twice and state updates cannot be partially applied, even across failures and rebalances. The default, at_least_once, is faster but may reprocess records after a crash. EOS adds modest latency from transactional commits, so reserve it for pipelines where duplicates are unacceptable (financial ledgers, billing, deduplicated counts).

How it compares to Spark and Flink

Spark Structured Streaming and Apache Flink are powerful, but they are cluster-based engines: you submit jobs to a managed runtime with its own scheduler, workers, and resource model.

Aspect	Kafka Streams	Spark / Flink
Deployment	Library inside your app	Job submitted to a cluster
Infrastructure	Just a Kafka cluster	Engine cluster + resource manager
Sources/sinks	Kafka only	Many connectors (HDFS, JDBC, files, Kafka, …)
Scaling	Add app instances	Add/allocate cluster workers
Best for	Kafka-to-Kafka microservices	Mixed-source, heavy batch + streaming analytics

Choose Kafka Streams when your data lives in Kafka and you want stream processing to be just another microservice. Choose Spark or Flink when you need a general engine spanning many systems or large-scale batch alongside streaming.

Best Practices

Always set a stable, unique application.id; it controls the consumer group, internal topic names, and state directories.
Size input topic partitions for your target parallelism up front — you cannot exceed partition count in parallelism, and changing partitions later is disruptive.
Specify explicit Serdes per operator rather than relying only on defaults, especially for keys after a groupBy or repartition.
Register a JVM shutdown hook that calls streams.close() so tasks leave the group cleanly and commit offsets.
Use exactly_once_v2 only where duplicates are truly unacceptable; otherwise the default at_least_once gives better throughput.
Provision persistent, fast local disk for RocksDB state stores so state restoration from changelogs stays quick after rebalances.