Running on Kubernetes (Strimzi)
Running Kafka on Kubernetes by hand means juggling StatefulSets, headless services, persistent volumes, and config maps, and then reinventing all the operational logic for rolling upgrades, broker reconfiguration, and certificate rotation. Strimzi replaces that toil with a Kubernetes operator: you declare the cluster you want as custom resources, and the operator reconciles reality to match. This page walks through installing Strimzi, defining a KRaft-based Kafka cluster, managing topics and users declaratively, and understanding how the operator performs safe rolling updates.
How Strimzi works
Strimzi extends the Kubernetes API with Custom Resource Definitions (CRDs) such as Kafka, KafkaNodePool, KafkaTopic, and KafkaUser. The Cluster Operator watches these resources and creates the underlying pods (managed through StrimziPodSets in current releases; older releases used StatefulSets), services, and volumes. Two other operators, the Topic Operator and the User Operator, are deployed together as the Entity Operator and reconcile KafkaTopic and KafkaUser resources against the running brokers.
Because the desired state lives in version-controlled YAML, Strimzi fits naturally into GitOps workflows. You never run kafka-topics.sh against production again — you commit a KafkaTopic and let the operator apply it.
Installing the operator
Install the Cluster Operator into a dedicated namespace. The simplest path applies the published install bundle; for production, prefer the Helm chart or OperatorHub so upgrades are managed.
kubectl create namespace kafka
# Install the Cluster Operator, scoped to the "kafka" namespace
kubectl apply -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
# Wait for the operator to become ready
kubectl wait deployment/strimzi-cluster-operator \
--for=condition=Available --timeout=120s -n kafka
Output:
deployment.apps/strimzi-cluster-operator condition met
Defining a Kafka cluster (KRaft)
Modern Strimzi runs Kafka in KRaft mode, eliminating ZooKeeper. Brokers and controllers are described with KafkaNodePool resources, while the Kafka resource holds cluster-wide configuration like listeners. The strimzi.io/kraft: enabled annotation activates KRaft, and strimzi.io/node-pools: enabled opts into node pools.
The example below defines a controller pool and a broker pool of three replicas each, with persistent storage and two internal listeners: a plaintext listener on port 9092 and a TLS-encrypted listener on port 9093.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controller
  namespace: kafka
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  replicas: 3
  roles:
    - controller
  storage:
    type: persistent-claim
    size: 20Gi
    deleteClaim: false
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: broker
  namespace: kafka
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: persistent-claim
    size: 500Gi
    class: fast-ssd
    deleteClaim: false
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: prod-cluster
  namespace: kafka
  annotations:
    strimzi.io/kraft: enabled
    strimzi.io/node-pools: enabled
spec:
  kafka:
    version: 3.9.0
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
  entityOperator:
    topicOperator: {}
    userOperator: {}
Apply it and watch the operator build the cluster:
kubectl apply -f kafka-cluster.yaml
kubectl wait kafka/prod-cluster --for=condition=Ready --timeout=300s -n kafka
Keep controllers on a dedicated node pool rather than combining controller and broker roles on the same pods. Isolating the metadata quorum protects cluster stability during heavy broker load and makes scaling brokers independent of the controller count.
Managing topics declaratively
With the Topic Operator enabled, you create topics by applying KafkaTopic resources. The operator keeps Kafka and Kubernetes in sync — edit the YAML to change partitions or retention and the change is applied to the broker.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  namespace: kafka
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: "604800000"
    min.insync.replicas: "2"
    cleanup.policy: "delete"
Partition counts can only be increased, never decreased. Lowering partitions in a KafkaTopic is rejected by the operator, so plan partitioning up front.
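As a sketch of an allowed change, the orders topic above could be grown by editing only spec.partitions and re-applying the resource. The value 24 is an arbitrary illustration, not a sizing recommendation. Keep in mind that adding partitions changes which partition a given key hashes to, so consumers that rely on per-key ordering may need care.

```yaml
# Same KafkaTopic as above; only spec.partitions is changed.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  namespace: kafka
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  partitions: 24                # raised from 12 (illustrative value); lowering it is rejected
  replicas: 3
  config:
    retention.ms: "604800000"   # 7 days in milliseconds
    min.insync.replicas: "2"
    cleanup.policy: "delete"
```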
Managing users and access
When a listener has authentication enabled, the User Operator provisions credentials from KafkaUser resources. It generates the credentials (a TLS certificate or a SCRAM password), stores them in a Kubernetes Secret, and configures matching ACLs on the brokers. The ACLs take effect only if authorization is also enabled on the cluster.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: orders-service
  namespace: kafka
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: orders
          patternType: literal
        operations: [Read, Write, Describe]
        host: "*"
The generated credentials land in a secret named after the user:
kubectl get secret orders-service -n kafka -o jsonpath='{.data.password}' | base64 -d
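The Kafka resource shown earlier on this page defines no listener authentication and no authorization, so SCRAM credentials and ACLs like those above would have nothing to bind to. A minimal sketch of the matching cluster-side settings follows. The listener name scram and port 9094 are assumptions chosen for illustration, not values from the original cluster definition.

```yaml
# Fragment of the Kafka resource (prod-cluster); all other fields omitted.
spec:
  kafka:
    listeners:
      - name: scram              # hypothetical listener name
        port: 9094               # hypothetical port
        type: internal
        tls: true                # encrypt client traffic on this listener
        authentication:
          type: scram-sha-512    # matches spec.authentication.type of the KafkaUser
    authorization:
      type: simple               # turns on ACL enforcement for KafkaUser ACL rules
```

Clients would then connect to this listener with the username orders-service and the password read from the generated secret.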
How the operator handles rolling updates
The Cluster Operator owns the lifecycle of every broker pod. When you change the Kafka resource — a new version, a config tweak, or a certificate renewal — the operator rolls pods one at a time. Before restarting a broker it verifies the broker is not hosting any under-replicated partitions and that taking it down will not drop any partition below min.insync.replicas. If a restart would break availability, the operator waits.
Configuration changes that Kafka supports dynamically are applied via the Admin API without a restart at all; only changes requiring a process restart trigger a rolling bounce. Version upgrades happen in two phases: the binaries roll first, and only once every node is on the new release is the metadata version (spec.kafka.metadataVersion) raised. In KRaft mode the metadata version takes the place of the inter.broker.protocol.version setting used by ZooKeeper-based clusters. This sequencing keeps upgrades zero-downtime as long as each topic's replication factor is greater than its min.insync.replicas (for example 3 and 2), so that one broker can be offline without blocking producers, and as long as clients retry.
| CR change | Operator action |
|---|---|
| Dynamic config (e.g. some topic defaults) | Applied via Admin API, no restart |
| Static broker config | Rolling restart, availability-checked |
| version bump | Two-phase rolling upgrade |
| TLS certificate renewal | Rolling restart, automatic |
| replicas increase in node pool | New pods added, no existing pod restart |
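For a KRaft cluster, the two phases of a version bump might look like the following successive edits to the Kafka resource. This is a sketch that assumes an upgrade from Kafka 3.8.0 to 3.9.0 and an installed Strimzi release that supports both versions; check the supported-versions list for your operator before copying the values.

```yaml
# Phase 1: roll the new binaries while holding the old metadata version.
spec:
  kafka:
    version: 3.9.0
    metadataVersion: 3.8-IV0
---
# Phase 2: once every node runs 3.9.0 and the cluster is healthy,
# raise the metadata version to the new release's level.
spec:
  kafka:
    version: 3.9.0
    metadataVersion: 3.9-IV0
```

Pinning metadataVersion in phase 1 is what keeps the metadata format at the old level while the pods roll.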
Best practices
- Use separate KafkaNodePool resources for controllers and brokers so you can scale and upgrade each role independently.
- Pin spec.kafka.version explicitly and upgrade deliberately; never rely on a floating version in production.
- Store all CRs in Git and apply them through a GitOps pipeline so the cluster state is auditable and reproducible.
- Set deleteClaim: false on storage to prevent accidental data loss when a node pool or cluster resource is deleted.
- Always run replication factor 3 with min.insync.replicas: 2 so the availability-aware rolling restart logic can actually keep the cluster online.
- Enable the Entity Operator and manage topics and users as KafkaTopic/KafkaUser resources rather than imperative CLI commands.
- Configure pod anti-affinity and spread brokers across availability zones so a single node or zone failure cannot take down a quorum.
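The last recommendation can be sketched as follows, under the assumptions that cluster nodes carry the standard topology.kubernetes.io/zone label and that broker pods carry the strimzi.io/pool-name label. Treat both fragments as a starting point rather than a tested configuration.

```yaml
# Fragment of the Kafka resource: rack awareness derives broker.rack
# from the zone label of the node each broker is scheduled on.
spec:
  kafka:
    rack:
      topologyKey: topology.kubernetes.io/zone
---
# Fragment of the broker KafkaNodePool: at most one broker pod per Kubernetes node.
spec:
  template:
    pod:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  strimzi.io/cluster: prod-cluster
                  strimzi.io/pool-name: broker   # assumed pod label
              topologyKey: kubernetes.io/hostname
```

With broker.rack set, Kafka spreads the replicas of newly created partitions across zones; the hard anti-affinity rule means a node pool with more replicas than schedulable nodes will leave pods pending.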