Kafka Connect Introduction

Most real systems do not start with Kafka — data already lives in relational databases, object stores, search indexes, and SaaS APIs. Kafka Connect is the framework and runtime that moves that data in and out of Kafka reliably, without writing and operating bespoke producer or consumer applications for every integration. You declare a connector with a JSON configuration, and Connect handles the hard parts — parallelism, offset tracking, restarts, retries, and at-least-once delivery — for you. This page explains what Connect is, the difference between source and sink connectors, why you reach for it instead of hand-rolled clients, and where to find connectors that already exist.

What Kafka Connect is

Kafka Connect is a separate component of the Apache Kafka project: a distributed, scalable runtime whose only job is to copy data between Kafka and other systems. It is not part of the broker. It runs as its own JVM worker processes (a single worker in standalone mode, a cluster of workers in distributed mode) that talk to the brokers over the normal client protocol. You do not write the data-movement loop yourself; instead you install a connector plugin and submit configuration to the Connect REST API (standalone mode can also read it from a properties file at startup).
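To make the worker side concrete, here is a minimal sketch of a distributed-mode worker configuration, written out from a shell script. The property keys are standard Connect worker settings; the broker address, group id, topic names, and plugin directory are placeholders chosen for illustration.

```shell
#!/bin/sh
# Write a minimal distributed-worker config file.
# All values below (hosts, names, paths) are placeholders, not taken from a real cluster.
cat > connect-distributed.properties <<'EOF'
bootstrap.servers=kafka:9092
# Workers that share a group.id form one Connect cluster.
group.id=connect-cluster
# Internal topics holding connector configs, source offsets, and status.
# Their replication factor defaults to 3; a single-broker dev cluster needs
# the matching *.storage.replication.factor settings lowered to 1.
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Directories the worker scans for connector plugins at startup.
plugin.path=/opt/connect/plugins
EOF
```

From a Kafka installation directory, a worker using this file would typically be started with bin/connect-distributed.sh connect-distributed.properties.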

The unit of work is a connector, which Connect splits into one or more tasks that run in parallel across the workers. A connector plugin is a JAR, or a directory of JARs, containing a Connector implementation and its Task classes; it knows how to read from or write to a specific external system. The same runtime hosts many connectors at once, each with its own configuration and lifecycle.
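The connector-to-task split can be pictured with a small shell sketch. This is an illustration only: assign_tasks and the table names are hypothetical, and the round-robin policy is a stand-in, since each real connector decides its own split when Connect asks it for task configurations.

```shell
#!/bin/sh
# Illustrative only: deal work units round-robin to at most max_tasks tasks.
# assign_tasks is a hypothetical helper, not part of Kafka Connect.
assign_tasks() {
  max_tasks=$1
  shift
  # A connector never needs more tasks than it has work units.
  task_count=$max_tasks
  if [ "$#" -lt "$task_count" ]; then
    task_count=$#
  fi
  i=0
  for unit in "$@"; do
    echo "task-$((i % task_count)) $unit"
    i=$((i + 1))
  done
}

# Three tables shared between two tasks.
assign_tasks 2 orders customers payments
```

With two tasks and three tables, one task ends up with two tables. Asking for five tasks with only two tables would still produce just two assignments, which is why a tasks.max larger than the available work does not add throughput.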

  External system            Kafka Connect cluster              Kafka
 ┌──────────────┐           ┌─────────────────────┐           ┌────────┐
 │  PostgreSQL  │──source──▶│ worker 1 │ worker 2 │──produce─▶│ topics │
 └──────────────┘           │  tasks distributed  │           │        │
 ┌──────────────┐           │  across workers     │           │        │
 │ Elasticsearch│◀──sink────│ worker 1 │ worker 2 │◀─consume──│        │
 └──────────────┘           └─────────────────────┘           └────────┘

Source vs sink connectors

There are exactly two directions, and every connector is one or the other.

	Source connector	Sink connector
Direction	External system → Kafka	Kafka → external system
Acts as	A Kafka producer	A Kafka consumer
Tracks position with	Source offsets (e.g. DB log position)	Consumer group offsets
Typical examples	Debezium (CDC), JDBC source, file source	JDBC sink, Elasticsearch, S3, BigQuery

A source connector pulls records from an external system and writes them to Kafka topics, for example streaming row changes out of PostgreSQL. A sink connector consumes from Kafka topics and writes the records into an external system, for example indexing every event into Elasticsearch. In distributed mode, Connect stores source progress in an internal offsets topic (standalone mode uses a local file). Sink progress is stored as ordinary consumer-group offsets, in a group named connect-<connector name>. Both survive restarts.
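For contrast with a sink, here is a sketch of what a source connector configuration might look like, using the Confluent JDBC source connector. The connector name, table, id column, and topic prefix are hypothetical, and the worker address is assumed to be localhost:8083. The script only defines the config and a helper; nothing is sent until create_source is called.

```shell
#!/bin/sh
# Assumed worker address; override with the CONNECT_URL environment variable.
CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"

# Hypothetical source config: connector name, table, and column are placeholders.
SOURCE_CONFIG='{
  "name": "orders-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://db:5432/shop",
    "connection.user": "connect",
    "connection.password": "secret",
    "table.whitelist": "orders",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "pg-"
  }
}'

# POST the config to the worker. Call this only when a worker is reachable.
create_source() {
  curl -s -X POST "$CONNECT_URL/connectors" \
    -H "Content-Type: application/json" \
    -d "$SOURCE_CONFIG"
}
```

In this sketch the rows of the orders table would land in a topic named pg-orders, and the position the connector resumes from is the last id it has seen, kept as a source offset rather than as a consumer-group offset.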

Why use Connect instead of bespoke clients

You could write a Spring Boot service with a KafkaConsumer that reads a topic and upserts rows into a database. For one-off needs that is fine, but it pushes a long list of concerns onto you that Connect already solves:

Scaling and rebalancing: raise tasks.max and Connect spreads the tasks across workers; add a worker and tasks rebalance automatically.
Offset management: Connect commits offsets periodically and, after a crash, resumes from the last committed offset with no custom bookkeeping. Records processed after that commit may be delivered again, which is what at-least-once means in practice.
Retries and dead-letter handling: built-in retry with backoff for failed operations and, for sink connectors, a configurable dead-letter queue for poison records.
Schema and format conversion: pluggable converters (JSON, Avro, Protobuf) decouple the serialization format used in Kafka from the connector code.
Operations: health, status, and reconfiguration through a uniform REST API and JMX metrics, instead of N hand-built services.

A good rule of thumb: if the work is “move data between Kafka and a known system with no business logic,” use Kafka Connect. Reserve custom producers/consumers and Kafka Streams for transformation, enrichment, and application logic.

Submitting a connector

A connector is just configuration. In distributed mode you POST it to the REST API of any worker; here is a minimal sink that writes the orders topic to a PostgreSQL database over JDBC.

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "orders-jdbc-sink",
    "config": {
      "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
      "tasks.max": "2",
      "topics": "orders",
      "connection.url": "jdbc:postgresql://db:5432/shop",
      "connection.user": "connect",
      "connection.password": "secret",
      "auto.create": "true",
      "insert.mode": "upsert",
      "pk.mode": "record_key"
    }
  }'

Check that it came up and that both tasks are running:

curl -s http://localhost:8083/connectors/orders-jdbc-sink/status

Output:

{"name":"orders-jdbc-sink","connector":{"state":"RUNNING","worker_id":"10.0.0.4:8083"},
 "tasks":[{"id":0,"state":"RUNNING","worker_id":"10.0.0.4:8083"},
          {"id":1,"state":"RUNNING","worker_id":"10.0.0.5:8083"}],"type":"sink"}
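A status document like the one above is easy to check from a script. The following is a rough sketch, not a robust health check: count_not_running and restart_failed are hypothetical helper names, the parsing uses grep rather than a JSON parser, and the sample status has one task changed to FAILED purely for illustration. The restart endpoint with includeTasks and onlyFailed assumes workers running Kafka 3.0 or later.

```shell
#!/bin/sh
# Assumed worker address; override with the CONNECT_URL environment variable.
CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"

# Count connector/task states other than RUNNING in a status document on stdin.
count_not_running() {
  grep -o '"state" *: *"[A-Z_]*"' | grep -vc 'RUNNING' || true
}

# Restart only the failed connector and task instances (assumes Kafka 3.0+).
restart_failed() {
  curl -s -X POST "$CONNECT_URL/connectors/$1/restart?includeTasks=true&onlyFailed=true"
}

# Sample status with task 1 marked FAILED for illustration.
SAMPLE_STATUS='{"name":"orders-jdbc-sink","connector":{"state":"RUNNING","worker_id":"10.0.0.4:8083"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"10.0.0.4:8083"},{"id":1,"state":"FAILED","worker_id":"10.0.0.5:8083"}],"type":"sink"}'

printf '%s\n' "$SAMPLE_STATUS" | count_not_running   # prints 1
```

Against a live worker, the same check would read curl -s "$CONNECT_URL/connectors/orders-jdbc-sink/status" | count_not_running, followed by restart_failed orders-jdbc-sink if the count is not zero.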

The connector ecosystem

You rarely need to write a connector. Confluent Hub (hub.confluent.io) is a catalog of hundreds of community and commercial connectors for databases, cloud storage, message queues, and SaaS systems. Install them with the confluent-hub CLI or by unpacking the plugin into a directory listed in the worker’s plugin.path, then restart the worker so that it loads the new plugin.

confluent-hub install confluentinc/kafka-connect-jdbc:latest

Notable building blocks in the ecosystem include Debezium for change-data-capture sources, the Confluent S3 and JDBC connectors, and Single Message Transforms (SMTs) for lightweight per-record tweaks. Only when no existing plugin fits your system do you implement the SourceConnector/SinkConnector APIs yourself.
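After installing a plugin, it is worth confirming that the worker actually loaded it. The sketch below uses the worker's /connector-plugins endpoint; list_plugins and has_plugin are hypothetical helper names, the worker address is assumed to be localhost:8083, and the match is a plain string search rather than real JSON parsing.

```shell
#!/bin/sh
# Assumed worker address; override with the CONNECT_URL environment variable.
CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"

# Fetch the list of connector plugins this worker has loaded.
list_plugins() {
  curl -s "$CONNECT_URL/connector-plugins"
}

# Succeed if the plugin list on stdin contains the given connector class.
has_plugin() {
  grep -qF "\"class\":\"$1\""
}
```

A typical use would be list_plugins | has_plugin io.confluent.connect.jdbc.JdbcSinkConnector && echo "JDBC sink is installed". If the class is missing, the usual causes are a wrong plugin.path or a worker that has not been restarted since the install.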

Best practices

Prefer Connect over custom clients for plain data movement; keep transformation and business logic in Kafka Streams or your application.
Run Connect in distributed mode in production (even on a single node) so configs, offsets, and status live in Kafka and survive restarts.
Set tasks.max to match the available parallelism: for a sink, the number of topic partitions; for a source, whatever the connector can split on (for example the number of tables). More tasks than work gives you idle tasks, not more throughput.
Standardize on one converter (Avro or JSON Schema with a Schema Registry) across connectors so producers and consumers agree on the wire format.
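Configure a dead-letter queue on sink connectors (errors.tolerance=all, errors.deadletterqueue.topic.name=...) so a single bad record is routed aside instead of failing the task.
Pin connector plugin versions and give each plugin its own subdirectory under a plugin.path directory; Connect loads each one in an isolated classloader, which avoids classpath conflicts between connectors.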
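As a sketch of the dead-letter-queue practice, the orders-jdbc-sink connector from earlier could be reconfigured with a PUT to its config endpoint. PUT replaces the whole configuration, so the original settings are repeated alongside the new errors.* keys. The DLQ topic name dlq-orders and the retry values are placeholders, and update_sink is a hypothetical helper that is only defined here, not called.

```shell
#!/bin/sh
# Assumed worker address; override with the CONNECT_URL environment variable.
CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"

# Full sink config plus error handling. PUT takes the flat config map,
# without the "name"/"config" wrapper used by POST /connectors.
SINK_CONFIG='{
  "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
  "tasks.max": "2",
  "topics": "orders",
  "connection.url": "jdbc:postgresql://db:5432/shop",
  "connection.user": "connect",
  "connection.password": "secret",
  "auto.create": "true",
  "insert.mode": "upsert",
  "pk.mode": "record_key",
  "errors.retry.timeout": "60000",
  "errors.retry.delay.max.ms": "5000",
  "errors.tolerance": "all",
  "errors.deadletterqueue.topic.name": "dlq-orders",
  "errors.deadletterqueue.context.headers.enable": "true"
}'

# Create or update the connector. Call this only when a worker is reachable.
update_sink() {
  curl -s -X PUT "$CONNECT_URL/connectors/orders-jdbc-sink/config" \
    -H "Content-Type: application/json" \
    -d "$SINK_CONFIG"
}
```

With the context headers enabled, each record sent to the dead-letter topic carries headers describing the failure, which makes it easier to see why a record was rejected. The DLQ topic's replication factor defaults to 3, so a single-broker test cluster would also need errors.deadletterqueue.topic.replication.factor set to 1.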