Dead Letter & Retry Topics

In Kafka, records on a partition are consumed strictly in order. That ordering guarantee becomes a liability the moment one record cannot be processed: a malformed payload, a downstream service that is down, or a bug that throws on a specific message. If the consumer keeps retrying that record in place, the entire partition stalls behind it — a single “poison message” can halt your whole stream. The retry-topic plus dead-letter-topic (DLT) pattern solves this by moving the failing record off the main partition, retrying it with a delay, and parking it in a DLT after a maximum number of attempts so the main flow keeps moving.

The core idea

Instead of blocking, you forward a failed record to a dedicated retry topic. A consumer of that retry topic waits a short backoff, then tries again. If it still fails, it is forwarded to the next retry topic (often with a longer delay), and so on. Once the configured attempt limit is exhausted, the record lands in the dead-letter topic, where it sits untouched until a human or an automated job decides what to do with it. The main topic’s partition is never held hostage by one bad record.

This is non-blocking retry: the original consumer commits the offset and advances as soon as it has handed the record off to a retry topic. Good records keep flowing while bad ones are quarantined and retried on a separate timeline.

Flow at a glance

                 +------------------+
   producers --> |   orders (main)  |
                 +--------+---------+
                          | consume
                          v
                 +------------------+   fail   +-------------------------+
                 |  OrderListener   |--------> | orders-retry-0  (~5s)   |
                 +------------------+          +-----------+-------------+
                          ^ success                        | fail
                          |                                v
                          |                    +-------------------------+
                          |       success      | orders-retry-1  (~30s)  |
                          +--------------------+-----------+-------------+
                                                           | fail (max)
                                                           v
                                               +-------------------------+
                                               | orders-dlt (parked)     |
                                               +-------------------------+

Non-blocking retry with @RetryableTopic

Spring for Apache Kafka ships this whole machinery behind a single annotation. @RetryableTopic auto-creates the retry and dead-letter topics, wires up the backoff, and forwards records between them. Add @DltHandler to a method to receive records that exhausted all attempts.

@Component
public class OrderListener {

    private final OrderService orderService;

    public OrderListener(OrderService orderService) {
        this.orderService = orderService;
    }

    @RetryableTopic(
        attempts = "4",
        backoff = @Backoff(delay = 5000, multiplier = 6.0),
        topicSuffixingStrategy = TopicSuffixingStrategy.SUFFIX_WITH_INDEX_VALUE,
        dltStrategy = DltStrategy.FAIL_ON_ERROR,
        exclude = { DeserializationException.class }
    )
    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void handle(OrderPlaced event) {
        orderService.process(event); // may throw -> triggers retry flow
    }

    @DltHandler
    public void onDlt(OrderPlaced event,
                      @Header(KafkaHeaders.ORIGINAL_TOPIC) String topic) {
        log.error("Order parked in DLT from {}: {}", topic, event);
    }
}

With attempts = "4" and a multiplier of 6.0, Spring creates orders-retry-0, orders-retry-1, orders-retry-2, and orders-dlt, with backoffs of roughly 5s, 30s, and 180s. The OrderPlaced event is a record:

public record OrderPlaced(String orderId, String customerId, BigDecimal total) {}

Use exclude (or include) to keep non-retryable errors out of the retry loop. A DeserializationException will never succeed on retry — send it straight to the DLT instead of wasting three rounds of backoff on it.

Retryable vs non-retryable failures

Not every exception deserves a retry. Classify them deliberately.

Failure type	Example	Action
Transient	Downstream HTTP 503, timeout, lock contention	Retry with backoff
Rate limited	Provider returns “429”	Retry with longer backoff
Permanent (data)	Bad schema, missing required field	Skip retries, go to DLT
Permanent (logic)	Business rule violation	Go to DLT, alert

Monitoring the DLT

A DLT that nobody watches is just a silent data-loss sink. Treat the dead-letter topic as a first-class operational signal: alert on its message rate and lag. Records arrive with diagnostic headers that explain why they failed.

kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic orders-dlt --from-beginning --max-messages 1 \
  --property print.headers=true

Output:

kafka_dlt-exception-fqcn:java.lang.IllegalStateException,
kafka_dlt-exception-message:inventory service unavailable,
kafka_dlt-original-topic:orders,
kafka_dlt-original-partition:2,
kafka_dlt-original-offset:48817
{"orderId":"A-1042","customerId":"C-77","total":129.50}

Set an alert on orders-dlt such as “any record in the last 5 minutes pages on-call.” The number of records in your DLT is a direct, business-meaningful measure of unhandled failures.

Reprocessing from the DLT

Once the root cause is fixed — the downstream service is back, or you’ve patched the bug — you replay the parked records. The simplest approach is a small consumer that reads the DLT and republishes each record to the original topic, where the normal flow picks it up again.

@KafkaListener(topics = "orders-dlt", groupId = "dlt-replay", autoStartup = "false")
public void replay(ConsumerRecord<String, OrderPlaced> rec) {
    kafkaTemplate.send("orders", rec.key(), rec.value());
}

Keep autoStartup = "false" so replay only runs when you deliberately start the listener (for example via an actuator endpoint or admin command), not continuously.

Best practices

Always set a finite attempts limit so records cannot retry forever; let them flow to the DLT and surface as an alert.
Use exponential backoff with a multiplier so transient outages have time to recover without hammering the downstream service.
exclude deserialization and validation errors from retries — they are deterministic and will never succeed.
Treat the DLT as a monitored queue: alert on its rate and lag, and never let it grow silently.
Preserve original headers (topic, partition, offset, exception) so every parked record is self-describing for debugging and replay.
Make reprocessing explicit and idempotent — replayed records may be processed more than once, so design process() to tolerate duplicates.
Size retry-topic backoffs to your SLA: long enough to outlast typical blips, short enough that recovery is timely.