Dead Letter & Retry Topics
In Kafka, records on a partition are consumed strictly in order. That ordering guarantee becomes a liability the moment one record cannot be processed: a malformed payload, a downstream service that is down, or a bug that throws on a specific message. If the consumer keeps retrying that record in place, the entire partition stalls behind it — a single “poison message” can halt your whole stream. The retry-topic plus dead-letter-topic (DLT) pattern solves this by moving the failing record off the main partition, retrying it with a delay, and parking it in a DLT after a maximum number of attempts so the main flow keeps moving.
The core idea
Instead of blocking, you forward a failed record to a dedicated retry topic. A consumer of that retry topic waits a short backoff, then tries again. If it still fails, it is forwarded to the next retry topic (often with a longer delay), and so on. Once the configured attempt limit is exhausted, the record lands in the dead-letter topic, where it sits untouched until a human or an automated job decides what to do with it. The main topic’s partition is never held hostage by one bad record.
This is non-blocking retry: the original consumer commits the offset and advances as soon as it has handed the record off to a retry topic. Good records keep flowing while bad ones are quarantined and retried on a separate timeline.
Flow at a glance
+------------------+
producers --> | orders (main) |
+--------+---------+
| consume
v
+------------------+ fail +-------------------------+
| OrderListener |--------> | orders-retry-0 (~5s) |
+------------------+ +-----------+-------------+
^ success | fail
| v
| +-------------------------+
| success | orders-retry-1 (~30s) |
+--------------------+-----------+-------------+
| fail (max)
v
+-------------------------+
| orders-dlt (parked) |
+-------------------------+
Non-blocking retry with @RetryableTopic
Spring for Apache Kafka ships this whole machinery behind a single annotation. @RetryableTopic auto-creates the retry and dead-letter topics, wires up the backoff, and forwards records between them. Add @DltHandler to a method to receive records that exhausted all attempts.
@Component
public class OrderListener {
private final OrderService orderService;
public OrderListener(OrderService orderService) {
this.orderService = orderService;
}
@RetryableTopic(
attempts = "4",
backoff = @Backoff(delay = 5000, multiplier = 6.0),
topicSuffixingStrategy = TopicSuffixingStrategy.SUFFIX_WITH_INDEX_VALUE,
dltStrategy = DltStrategy.FAIL_ON_ERROR,
exclude = { DeserializationException.class }
)
@KafkaListener(topics = "orders", groupId = "order-processor")
public void handle(OrderPlaced event) {
orderService.process(event); // may throw -> triggers retry flow
}
@DltHandler
public void onDlt(OrderPlaced event,
@Header(KafkaHeaders.ORIGINAL_TOPIC) String topic) {
log.error("Order parked in DLT from {}: {}", topic, event);
}
}
With attempts = "4" and a multiplier of 6.0, Spring creates orders-retry-0, orders-retry-1, orders-retry-2, and orders-dlt, with backoffs of roughly 5s, 30s, and 180s. The OrderPlaced event is a record:
public record OrderPlaced(String orderId, String customerId, BigDecimal total) {}
Use
exclude(orinclude) to keep non-retryable errors out of the retry loop. ADeserializationExceptionwill never succeed on retry — send it straight to the DLT instead of wasting three rounds of backoff on it.
Retryable vs non-retryable failures
Not every exception deserves a retry. Classify them deliberately.
| Failure type | Example | Action |
|---|---|---|
| Transient | Downstream HTTP 503, timeout, lock contention | Retry with backoff |
| Rate limited | Provider returns “429” | Retry with longer backoff |
| Permanent (data) | Bad schema, missing required field | Skip retries, go to DLT |
| Permanent (logic) | Business rule violation | Go to DLT, alert |
Monitoring the DLT
A DLT that nobody watches is just a silent data-loss sink. Treat the dead-letter topic as a first-class operational signal: alert on its message rate and lag. Records arrive with diagnostic headers that explain why they failed.
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
--topic orders-dlt --from-beginning --max-messages 1 \
--property print.headers=true
Output:
kafka_dlt-exception-fqcn:java.lang.IllegalStateException,
kafka_dlt-exception-message:inventory service unavailable,
kafka_dlt-original-topic:orders,
kafka_dlt-original-partition:2,
kafka_dlt-original-offset:48817
{"orderId":"A-1042","customerId":"C-77","total":129.50}
Set an alert on
orders-dltsuch as “any record in the last 5 minutes pages on-call.” The number of records in your DLT is a direct, business-meaningful measure of unhandled failures.
Reprocessing from the DLT
Once the root cause is fixed — the downstream service is back, or you’ve patched the bug — you replay the parked records. The simplest approach is a small consumer that reads the DLT and republishes each record to the original topic, where the normal flow picks it up again.
@KafkaListener(topics = "orders-dlt", groupId = "dlt-replay", autoStartup = "false")
public void replay(ConsumerRecord<String, OrderPlaced> rec) {
kafkaTemplate.send("orders", rec.key(), rec.value());
}
Keep autoStartup = "false" so replay only runs when you deliberately start the listener (for example via an actuator endpoint or admin command), not continuously.
Best practices
- Always set a finite
attemptslimit so records cannot retry forever; let them flow to the DLT and surface as an alert. - Use exponential backoff with a multiplier so transient outages have time to recover without hammering the downstream service.
excludedeserialization and validation errors from retries — they are deterministic and will never succeed.- Treat the DLT as a monitored queue: alert on its rate and lag, and never let it grow silently.
- Preserve original headers (topic, partition, offset, exception) so every parked record is self-describing for debugging and replay.
- Make reprocessing explicit and idempotent — replayed records may be processed more than once, so design
process()to tolerate duplicates. - Size retry-topic backoffs to your SLA: long enough to outlast typical blips, short enough that recovery is timely.