Dead-Letter Queues & Poison-Message Handling
A dead-letter queue (DLQ) is the holding pen for messages that a worker cannot process successfully, no matter how many times the broker redelivers them. Without one, a single malformed payload can loop through your workers forever, burning CPU, inflating retry metrics, and starving healthy jobs of capacity. This guide is part of Queue Fundamentals & Architecture and covers the full lifecycle: why messages turn poisonous, how maxReceiveCount and redrive policies decide when to quarantine them, how the major brokers implement DLQs, and how to monitor, alert on, and replay the messages that land there.
The core problem is that at-least-once delivery — the default contract for nearly every queue — means a message is redelivered whenever a worker fails to acknowledge it. That is exactly what you want for transient failures like a dropped database connection. It is exactly what you do not want for a permanently broken message, because the broker has no way to tell the two cases apart. A DLQ closes that gap by counting redeliveries and diverting any message that exceeds a threshold, so transient failures still retry while permanent failures get isolated for human or automated review.
What Makes a Message Poison
A "poison message" is any message that fails processing deterministically — the same input produces the same failure every time. Re-delivering it is pointless and harmful. The common causes cluster into a few categories:
- Deserialization and schema errors. The producer emitted a payload the consumer cannot parse: a renamed field, a JSON document where the worker expected Protobuf, an enum value added after the consumer was deployed. These fail before business logic even runs. See Message Size Limits & Serialization for how schema drift creeps in.
- Referential failures. The job references a row, file, or tenant that was deleted between enqueue and dequeue. The handler throws NotFound on every attempt.
- Logic bugs triggered by specific data. A division by zero, an unhandled null, an off-by-one on an empty collection: defects that only fire for certain inputs.
- Timeouts on oversized work. A job that legitimately cannot finish inside the visibility timeout is redelivered mid-flight, never acknowledged, and counts as a failure on each pass even though nothing is "wrong" with the payload.
- Downstream hard-failures. A third-party API returns a permanent 403 or 422 for this specific request, distinct from a transient 503.
The distinction that matters operationally is transient vs. permanent. Retries exist to paper over transient faults. The DLQ exists to catch the permanent ones before they consume infinite retries. Everything in DLQ design is a mechanism for drawing that line automatically.
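One way to make the transient-versus-permanent line explicit in a handler is a small classifier. The sketch below is illustrative only: the status-code sets and the exception groupings are assumptions chosen to match the categories listed above, not a standard, and `FailureKind`, `classify_http_failure`, and `classify_exception` are hypothetical names.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"  # worth retrying with backoff
    PERMANENT = "permanent"  # dead-letter without further retries


# Hypothetical mapping for illustration; tune it to your own downstreams.
PERMANENT_STATUSES = {400, 403, 404, 410, 422}
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}


def classify_http_failure(status: int) -> FailureKind:
    """Decide whether a failed downstream call is worth retrying."""
    if status in PERMANENT_STATUSES:
        return FailureKind.PERMANENT
    # Known-transient and unknown statuses both retry; the receive
    # ceiling bounds the damage if the guess turns out to be wrong.
    return FailureKind.TRANSIENT


def classify_exception(exc: Exception) -> FailureKind:
    """Parse, lookup, and arithmetic errors are treated as deterministic."""
    if isinstance(exc, (ValueError, LookupError, TypeError, ZeroDivisionError)):
        return FailureKind.PERMANENT
    # Connection resets, timeouts, and anything unrecognized get a retry.
    return FailureKind.TRANSIENT


print(classify_http_failure(422).value)
print(classify_exception(ConnectionResetError()).value)
```

Defaulting unknown failures to transient is a deliberate choice here: a misclassified poison message still reaches the DLQ once its retries run out, whereas a misclassified transient fault would be quarantined needlessly.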
There is a second-order cost worth naming explicitly: a poison message that loops is not just wasted work, it is contended work. In a queue with bounded worker concurrency, every slot a poison message occupies is a slot a healthy message cannot use, so a single fast-failing payload can degrade throughput for the entire workload. This is the head-of-line effect, and it is why "just let it retry forever" is never a viable strategy at scale — the failure does not stay contained to one message. A DLQ converts an unbounded, contended retry loop into a bounded one, capping the blast radius of any single bad payload at maxReceiveCount attempts.
How Redelivery Counting Drives Quarantine
Every DLQ mechanism rests on a counter. The broker tracks how many times a message has been delivered without acknowledgment, compares that count against a configured ceiling, and moves the message to the dead-letter destination once the ceiling is exceeded. The names differ — SQS calls it maxReceiveCount, RabbitMQ derives it from rejection or expiry events, BullMQ calls it attempts — but the principle is identical.
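The counting principle can be shown with a toy in-memory queue. This is a sketch of the idea, not any broker's actual algorithm: `ToyQueue` and `Message` are hypothetical names, acknowledgment is modeled as "the handler returned without raising", and the exact point at which a real broker moves a message differs by product.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    body: str
    receive_count: int = 0


@dataclass
class ToyQueue:
    """In-memory stand-in for a broker; not a real client."""
    max_receive_count: int
    main: deque = field(default_factory=deque)
    dlq: list = field(default_factory=list)

    def send(self, body: str) -> None:
        self.main.append(Message(body))

    def drain(self, handler: Callable[[str], None]) -> None:
        while self.main:
            msg = self.main.popleft()
            if msg.receive_count >= self.max_receive_count:
                # Ceiling already reached: quarantine instead of delivering again.
                self.dlq.append(msg)
                continue
            msg.receive_count += 1
            try:
                handler(msg.body)      # returning normally counts as an ack
            except Exception:
                self.main.append(msg)  # unacknowledged: back for redelivery


def handler(body: str) -> None:
    if body == "poison":
        raise ValueError("cannot parse payload")


q = ToyQueue(max_receive_count=3)
for body in ("ok-1", "poison", "ok-2"):
    q.send(body)
q.drain(handler)
print([m.body for m in q.dlq], q.dlq[0].receive_count)
```

The healthy messages are acknowledged on their first delivery, while the poison message is delivered three times and then diverted, so the loop terminates instead of spinning forever.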
The ceiling is a trade-off. Set it too low and you dead-letter messages that would have succeeded on a second attempt after a transient blip, creating noise and operational toil. Set it too high and a genuine poison message wastes dozens of worker-seconds before quarantine, and your DLQ fills slowly enough that you notice the problem late. A practical starting point for most workloads is three to five receives, paired with exponential backoff between attempts so transient faults get real time to clear. Latency-sensitive queues with cheap idempotent retries can go higher; expensive jobs (large ETL passes, paid API calls) should go lower.
Backoff and the receive count interact. With no delay between retries, a poison message can exhaust maxReceiveCount in milliseconds and reach the DLQ almost instantly — fine for deserialization errors, bad for a downstream service that is momentarily overloaded. Spacing retries out gives transient failures room to recover before the counter runs out. This is also where DLQ design overlaps with exactly-once vs at-least-once delivery: because every retry is a re-delivery, your handler must be idempotent, or each attempt before the DLQ can leave a partial side effect behind.
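The interaction between backoff and the ceiling is easy to quantify. The sketch below assumes a plain exponential schedule (base delay doubled after each failed attempt, with an upper cap); the function names and the 900-second cap are illustrative assumptions, and real brokers may add jitter or compute delays differently.

```python
def backoff_delays(max_receive_count: int, base_seconds: float,
                   factor: float = 2.0, cap_seconds: float = 900.0) -> list:
    """Delays inserted between consecutive attempts (one fewer than the attempts)."""
    return [min(base_seconds * factor ** i, cap_seconds)
            for i in range(max_receive_count - 1)]


def seconds_until_quarantine(max_receive_count: int, base_seconds: float) -> float:
    """Total waiting time a failing message gets before its final attempt."""
    return float(sum(backoff_delays(max_receive_count, base_seconds)))


print(backoff_delays(5, 2.0))            # [2.0, 4.0, 8.0, 16.0]
print(seconds_until_quarantine(5, 2.0))  # 30.0
print(seconds_until_quarantine(5, 0.0))  # 0.0
```

With five attempts and a 2-second base, a transient fault has 30 seconds to clear before the message is quarantined; with no delay it has effectively none, which is why a ceiling without backoff should be set lower.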
DLQ Design Across Brokers
AWS SQS — Redrive Policy
SQS implements dead-lettering with a RedrivePolicy attached to the source queue. It names a target DLQ ARN and a maxReceiveCount. The broker increments an approximate receive count per message; once a receive pushes the count over the limit and the message is not deleted, SQS moves it to the DLQ on the next expiry.
# terraform/sqs-dlq.tf
resource "aws_sqs_queue" "dlq" {
  name                      = "orders-dlq"
  message_retention_seconds = 1209600 # 14 days: keep poison messages long enough to triage
}
resource "aws_sqs_queue" "main" {
  name                       = "orders"
  visibility_timeout_seconds = 60 # must exceed p99 job duration to avoid false redelivery
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5 # 5 failed receives -> quarantine
  })
}
# Restrict which source queues may redrive into this DLQ
resource "aws_sqs_queue_redrive_allow_policy" "dlq_allow" {
  queue_url = aws_sqs_queue.dlq.id
  redrive_allow_policy = jsonencode({
    redrivePermission = "byQueue"
    sourceQueueArns   = [aws_sqs_queue.main.arn]
  })
}
The full walkthrough — CLI, console, alarms, and moving messages back — lives in Configuring an SQS redrive policy.
RabbitMQ — Dead-Letter Exchange
RabbitMQ has no built-in receive counter. Instead, a queue declares a dead-letter exchange (DLX) via x-dead-letter-exchange, and messages are dead-lettered when they are rejected with requeue=false, when they expire (TTL), or when the queue overflows. To emulate maxReceiveCount, you read the x-death header — RabbitMQ stamps each dead-lettering event there with a count — and reject permanently once it crosses your threshold.
# pika: declare a work queue that dead-letters rejected messages to a DLX
import pika
conn = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq-host"))
ch = conn.channel()
# The dead-letter exchange and its backing queue
ch.exchange_declare(exchange="dlx", exchange_type="direct", durable=True)
ch.queue_declare(queue="orders.dlq", durable=True)
ch.queue_bind(queue="orders.dlq", exchange="dlx", routing_key="orders")
# The main work queue points failed messages at the DLX
ch.queue_declare(
    queue="orders",
    durable=True,
    arguments={
        "x-dead-letter-exchange": "dlx",
        "x-dead-letter-routing-key": "orders",
    },
)
The handler inspects x-death to decide between retry and permanent rejection:
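# process, PermanentError, and TransientError are application-defined
def on_message(ch, method, props, body):
    # x-death is a list of dead-lettering events; each entry carries a count
    deaths = (props.headers or {}).get("x-death", [])
    redeliveries = deaths[0]["count"] if deaths else 0
    try:
        process(body)
        ch.basic_ack(method.delivery_tag)
    except PermanentError:
        # Bad payload: send straight to the DLQ, no requeue
        ch.basic_reject(method.delivery_tag, requeue=False)
    except TransientError:
        if redeliveries >= 5:
            ch.basic_reject(method.delivery_tag, requeue=False)  # exhausted -> DLQ
        else:
            # Caveat: a plain requeue does not pass through the DLX, so it does not
            # advance x-death. The count only grows when retries are routed through
            # a dead-lettering retry queue (not declared above).
            ch.basic_nack(method.delivery_tag, requeue=True)  # try again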
The step-by-step recovery procedure is in Replaying dead-lettered messages in RabbitMQ.
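Reading the x-death header robustly takes slightly more care than indexing the first entry, because the header can hold one entry per queue-and-reason pair. The helper below is a sketch that assumes the entry layout RabbitMQ documents for AMQP 0-9-1 (a list of tables with `queue`, `reason`, and `count` keys); `death_count` is a hypothetical name, and the sample headers, including the `orders.retry` queue, are made up for illustration.

```python
from typing import Any, Mapping, Optional, Tuple


def death_count(headers: Optional[Mapping[str, Any]], queue: str,
                reasons: Tuple[str, ...] = ("rejected", "expired")) -> int:
    """Sum the x-death counts recorded for `queue`, limited to the given reasons."""
    deaths = (headers or {}).get("x-death") or []
    return sum(int(entry.get("count", 0)) for entry in deaths
               if entry.get("queue") == queue and entry.get("reason") in reasons)


# Illustrative headers for a message that cycled between a work queue and a
# TTL-backed retry queue; the first entry belongs to the retry queue.
sample_headers = {
    "x-death": [
        {"queue": "orders.retry", "reason": "expired", "count": 2},
        {"queue": "orders", "reason": "rejected", "count": 3},
    ]
}

print(death_count(sample_headers, "orders"))
print(death_count(None, "orders"))
```

Filtering by queue name means the threshold check counts only rejections from the work queue itself, rather than whichever event happens to sit first in the list.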
Redis / BullMQ — Attempts and the Failed Set
BullMQ does not use a separate physical queue for dead letters by default. A job that exhausts its attempts moves into the failed set, which functions as the DLQ: jobs sit there with their stack trace and full payload until you retry or remove them. You can promote this into a true secondary queue by re-enqueuing exhausted jobs onto a dedicated *.dead queue from a failed listener.
// BullMQ: cap attempts, back off, and forward fully-failed jobs to a dead queue
import { Queue, Worker } from "bullmq";
import { Redis } from "ioredis";
const connection = new Redis({ maxRetriesPerRequest: null });
// attempts and backoff are job options: set them as queue defaults or per job in add()
const queue = new Queue("orders", {
  connection,
  defaultJobOptions: {
    attempts: 5, // total tries before "failed"
    backoff: { type: "exponential", delay: 2000 }, // 2s, 4s, 8s, ... between tries
  },
});
const deadQueue = new Queue("orders.dead", { connection });
// processOrder is your application handler (named to avoid Node's global `process`)
const worker = new Worker("orders", async (job) => processOrder(job.data), { connection });
worker.on("failed", async (job, err) => {
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    // Exhausted: quarantine with diagnostic context
    await deadQueue.add("dead", { original: job.data, error: err.message, jobId: job.id });
  }
});
Broker Comparison
| Capability | AWS SQS | RabbitMQ | Redis / BullMQ |
|---|---|---|---|
| Counter mechanism | maxReceiveCount (server-side) | x-death header count (manual check) | attempts / attemptsMade |
| DLQ destination | Separate SQS queue (ARN) | Dead-letter exchange + queue | failed set or a dedicated dead queue |
| Trigger | Receive count exceeded | Reject requeue=false, TTL, overflow | Attempts exhausted |
| Built-in backoff | None (retry delay is the visibility timeout) | None (use plugin/delayed exchange) | Native backoff option |
| Replay tooling | StartMessageMoveTask, console redrive | Shovel plugin / custom script | job.retry() / re-enqueue |
| Native depth metric | ApproximateNumberOfMessagesVisible | messages per queue (mgmt API) | getJobCounts() |
Failure Modes & Recovery
The DLQ that nobody watches. The most common failure is operational, not technical: messages pile up in a DLQ with no alarm, and the team discovers thousands of lost orders weeks later. A DLQ without monitoring is a silent black hole. Always alert on depth (covered below) before you ship the queue.
The poisoned DLQ. If your DLQ itself has a RedrivePolicy or DLX pointing somewhere, a message can be dead-lettered out of the DLQ and lost. DLQs should be terminal — no onward redrive — with long retention instead.
Replay storms. Bulk-redriving a full DLQ back to the main queue without first fixing the root cause simply re-poisons it, often worse, because every message fails again and may now also overwhelm a downstream that recovered in the meantime. Always fix the defect, then replay in controlled batches.
Lost context. A bare message in a DLQ tells you nothing about why it failed. Stamp failure metadata (exception class, stack trace, attempt count, timestamp) onto the message or an adjacent record before quarantining, as the BullMQ example does. This turns triage from hours into minutes. Celery's own retry and exception flow is detailed in Celery task retry and error handling.
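A minimal sketch of stamping that metadata, using only the standard library, might look like the following. The envelope field names (`original`, `source_queue`, `failed_at`, and so on) and the `build_dead_letter` helper are assumptions for illustration, not a broker convention.

```python
import json
import traceback
from datetime import datetime, timezone


def build_dead_letter(payload: dict, exc: BaseException,
                      attempts: int, source_queue: str) -> str:
    """Wrap the original payload with the context needed for triage."""
    envelope = {
        "original": payload,
        "source_queue": source_queue,
        "attempts": attempts,
        "failed_at": datetime.now(timezone.utc).isoformat(),
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }
    return json.dumps(envelope)


try:
    {}["tenant_id"]  # simulate a referential failure
except KeyError as exc:
    record = json.loads(
        build_dead_letter({"order_id": 42}, exc, attempts=5, source_queue="orders")
    )

print(record["error_type"], record["attempts"])
```

Keeping the original payload nested and untouched means a later replay can re-enqueue exactly what the producer sent, while the surrounding fields stay available for classification.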
Visibility-timeout false positives. Messages that time out rather than truly fail will march toward the DLQ even though the payload is valid. If your DLQ fills with timeouts, the fix is the timeout, not the retry ceiling — revisit the visibility timeout deep dive.
Monitoring, Alerting, and Replay Workflows
DLQ depth is one of the highest-signal metrics in a queue system: in steady state it should be at or near zero, so any sustained non-zero value is a real incident. Export depth to your metrics backend and alert aggressively. Broader instrumentation patterns are covered under Observability & Monitoring for Job Queues.
# Alert: any messages sitting in a dead-letter queue
- alert: DeadLetterQueueNonEmpty
  expr: aws_sqs_approximate_number_of_messages_visible{queue_name=~".*-dlq"} > 0
  for: 5m
  labels: { severity: critical }
  annotations:
    summary: "Messages in DLQ {{ $labels.queue_name }}"
    description: "{{ $value }} message(s) quarantined; investigate root cause before redriving."
# Alert: rapid growth indicates an active poison-message storm
- alert: DeadLetterQueueGrowing
  expr: delta(aws_sqs_approximate_number_of_messages_visible{queue_name=~".*-dlq"}[10m]) > 50
  for: 5m
  labels: { severity: critical }
  annotations:
    summary: "DLQ {{ $labels.queue_name }} growing fast"
Depth alone is not the whole picture. Two supporting signals make DLQ monitoring actionable. The first is age of the oldest message (ApproximateAgeOfOldestMessage on SQS), which tells you whether a quarantined message is fresh or has been silently rotting for days — a high age on a low-depth DLQ is often a more urgent signal than a spiking depth, because it means a real failure went unnoticed. The second is the dead-letter rate — the derivative of depth — which distinguishes a one-off bad payload from an active storm caused by a deploy regression or a downstream outage. Alert on depth for "anything is wrong," on rate for "something just broke," and on age for "we have been ignoring this."
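The three signals can be combined in a few lines of evaluation logic. This is a sketch under stated assumptions: `DlqSample` and `dlq_alerts` are hypothetical names, the growth threshold of 50 over the window mirrors the alert rule above, and the 24-hour age limit is an arbitrary illustrative default.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DlqSample:
    depth: int                 # messages currently in the DLQ
    depth_window_start: int    # depth at the start of the comparison window
    oldest_age_seconds: float  # age of the oldest quarantined message


def dlq_alerts(sample: DlqSample, growth_threshold: int = 50,
               max_age_seconds: float = 24 * 3600) -> List[str]:
    """Map one sample to the alerts it should raise."""
    alerts = []
    if sample.depth > 0:
        alerts.append("non-empty")  # anything is wrong
    if sample.depth - sample.depth_window_start > growth_threshold:
        alerts.append("growing")    # something just broke
    if sample.depth > 0 and sample.oldest_age_seconds > max_age_seconds:
        alerts.append("stale")      # we have been ignoring this
    return alerts


print(dlq_alerts(DlqSample(depth=2, depth_window_start=2, oldest_age_seconds=3 * 86400)))
print(dlq_alerts(DlqSample(depth=120, depth_window_start=10, oldest_age_seconds=300)))
```

The first sample is the low-depth, high-age case described above: nothing is growing, yet the stale alert fires because two messages have sat unnoticed for three days.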
A disciplined replay workflow has four stages. First, inspect: pull a sample of dead-lettered messages and read the failure metadata to classify the root cause. Second, fix: deploy the code or data correction that addresses the cause — never replay before this. Third, replay in batches: move a small batch back to the source queue and confirm it drains cleanly before moving the rest, so a misdiagnosis fails small. Fourth, verify and discard: confirm the DLQ returns to zero and purge anything that is genuinely unrecoverable (truly malformed payloads with no valid interpretation) so depth alerts stay meaningful. Treat the discard step as deliberately as the replay step — silently leaving unrecoverable messages in the DLQ permanently defeats the "DLQ should be zero in steady state" invariant your alerts depend on.
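The batch-replay stage can be sketched independently of any broker. The version below works on an in-memory list and stops at the first batch that contains a failure; `replay_in_batches` is a hypothetical helper, and a real replay would move messages with the broker's own tooling rather than call the handler directly.

```python
from typing import Callable, List, Tuple


def replay_in_batches(dlq: List[dict], process: Callable[[dict], None],
                      batch_size: int = 10) -> Tuple[int, List[dict]]:
    """Replay batch by batch; stop at the first batch that contains a failure.

    Returns (number replayed successfully, messages still quarantined).
    """
    replayed = 0
    remaining = list(dlq)
    while remaining:
        batch, rest = remaining[:batch_size], remaining[batch_size:]
        failed = []
        for msg in batch:
            try:
                process(msg)
                replayed += 1
            except Exception:
                failed.append(msg)
        if failed:
            # Misdiagnosis or incomplete fix: keep the failures and every
            # message that has not been attempted yet.
            return replayed, failed + rest
        remaining = rest
    return replayed, []


def process(msg: dict) -> None:
    if msg["id"] == 12:  # one payload the fix did not cover
        raise ValueError("still broken")


count, left = replay_in_batches([{"id": i} for i in range(25)], process)
print(count, [m["id"] for m in left])
```

Stopping after the failing batch is what keeps a misdiagnosis small: the untouched tail of the DLQ stays quarantined until the remaining defect is understood.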
Because replay re-injects messages that already triggered side effects on earlier attempts, idempotency is non-negotiable. See Preventing duplicate job execution with idempotency for deduplication-key patterns that make replay safe.
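As a rough illustration of why a deduplication key makes replay safe, the sketch below skips any message whose key has already been processed. It is deliberately simplified: `IdempotentHandler` and the `dedup_key` field are hypothetical, and the in-memory set stands in for what would need to be a durable store with an atomic check-and-set in a real system.

```python
from typing import Callable, Set


class IdempotentHandler:
    """Skips messages whose deduplication key was already processed."""

    def __init__(self, apply: Callable[[dict], None]) -> None:
        self._apply = apply
        self._seen: Set[str] = set()  # stand-in for a durable store

    def handle(self, message: dict) -> bool:
        key = message["dedup_key"]  # hypothetical field set by the producer
        if key in self._seen:
            return False            # duplicate delivery or replay: no-op
        self._apply(message)        # run the side effect
        self._seen.add(key)         # record only after success, so failures can retry
        return True


charges = []
handler = IdempotentHandler(lambda m: charges.append(m["amount"]))
first = handler.handle({"dedup_key": "order-1", "amount": 10})
replayed = handler.handle({"dedup_key": "order-1", "amount": 10})
print(first, replayed, charges)
```

The second call models a replay from the DLQ: the message is accepted and acknowledged, but the charge is not applied a second time.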
Performance Tuning
- Right-size maxReceiveCount. Three to five for typical idempotent jobs; one to two for expensive or paid operations; higher only when retries are cheap and failures are usually transient.
- Pair the ceiling with backoff. Exponential backoff gives transient faults time to clear so they do not waste the receive budget. Without backoff, set the ceiling lower.
- Keep DLQ retention long. Fourteen days on SQS, durable queues on RabbitMQ — you want time to triage without losing data.
- Make DLQs terminal. No onward redrive, no DLX on the DLQ itself.
- Carry diagnostic metadata. Stamp exception, attempt count, and timestamp at quarantine time; it pays for itself on the first incident.
- Separate DLQs per workload class. A shared DLQ mixes unrelated failures and dilutes alerts; one DLQ per source queue keeps signals clean.
FAQ
What is the difference between a retry and a dead-letter queue?
A retry re-delivers the same message to a worker in the hope that a transient fault has cleared. A dead-letter queue is where a message goes after retries are exhausted — it is the giving-up mechanism, not the trying-again mechanism. You configure both together: the retry ceiling (maxReceiveCount / attempts) is the boundary at which retrying stops and dead-lettering begins.
Should I automatically replay messages from the DLQ?
Only after you have identified and fixed the root cause. Automatic blind replay re-poisons the queue because the messages will fail the same way. Automated replay is reasonable for narrow, well-understood transient categories (for example, a known downstream outage that has recovered), but a poison message caused by a bad payload or code bug must be triaged by a human first.
How big should maxReceiveCount be?
Start at three to five for ordinary idempotent jobs paired with exponential backoff. Lower it to one or two for expensive operations where wasted attempts are costly, and raise it only when retries are cheap and most failures are transient. The right number is the smallest one that still survives your normal rate of transient blips.
Do messages in a DLQ expire?
Yes — they are subject to the DLQ's own retention. On SQS, set message_retention_seconds high (up to the 14-day maximum) so quarantined messages survive long enough to triage. On RabbitMQ, use durable queues and avoid putting a TTL on the DLQ, or you will silently lose the very messages you wanted to preserve.
Related
- Queue Fundamentals & Architecture — the broader set of concepts this builds on.
- Visibility Timeout Deep Dive — timeout misconfiguration is a leading cause of false dead-lettering.
- Exactly-Once vs At-Least-Once Delivery — why redelivery happens and why DLQ replay demands idempotency.
- Replaying Dead-Letter Messages in RabbitMQ — step-by-step recovery using a dead-letter exchange.
- Configuring an SQS Redrive Policy — Terraform, console, and CLI setup with depth alarms.