Configuring an SQS redrive policy
An SQS redrive policy is what turns a plain queue into one that automatically quarantines messages it cannot process. This guide is part of Dead-Letter Queues & Poison-Message Handling within Queue Fundamentals & Architecture, and it walks through configuring the policy three ways — Terraform, console, and CLI — then tuning maxReceiveCount, locking the destination down with a redrive-allow-policy, moving messages back with StartMessageMoveTask, and alarming on dead-letter queue depth.
Problem Statement
Your orders queue occasionally receives a payload the consumer cannot process — a deleted product reference, a schema the worker does not understand. Right now those messages redeliver indefinitely, so the same poison message reappears every visibility-timeout cycle, inflating your ReceiveMessage calls and never draining. You want SQS to move any message that fails five receives into a dedicated orders-dlq, alert you when that DLQ is non-empty, and give you a one-command path to move fixed messages back.
Prerequisites
- An AWS account with permissions for sqs:* and cloudwatch:PutMetricAlarm on the target queues.
- Two queues, or the ability to create them: the source (orders) and the dead-letter queue (orders-dlq). Both must be the same type: a standard DLQ for a standard source, FIFO for FIFO.
- AWS CLI v2 configured, and/or Terraform 1.4+ if using infrastructure-as-code.
- An idempotent consumer, because every receive before the DLQ threshold is a redelivery.
Step 1 — Create the Dead-Letter Queue
The DLQ is an ordinary SQS queue; what makes it a DLQ is that another queue points at it. Give it long retention so quarantined messages survive triage.
# Create the DLQ first — its ARN is needed by the source queue's redrive policy
aws sqs create-queue \
--queue-name orders-dlq \
--attributes MessageRetentionPeriod=1209600 # 14 days, the SQS maximum
# Capture its ARN for the next step
DLQ_ARN=$(aws sqs get-queue-attributes \
--queue-url "$(aws sqs get-queue-url --queue-name orders-dlq --output text)" \
--attribute-names QueueArn --query 'Attributes.QueueArn' --output text)
echo "$DLQ_ARN"
Step 2 — Attach the RedrivePolicy to the Source Queue
The RedrivePolicy is a JSON attribute on the source queue naming the DLQ ARN and the receive ceiling.
# Point orders at orders-dlq, quarantining after 5 failed receives
aws sqs set-queue-attributes \
--queue-url "$(aws sqs get-queue-url --queue-name orders --output text)" \
--attributes "{\"RedrivePolicy\":\"{\\\"deadLetterTargetArn\\\":\\\"$DLQ_ARN\\\",\\\"maxReceiveCount\\\":\\\"5\\\"}\"}"
The policy has exactly two fields:
- deadLetterTargetArn: the ARN of the DLQ from Step 1.
- maxReceiveCount: the number of receives after which SQS moves the message. A message is dead-lettered when a receive pushes its count past this value and it is not deleted.
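The nested escaping in the set-queue-attributes call above is easy to mistype. As a minimal sketch, the attribute JSON can be assembled by a small shell function instead. The function name redrive_attributes and the example ARN are illustrative assumptions, not part of the AWS CLI.

```shell
# Hypothetical helper: print the --attributes JSON for a RedrivePolicy.
# $1 = DLQ ARN, $2 = maxReceiveCount
redrive_attributes() {
  # RedrivePolicy is itself JSON carried as a string value, so its inner
  # quotes are backslash-escaped. In a printf format, \\ emits one backslash.
  printf '{"RedrivePolicy":"{\\"deadLetterTargetArn\\":\\"%s\\",\\"maxReceiveCount\\":\\"%s\\"}"}\n' "$1" "$2"
}

# Example with a made-up ARN; prints the escaped JSON to stdout
redrive_attributes "arn:aws:sqs:us-east-1:123456789012:orders-dlq" 5

# Possible use with the command from this step:
#   aws sqs set-queue-attributes --queue-url "$QUEUE_URL" \
#     --attributes "$(redrive_attributes "$DLQ_ARN" 5)"
# QUEUE_URL is an assumed variable holding the orders queue URL.
```

Keeping the escaping in one place means the ARN and the ceiling can change without anyone re-counting backslashes.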
One subtlety trips people up here: SQS tracks the receive count approximately (consumers see it as the ApproximateReceiveCount attribute) and never resets it. The counter lives with the message for its entire lifetime in the queue and stops only when the message is deleted, not when it is successfully processed by some other definition. This means the threshold counts total receives across all consumers and all visibility cycles, not consecutive failures: there is no notion of "it succeeded twice then failed", only "it has been received N times without being deleted". For an idempotent consumer that deletes on success, this is exactly the behavior you want; for any other design it is a reason to make your consumer idempotent.
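The counting rule can be modeled in a few lines of shell. This is only a sketch of the rule as described above: it does not call SQS, and the function name simulate_receives is a hypothetical one chosen for illustration.

```shell
# Model of the receive-count rule only; no AWS calls are made.
# $1 = maxReceiveCount, $2 = number of receive attempts to simulate
simulate_receives() {
  max="$1"
  attempts="$2"
  count=0
  while [ "$count" -lt "$attempts" ]; do
    # Every receive increments the counter, whichever consumer made it
    count=$((count + 1))
    if [ "$count" -gt "$max" ]; then
      echo "receive $count: count exceeds $max, moved to DLQ"
      return 0
    fi
    echo "receive $count: delivered, not deleted"
  done
}

# A message that is never deleted, with maxReceiveCount=2
simulate_receives 2 5
```

With a ceiling of 2, the message is delivered twice and the third receive pushes it past the threshold. Nothing in the model distinguishes a crash from a rejected payload, which is the point: only deletion stops the count.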
Step 3 — The Same Setup in Terraform
For reproducible infrastructure, define both queues, the redrive policy, and the redrive-allow-policy together.
# sqs.tf
resource "aws_sqs_queue" "dlq" {
name = "orders-dlq"
message_retention_seconds = 1209600 # 14 days to triage
}
resource "aws_sqs_queue" "orders" {
name = "orders"
visibility_timeout_seconds = 60 # must exceed p99 processing time
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.dlq.arn
maxReceiveCount = 5 # tune per workload (see Step 4)
})
}
# Lock the DLQ so only the orders queue may redrive into it
resource "aws_sqs_queue_redrive_allow_policy" "dlq" {
queue_url = aws_sqs_queue.dlq.id
redrive_allow_policy = jsonencode({
redrivePermission = "byQueue"
sourceQueueArns = [aws_sqs_queue.orders.arn]
})
}
The redrive_allow_policy on the DLQ is the inverse control: it declares which source queues are permitted to use this queue as their dead-letter target. Setting redrivePermission = "byQueue" with an explicit sourceQueueArns list prevents an unrelated queue from dumping unrelated failures into your DLQ and muddying your alarms. The alternatives are allowAll (any queue, the permissive default) and denyAll (none).
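For the CLI route, the same control is set as the RedriveAllowPolicy attribute on the DLQ. The sketch below builds that attribute JSON for the byQueue case; the function name redrive_allow_attributes and the variables DLQ_URL and ORDERS_ARN are assumptions made for illustration.

```shell
# Hypothetical helper: print the --attributes JSON for a byQueue
# RedriveAllowPolicy. Arguments are the permitted source queue ARNs.
redrive_allow_attributes() {
  arns=""
  for arn in "$@"; do
    # Each ARN becomes a quoted array element; the quotes are escaped
    # because the policy is JSON carried inside a JSON string.
    arns="${arns:+$arns,}\\\"$arn\\\""
  done
  printf '{"RedriveAllowPolicy":"{\\"redrivePermission\\":\\"byQueue\\",\\"sourceQueueArns\\":[%s]}"}\n' "$arns"
}

# Example with a made-up ARN; prints the escaped JSON to stdout
redrive_allow_attributes "arn:aws:sqs:us-east-1:123456789012:orders"

# Possible use (DLQ_URL and ORDERS_ARN are assumed to be set):
#   aws sqs set-queue-attributes --queue-url "$DLQ_URL" \
#     --attributes "$(redrive_allow_attributes "$ORDERS_ARN")"
```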
Step 4 — Tune maxReceiveCount
maxReceiveCount is the dial between noise and waste. Too low dead-letters messages that a transient blip would have cleared on the next attempt; too high burns worker time on genuine poison messages and delays detection.
| Workload | Suggested maxReceiveCount | Rationale |
|---|---|---|
| Cheap, idempotent jobs | 5 | Absorbs transient blips, still quarantines quickly |
| Expensive / paid API calls | 2-3 | Each wasted attempt costs money or rate-limit budget |
| Hard deserialization failures | 1-2 | Will never succeed; quarantine fast |
| Flaky downstream, mostly transient | 5-10 | Give recovery time, but pair with backoff |
SQS standard queues have no native inter-retry backoff, so a poison message can burn through maxReceiveCount in seconds. To space attempts out, raise the source queue's visibility_timeout_seconds or have the consumer call ChangeMessageVisibility with a backoff value on failure. Without that spacing, a higher ceiling only adds rapid-fire attempts; it does not give a flaky downstream time to recover. The trade-offs mirror those in the visibility timeout deep dive.
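One way to derive the backoff value is to double a base delay per receive and cap it. The sketch below assumes the consumer has read the message's ApproximateReceiveCount; the function name backoff_seconds, the 30-second base, and the 900-second cap are illustrative choices, not SQS defaults.

```shell
# Hypothetical helper: exponential backoff in seconds for a given receive count.
# $1 = receive count (1 for the first attempt)
# $2 = base delay in seconds (default 30), $3 = cap in seconds (default 900)
backoff_seconds() {
  attempt="$1"
  base="${2:-30}"
  cap="${3:-900}"
  exp=$((attempt - 1))
  # Clamp the exponent so the shift cannot overflow on large counts
  if [ "$exp" -lt 0 ]; then exp=0; fi
  if [ "$exp" -gt 20 ]; then exp=20; fi
  delay=$((base * (1 << exp)))
  if [ "$delay" -gt "$cap" ]; then delay="$cap"; fi
  echo "$delay"
}

# Third receive with the defaults: 30 * 2^2
backoff_seconds 3

# Possible use on failure (QUEUE_URL, RECEIPT_HANDLE, and RECEIVE_COUNT
# are assumed to come from the consumer's receive call):
#   aws sqs change-message-visibility --queue-url "$QUEUE_URL" \
#     --receipt-handle "$RECEIPT_HANDLE" \
#     --visibility-timeout "$(backoff_seconds "$RECEIVE_COUNT")"
```

Any cap chosen has to stay within the 12-hour maximum SQS allows for a visibility timeout.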
Step 5 — Configure via the Console
If you prefer the AWS console: open the source queue, choose Edit, scroll to Dead-letter queue, toggle Enabled, select orders-dlq from the dropdown, and set Maximum receives to your tuned value. Save. Then open orders-dlq, choose Edit, expand Redrive allow policy, select By queue, and add the orders queue ARN. This produces exactly the same attributes as Steps 2 and 3.
Step 6 — Move Messages Back With StartMessageMoveTask
Once you have fixed the root cause, redrive the quarantined messages. The console has a Start DLQ redrive button on the DLQ page; programmatically, use StartMessageMoveTask, which moves messages from the DLQ back to their original source queue.
# Redrive everything in orders-dlq back to its source, throttled to protect downstream
aws sqs start-message-move-task \
--source-arn "$DLQ_ARN" \
--max-number-of-messages-per-second 50 # cap the replay rate
# Watch the move task's progress
aws sqs list-message-move-tasks --source-arn "$DLQ_ARN" \
--query 'Results[].[Status,ApproximateNumberOfMessagesMoved,ApproximateNumberOfMessagesToMove]' \
--output table
--max-number-of-messages-per-second is the throttle that prevents a replay storm from overwhelming a downstream that just recovered. Omitting it moves as fast as possible. To redirect messages to a different queue instead of the original source, pass --destination-arn. Never redrive before the fix is deployed, or the messages fail straight back into the DLQ.
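When picking a rate cap, it can help to estimate how long the replay would take. The arithmetic below is a rough sketch that assumes the task sustains the cap for the whole run, which SQS does not promise; redrive_eta_seconds is a hypothetical helper name.

```shell
# Hypothetical helper: rough lower bound on replay duration in seconds.
# $1 = messages in the DLQ, $2 = --max-number-of-messages-per-second value
redrive_eta_seconds() {
  # Integer ceiling of depth / rate
  echo $(( ($1 + $2 - 1) / $2 ))
}

# 12000 quarantined messages at 50 per second: about 240 seconds
redrive_eta_seconds 12000 50
```

If the estimate is longer than the downstream can comfortably absorb, lowering the cap trades replay time for headroom.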
Step 7 — Alarm on DLQ Depth
A DLQ nobody watches is a silent data-loss bug. Alarm on ApproximateNumberOfMessagesVisible for the DLQ so any quarantine pages you.
# cloudwatch.tf — page when the DLQ holds any messages
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
alarm_name = "orders-dlq-not-empty"
namespace = "AWS/SQS"
metric_name = "ApproximateNumberOfMessagesVisible"
dimensions = { QueueName = aws_sqs_queue.dlq.name }
statistic = "Maximum"
period = 60
evaluation_periods = 5
threshold = 0
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [aws_sns_topic.alerts.arn] # assumes an aws_sns_topic named "alerts" is defined elsewhere
alarm_description = "Messages quarantined in orders-dlq — triage before redriving."
}
The same alarm via CLI:
aws cloudwatch put-metric-alarm \
--alarm-name orders-dlq-not-empty \
--namespace AWS/SQS --metric-name ApproximateNumberOfMessagesVisible \
--dimensions Name=QueueName,Value=orders-dlq \
--statistic Maximum --period 60 --evaluation-periods 5 \
--threshold 0 --comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching \
--alarm-actions "$ALERTS_TOPIC_ARN"
This is the AWS-native counterpart to the metric-backend alerting covered under Observability & Monitoring for Job Queues.
Verification
Confirm the policy is attached and behaves as configured. First, read it back:
# The RedrivePolicy attribute should echo your DLQ ARN and maxReceiveCount
aws sqs get-queue-attributes \
--queue-url "$(aws sqs get-queue-url --queue-name orders --output text)" \
--attribute-names RedrivePolicy \
--query 'Attributes.RedrivePolicy' --output text
Then prove the routing end to end: send a message the consumer is guaranteed to reject, let it cycle maxReceiveCount times, and watch it appear in the DLQ.
# After ~5 failed receives, the DLQ depth should increment to 1
aws sqs get-queue-attributes \
--queue-url "$(aws sqs get-queue-url --queue-name orders-dlq --output text)" \
--attribute-names ApproximateNumberOfMessages \
--query 'Attributes.ApproximateNumberOfMessages' --output text
A non-zero DLQ count plus a triggered orders-dlq-not-empty alarm confirms the policy, the destination, and the alerting are all wired correctly.
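The read-back check can also be scripted so it fails loudly in CI. The sketch below does plain string matching on the RedrivePolicy value and assumes SQS returns it as compact JSON without extra whitespace; the function name policy_matches is hypothetical, and a real JSON parser would be more robust.

```shell
# Hypothetical helper: succeed if a RedrivePolicy string names the expected
# DLQ ARN and ceiling. $1 = policy JSON, $2 = expected ARN, $3 = expected count
policy_matches() {
  policy="$1"
  arn="$2"
  max="$3"
  case "$policy" in
    *"\"deadLetterTargetArn\":\"$arn\""*) ;;
    *) return 1 ;;
  esac
  # Accept the count as a JSON number or as a quoted string
  case "$policy" in
    *"\"maxReceiveCount\":$max"[,}]*) return 0 ;;
    *"\"maxReceiveCount\":\"$max\""*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with a made-up policy string
sample='{"deadLetterTargetArn":"arn:aws:sqs:us-east-1:123456789012:orders-dlq","maxReceiveCount":5}'
if policy_matches "$sample" "arn:aws:sqs:us-east-1:123456789012:orders-dlq" 5; then
  echo "redrive policy OK"
fi
```

In practice the first argument would be the output of the get-queue-attributes call shown earlier in this section.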
Gotchas & Edge Cases
Type mismatch between source and DLQ. A FIFO queue can only dead-letter to a FIFO DLQ, and a standard queue to a standard DLQ. Mixing them makes set-queue-attributes reject the redrive policy.
Retention shorter on the DLQ than the source. On standard queues, message age is preserved across the move: a message does not get a fresh retention clock when it lands in the DLQ (FIFO queues reset the enqueue timestamp on the move). If the DLQ retention is short, quarantined messages can expire and disappear before you triage. Set the DLQ to the 14-day maximum.
maxReceiveCount counts receives, not failures. A message picked up but not deleted because the worker crashed (or the visibility timeout was too short) still increments the count. A too-short visibility timeout will march valid messages into the DLQ; fix the timeout, not the ceiling — see the visibility timeout deep dive.
Redriving before fixing. StartMessageMoveTask cheerfully replays poison messages straight back into the DLQ if the root cause is unresolved. Deploy the fix, validate on one message, then redrive with a rate cap. The RabbitMQ equivalent procedure and its idempotency safeguards are in Replaying dead-lettered messages in RabbitMQ.
The DLQ with its own redrive policy. Do not attach a RedrivePolicy to the DLQ — it should be terminal. Otherwise a message can be dead-lettered out of the DLQ and lost.
Related
- Dead-Letter Queues & Poison-Message Handling — the design concepts behind redrive policies.
- Replaying Dead-Letter Messages in RabbitMQ — the equivalent quarantine-and-replay workflow on RabbitMQ.
- Queue Fundamentals & Architecture — broader queue topology and routing concepts.
- Visibility Timeout Deep Dive — why a bad timeout inflates the receive count and floods the DLQ.
- Scaling Queue Partitions in AWS SQS — companion SQS scaling techniques.