Celery Task Retry & Error Handling
This guide expands on the retry primitives introduced in Celery Architecture & Configuration and the broader patterns in Backend Frameworks & Worker Scaling, turning ad-hoc self.retry() calls into a deterministic, observable failure-handling strategy.
A naive Celery task that calls an external API will eventually hit a ConnectionError, a 429 Too Many Requests, or a worker that gets SIGKILLed mid-execution by Kubernetes. Without explicit retry policy, transient failures become permanent task loss, and a single poison payload can be redelivered forever, saturating a queue. The symptom is usually a flood of identical tracebacks in the logs followed by missing side effects — emails never sent, payments never captured — with no record of where the work went. This page builds a layered policy: declarative autoretry for transient errors, capped exponential backoff with jitter to avoid retry storms, durable acknowledgment so crashes don't drop work, and an explicit dead-letter path for tasks that exhaust their budget.
Prerequisites
- A working Celery 5.x app with a broker and result backend configured (see Setting up Celery with Redis broker and RabbitMQ backend).
- RabbitMQ as broker if you want native dead-letter exchange routing (Redis works for the application-level fallback shown below).
celery>=5.3, Python 3.9+, and the ability to restart workers to pick up config changes.- Tasks designed to be idempotent — retries assume re-execution is safe. Review preventing duplicate job execution with idempotency before enabling aggressive retries.
Step 1 — Declarative retries with autoretry_for
Manually wrapping every task body in try/except self.retry() is repetitive and error-prone. The autoretry_for argument on the task decorator tells Celery which exception classes should trigger an automatic retry, so the task body stays focused on business logic.
# tasks.py
from celery import shared_task
from requests.exceptions import ConnectionError, Timeout
@shared_task(
bind=True,
autoretry_for=(ConnectionError, Timeout), # only retry transient network errors
max_retries=5,
)
def fetch_profile(self, user_id):
resp = http_client.get(f"/users/{user_id}", timeout=10)
resp.raise_for_status()
return resp.json()
Be deliberate about which exceptions you list. A ValueError from a malformed payload will never succeed on retry — it is a poison message, and retrying it just wastes worker capacity. Reserve autoretry_for for errors that are genuinely transient (network, rate limits, lock contention).
Step 2 — Exponential backoff with jitter
Retrying immediately hammers a struggling downstream service and can turn a brief blip into a self-inflicted outage. retry_backoff enables exponential delay (2s, 4s, 8s, …), retry_backoff_max caps the ceiling, and retry_jitter randomizes the delay to prevent thundering-herd synchronization when many tasks fail at once.
# tasks.py
@shared_task(
bind=True,
autoretry_for=(ConnectionError, Timeout),
retry_backoff=2, # first retry waits ~2s, then 4s, 8s, 16s ...
retry_backoff_max=600, # never wait longer than 10 minutes between retries
retry_jitter=True, # randomize delay in [0, computed_delay] to spread load
max_retries=8,
)
def sync_to_crm(self, record_id):
crm.upsert(record_id)
With retry_jitter=True, the actual countdown is a random value between zero and the computed exponential delay, so 1,000 tasks that all failed against the same API at the same instant won't all retry in lockstep.
Step 3 — Fine-grained control with retry_kwargs and explicit self.retry()
When you need different behavior per call site, or want to attach a custom countdown computed at runtime, raise the retry explicitly. retry_kwargs sets defaults at the decorator level; an explicit self.retry() overrides them for a specific failure.
# tasks.py
@shared_task(
bind=True,
retry_kwargs={"max_retries": 3}, # default budget for this task
)
def charge_card(self, payment_id):
try:
gateway.charge(payment_id)
except gateway.RateLimited as exc:
# honor the gateway's Retry-After header instead of a fixed schedule
raise self.retry(exc=exc, countdown=int(exc.retry_after))
except gateway.CardDeclined as exc:
# permanent business failure — do not retry, surface to caller
raise exc
The distinction matters: RateLimited is transient and gets a server-directed countdown, while CardDeclined is terminal and propagates immediately rather than burning the retry budget.
Step 4 — Survive worker crashes with acks_late and reject_on_worker_lost
By default Celery acknowledges a message to the broker before the task runs. If the worker is killed mid-task (OOM, pod eviction, SIGKILL), that message is already gone and the work is lost. Setting task_acks_late=True defers the acknowledgment until the task finishes, so a crash leaves the message on the broker for redelivery. Pair it with task_reject_on_worker_lost=True so the message is requeued rather than silently dropped when the worker connection is severed.
# celeryconfig.py
task_acks_late = True # ack only after the task completes
task_reject_on_worker_lost = True # requeue (not discard) if the worker dies mid-task
worker_prefetch_multiplier = 1 # don't hoard messages a crashed worker can't finish
This trades at-most-once for at-least-once delivery — the task may now run twice if a worker dies after doing its work but before acking. That is exactly why Step 0's idempotency requirement is non-negotiable.
Step 5 — Run cleanup logic with on_failure and after_return hooks
When retries are finally exhausted, you want a single place to emit metrics, alert, and capture context. Subclass celery.Task and override on_failure, which fires after the last retry fails.
# base.py
from celery import Task
class TrackedTask(Task):
def on_failure(self, exc, task_id, args, kwargs, einfo):
# fires once, after the final retry has been exhausted
metrics.increment("celery.task.failed", tags={"task": self.name})
logger.error(
"task permanently failed",
extra={"task_id": task_id, "task": self.name, "args": args, "exc": repr(exc)},
)
super().on_failure(exc, task_id, args, kwargs, einfo)
def on_retry(self, exc, task_id, args, kwargs, einfo):
metrics.increment("celery.task.retried", tags={"task": self.name})
super().on_retry(exc, task_id, args, kwargs, einfo)
Wire the base class into the decorator so every tracked task inherits the hooks:
# tasks.py
@shared_task(base=TrackedTask, bind=True, autoretry_for=(ConnectionError,), max_retries=5)
def deliver_webhook(self, endpoint, payload):
http_client.post(endpoint, json=payload, timeout=15)
Step 6 — Route exhausted tasks to a dead-letter queue
Tasks that exhaust their retries should not vanish — they belong in a quarantine queue for inspection and replay. The cleanest approach with RabbitMQ is a broker-level dead-letter exchange, but you can also forward to a dedicated queue from the on_failure hook, which works with any broker including Redis. The concepts here are covered in depth under Dead-letter queues & poison messages.
First, declare a quarantine queue with RabbitMQ's dead-letter routing so messages that are rejected (not requeued) flow to it automatically:
# celeryconfig.py
from kombu import Queue, Exchange
task_queues = (
Queue(
"default",
Exchange("default"),
routing_key="default",
queue_arguments={
"x-dead-letter-exchange": "dlx", # rejected messages go here
"x-dead-letter-routing-key": "dead",
},
),
Queue("dead_letter", Exchange("dlx"), routing_key="dead"),
)
For an application-level path that is broker-agnostic, forward the failed payload from the failure hook to an explicit dead-letter task:
# base.py
class DLQTask(Task):
def on_failure(self, exc, task_id, args, kwargs, einfo):
quarantine_task.apply_async(
kwargs={
"original_task": self.name,
"args": args,
"kwargs": kwargs,
"error": repr(exc),
"traceback": str(einfo),
},
queue="dead_letter",
)
super().on_failure(exc, task_id, args, kwargs, einfo)
@shared_task(queue="dead_letter")
def quarantine_task(**failed):
# persist for manual inspection / scheduled replay
DeadLetter.objects.create(**failed)
Run a dedicated worker that consumes only dead_letter so quarantined work never competes with live traffic for capacity:
# start a low-concurrency worker drained off the dead-letter queue
celery -A myapp worker --queues=dead_letter --concurrency=1 --loglevel=INFO
Verification
Confirm the policy behaves as intended before trusting it in production.
Inspect the computed retry schedule and exception filter at runtime:
# python shell
from myapp.tasks import sync_to_crm
print(sync_to_crm.max_retries) # -> 8
print(sync_to_crm.retry_backoff) # -> 2
print(sync_to_crm.retry_jitter) # -> True
print(sync_to_crm.autoretry_for) # -> (ConnectionError, Timeout)
Force a transient failure and watch the eventing stream to see retries fire and the message land in quarantine:
# stream task events; look for task-retried then task-failed, then a dead_letter enqueue
celery -A myapp events --dump
Assert the dead-letter path in an integration test:
# tests/test_retry.py
def test_exhausted_task_is_quarantined(monkeypatch):
monkeypatch.setattr("myapp.tasks.crm.upsert", _always_raises(ConnectionError))
sync_to_crm.apply(args=[42]) # runs eagerly through all retries
assert DeadLetter.objects.filter(original_task="myapp.tasks.sync_to_crm").exists()
Gotchas & edge cases
autoretry_forcatches subclasses. ListingExceptionwill retry everything, including programming bugs and poison payloads, masking permanent failures as transient ones. List the narrowest exception classes that are actually recoverable.max_retries=Nonemeans infinite. A poison message with no retry cap and no dead-letter exit will loop forever, consuming a worker slot indefinitely. Always set a finitemax_retriesand a dead-letter path together.acks_lateplus a longvisibility_timeoutmasks crashes. With Redis as broker, a dead worker's message isn't redelivered until the visibility timeout elapses. Keep the timeout aligned with your p99 task duration or recovery will feel stuck.on_failureruns only after the final retry. It does not fire on intermediate retries — useon_retryfor per-attempt instrumentation. Counting failures inon_retrywill over-report.retry_backoffis ignored when you callself.retry(countdown=...). An explicit countdown wins, so don't expect the exponential schedule if your code overrides it per call.
Related
- Celery Architecture & Configuration — the broker, backend, and acknowledgment settings that retries depend on.
- Setting up Celery with Redis broker and RabbitMQ backend — broker setup that determines whether native dead-letter exchanges are available.
- Celery Beat periodic task scheduling — sibling guide for scheduling the replay and cleanup jobs that drain a dead-letter queue.
- Dead-letter queues & poison messages — broker-level patterns for quarantining and replaying failed work.
- Backend Frameworks & Worker Scaling — how retry policy interacts with concurrency and horizontal scaling.