Scheduled and Delayed Jobs

Most jobs run as soon as a worker is free, but a large class of work is deliberately time-shifted: send the reminder email in 24 hours, retry the failed webhook in 5 minutes, run the nightly report at 02:00, charge the subscription on the renewal date. These are delayed jobs (run once, after a delay or at a specific time) and recurring jobs (run repeatedly on a schedule). This guide sits within Queue Fundamentals & Architecture and explains how brokers actually implement time-based delivery, where the sharp edges are, and how to keep schedules correct under failure.

The core mechanic to internalize is that a delayed job is not "running and sleeping." A worker holding a job for an hour wastes a thread and breaks redelivery semantics. Instead, the job is stored somewhere with a due time and is not made available to any worker until that time arrives. The interesting differences between systems are entirely in how that "store until due" step is built.
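To make the "store until due" idea concrete, here is a minimal in-memory sketch. It is not how any particular broker is built; the `DelayedStore` class and its method names are hypothetical, and a heap stands in for whatever durable structure a real broker uses.

```python
import heapq
import itertools


class DelayedStore:
    """In-memory stand-in for a broker's 'store until due' step."""

    def __init__(self):
        self._heap = []                # entries are (due_ts, seq, job), earliest due first
        self._seq = itertools.count()  # tie-breaker so job payloads are never compared

    def enqueue(self, job, due_ts):
        # The job is recorded with its due time; no worker sees it yet.
        heapq.heappush(self._heap, (due_ts, next(self._seq), job))

    def pop_due(self, now):
        """Return every job whose due time has arrived; later jobs stay hidden."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


store = DelayedStore()
store.enqueue("send-reminder", due_ts=100.0)
store.enqueue("retry-webhook", due_ts=30.0)

early = store.pop_due(now=10.0)   # nothing is due yet
mid = store.pop_due(now=30.0)     # only the webhook retry is due
late = store.pop_due(now=100.0)   # the reminder becomes due last
```

No thread waits anywhere in this sketch: the only cost of a pending job is one heap entry, and workers only ever see what `pop_due` hands over.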

Two distinct concepts are worth separating up front:

  • Delayed delivery — a one-shot job that becomes runnable at a future moment. Celery expresses this as countdown (seconds from now) or eta (an absolute timestamp).
  • Recurring schedules — a job that fires on a repeating cron-like pattern, which requires a scheduler process that emits new jobs at each tick. This is a different responsibility from the queue itself.

Figure: Lifecycle of a delayed job. At T0 the job is enqueued with due = T0 + delay. It sits in a delayed store, hidden from all workers, for the whole delay window. At the due time it moves to the ready queue, where a worker dequeues and runs it. No worker thread is held during the delay.

How brokers implement delay

There is no single mechanism — each system makes a different trade-off between maximum delay, precision, and operational simplicity.

AWS SQS offers DelaySeconds, set per-message or as a queue default. It is dead simple but capped at 15 minutes. Anything longer than that cannot be expressed natively; the common workarounds are to chain delays, store the due time externally and re-enqueue, or use EventBridge Scheduler / Step Functions for longer waits. The 15-minute ceiling is the single most important SQS scheduling fact to remember.

Redis sorted sets are the workhorse for arbitrary-length delays. You ZADD the job into a sorted set with the due timestamp as the score, then a poller periodically runs ZRANGEBYSCORE key -inf <now> to find everything now due and atomically moves it to a ready list. There is no built-in maximum delay, and precision is whatever your poll interval allows. Sidekiq's scheduled set and many hand-rolled schedulers work this way. A full build is walked through in implementing delayed jobs with Redis sorted sets.
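The ZADD and poll steps can be sketched as follows. This is a minimal illustration, not a complete scheduler: the key names and function names are hypothetical, `client` is assumed to be a redis-py style client (for example `redis.Redis()`), and the Lua script is one possible way to make the move atomic.

```python
import json
import time

DELAYED_KEY = "jobs:delayed"  # hypothetical sorted set: member = job, score = due time
READY_KEY = "jobs:ready"      # hypothetical list that workers pop from

# Redis runs a Lua script as a single atomic unit, so a job cannot be
# removed from the delayed set without also landing on the ready list.
PROMOTE_LUA = """
local due = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, ARGV[2])
for _, job in ipairs(due) do
  redis.call('ZREM', KEYS[1], job)
  redis.call('RPUSH', KEYS[2], job)
end
return #due
"""


def schedule(client, payload, due_ts):
    """Store the job with its absolute due time (epoch seconds) as the score."""
    member = json.dumps(payload, sort_keys=True)
    client.zadd(DELAYED_KEY, {member: due_ts})


def promote_due(client, now=None, batch=100):
    """Move up to `batch` due jobs to the ready list; returns how many moved."""
    now = time.time() if now is None else now
    return client.eval(PROMOTE_LUA, 2, DELAYED_KEY, READY_KEY, now, batch)
```

A poller would call `promote_due` once per poll interval. Because sorted-set members are unique, two jobs with byte-identical payloads collapse into one entry, so a real payload would normally carry a job ID.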

RabbitMQ has no native delay in the core broker. The two options are the rabbitmq_delayed_message_exchange plugin, which holds messages in the broker until their x-delay header (in milliseconds, up to 2^32 - 1, roughly 49 days) expires and then routes them, and the older "dead-letter TTL" trick, where a message sits in a holding queue with a per-message TTL and is dead-lettered to the real queue when it expires. The plugin is cleaner; the TTL trick works without plugins but has a quirk: an expired message is only dead-lettered once it reaches the head of the queue, so a short-TTL message stuck behind a long-TTL one is released late.
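The head-of-queue quirk is easy to underestimate, so here is a small simulation of it. This is a simplified model written for illustration only (the `release_times` function is hypothetical and does not talk to RabbitMQ): all messages are published at t=0 into one FIFO holding queue, and a message can only leave once everything ahead of it has left.

```python
from collections import deque


def release_times(messages):
    """messages: list of (name, ttl_seconds), published at t=0 in this order.

    Returns {name: time at which the message leaves the holding queue}.
    """
    queue = deque(messages)
    released = {}
    clock = 0.0
    while queue:
        name, ttl = queue.popleft()
        # A message leaves no earlier than its own TTL, and no earlier
        # than the message that was ahead of it in the queue.
        clock = max(clock, ttl)
        released[name] = clock
    return released


blocked = release_times([("slow", 10.0), ("fast", 1.0)])
in_order = release_times([("fast", 1.0), ("slow", 10.0)])
```

In the first case "fast" asked for a 1 second delay but is held for 10, because "slow" sits in front of it. A common way around this is one holding queue per distinct delay value, so every message in a queue expires in publish order.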

BullMQ has first-class delayed jobs: pass { delay: ms } when adding a job and it lands in a delayed set scored by due time, which BullMQ's own machinery promotes to the wait list when due — the Redis sorted-set pattern, productized.

| Broker | Mechanism | Max delay | Precision driver | Recurring support |
|---|---|---|---|---|
| AWS SQS | DelaySeconds | 15 minutes | Server-side timer | None (use EventBridge) |
| Redis sorted set | ZADD score = due_ts + poller | Unbounded | Poll interval | Build with a scheduler |
| RabbitMQ | Delayed-message plugin or DLX+TTL | About 49 days (plugin) | Plugin timer / TTL granularity | None (external scheduler) |
| BullMQ | delay option | Unbounded | Internal promotion loop | repeat option (cron) |

Delayed delivery in practice

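In Celery, a one-shot delay is a single argument. countdown is relative; eta is absolute and should be a timezone-aware datetime. Celery is the exception to the sorted-set pattern described above: it does not park the message in a delayed store. A worker fetches the task immediately and holds it in memory until the ETA passes. With the Redis transport, an ETA longer than the visibility timeout (one hour by default) can therefore be redelivered and run more than once, so keep Celery delays short or raise the visibility timeout to cover the longest delay you use.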

from datetime import datetime, timedelta, timezone

# run roughly 5 minutes from now
send_reminder.apply_async(args=[user_id], countdown=300)

# run at a specific absolute time (always pass tz-aware datetimes)
run_at = datetime(2026, 7, 1, 9, 0, tzinfo=timezone.utc)
send_invoice.apply_async(args=[account_id], eta=run_at)

The same intent in BullMQ:

await queue.add(
  "send-reminder",
  { userId },
  { delay: 5 * 60 * 1000 }   // 5 minutes in ms; job is hidden until due
);

And on SQS, remembering the 15-minute ceiling:

import boto3
sqs = boto3.client("sqs")
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=body,
    DelaySeconds=600,   # max 900 (15 min) — longer delays need a different approach
)

For delays beyond what a broker supports, the robust pattern is to store the absolute due time with the job and let a sorted-set scheduler own the wait, rather than chaining short delays (which compounds drift and clutters the queue).
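If a long delay does have to pass through a capped broker such as SQS, deriving every hop from the stored absolute due time at least avoids compounding drift. The helper below is a hypothetical sketch of that calculation, not part of boto3; it only computes the value you would pass as DelaySeconds.

```python
import math
from datetime import datetime, timedelta, timezone

SQS_MAX_DELAY_S = 900  # DelaySeconds ceiling (15 minutes)


def next_delay_seconds(due_at, now):
    """Delay for the next hop, recomputed from the absolute due time.

    Rounds up so the job never becomes visible before it is due.
    """
    remaining = (due_at - now).total_seconds()
    return math.ceil(max(0, min(SQS_MAX_DELAY_S, remaining)))


now = datetime(2026, 7, 1, 9, 0, tzinfo=timezone.utc)
far_off = next_delay_seconds(now + timedelta(hours=2), now)     # capped at the ceiling
near = next_delay_seconds(now + timedelta(seconds=90), now)     # fits in one hop
overdue = next_delay_seconds(now - timedelta(seconds=5), now)   # already due
```

A consumer that receives the message before `due_at` would re-send it with the recomputed delay instead of running it. This still clutters the queue with intermediate hops, which is why a sorted-set scheduler remains the cleaner option for long waits.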

Recurring schedules in practice

Recurring work needs a scheduler — a process that wakes on a cadence and enqueues jobs. The defining hazard is that the scheduler is a singleton concern: if two schedulers run, every cron entry fires twice. Celery's scheduler is Celery Beat, configured with a beat_schedule of crontab entries:

from celery import Celery
from celery.schedules import crontab

app = Celery("tasks", broker="redis://localhost:6379/0")
app.conf.timezone = "UTC"   # define the reference clock explicitly

app.conf.beat_schedule = {
    "nightly-report": {
        "task": "reports.generate_nightly",
        "schedule": crontab(hour=2, minute=0),   # 02:00 every day
    },
    "every-15-min-sync": {
        "task": "sync.pull_changes",
        "schedule": crontab(minute="*/15"),
    },
}

Running exactly one Beat instance is the whole game; the patterns for guaranteeing that (single-beat locking, a database-backed scheduler) are detailed in cron-style scheduling with Celery Beat and in the framework-level Celery Beat periodic task scheduling guide. BullMQ folds recurrence into the queue itself via a repeat option with a cron pattern, so no separate scheduler process is required.
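As a rough illustration of single-scheduler locking, the sketch below uses a Redis key as a leader lock. It is not Celery Beat's own mechanism; the key name and function are hypothetical, and `client` is assumed to be a redis-py style client whose `set` accepts `nx` and `ex`.

```python
LOCK_KEY = "scheduler:leader"  # hypothetical key name
LOCK_TTL_S = 30                # lock expires on its own if the holder dies


def try_become_leader(client, instance_id, ttl_s=LOCK_TTL_S):
    """Return True if this instance acquired the scheduler lock.

    SET with NX only succeeds when the key does not exist, and EX makes
    the lock expire so a crashed leader does not block scheduling forever.
    """
    return bool(client.set(LOCK_KEY, instance_id, nx=True, ex=ttl_s))
```

Only the instance that gets True should emit jobs. Renewal is not shown here: the leader has to refresh the key well before the TTL runs out, and the others retry periodically so one of them takes over after a crash.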

Failure modes and recovery

Clock skew. Every delay mechanism compares a due time against "now." If the producer, the broker, and the scheduler disagree on the clock, jobs fire early or late. Always store due times as absolute UTC instants, and compute "now" from a single authority where possible: the broker's clock (for example the Redis TIME command) rather than each worker's local clock. Run NTP everywhere.
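Reading the broker's clock can be as small as the helper below. It is a sketch that assumes a redis-py style client, whose `time()` method returns the server time as a (seconds, microseconds) pair; the function name is hypothetical.

```python
def broker_now(client):
    """Current time according to the Redis server, as epoch seconds."""
    seconds, microseconds = client.time()
    return seconds + microseconds / 1_000_000
```

Using this value both when computing due times and when polling for due jobs means every comparison is made against the same clock, whatever the local clocks on individual hosts say.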

Missed schedules after downtime. If Beat is down at 02:00, the nightly job simply does not fire — Beat does not backfill by default. Decide explicitly whether a missed run should be skipped or caught up. For catch-up semantics, persist the last-run time and have the task itself detect and process the gap, or use a scheduler that records run history.
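For catch-up semantics, the gap detection can be sketched as below, assuming a once-a-day schedule in UTC and a persisted last-run timestamp. The function name is hypothetical and how the last-run time is stored is left out.

```python
from datetime import datetime, timedelta, timezone


def missed_daily_runs(last_run, now, hour=2, minute=0):
    """Return every scheduled instant in (last_run, now], oldest first."""
    candidate = last_run.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= last_run:
        candidate += timedelta(days=1)  # first slot strictly after last_run
    missed = []
    while candidate <= now:
        missed.append(candidate)
        candidate += timedelta(days=1)
    return missed


last_run = datetime(2026, 7, 1, 2, 0, tzinfo=timezone.utc)
now = datetime(2026, 7, 3, 9, 0, tzinfo=timezone.utc)
gap = missed_daily_runs(last_run, now)
```

Here `gap` holds the 02:00 slots for 2 July and 3 July. The task can then process each slot in turn, or only the newest one if skipping is the chosen policy.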

Duplicate runs. A restarted scheduler, an overlapping second scheduler, or an at-least-once enqueue can all fire the same scheduled job twice. Make scheduled tasks idempotent — key the work to its intended fire time (e.g. a date-stamped row) so a second execution is a no-op. This is the same discipline covered in preventing duplicate job execution with idempotency.
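One way to key work to its intended fire time is a uniqueness constraint on (task, fire time). The sketch below uses SQLite from the standard library purely for illustration; the table and function names are hypothetical, and a production system would use its own database.

```python
import sqlite3


def claim_run(conn, task_name, fire_time_iso):
    """Return True only for the first caller to claim this (task, fire time)."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO scheduled_runs (task_name, fire_time) VALUES (?, ?)",
            (task_name, fire_time_iso),
        )
    return cur.rowcount == 1  # 0 means the row already existed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduled_runs ("
    " task_name TEXT NOT NULL,"
    " fire_time TEXT NOT NULL,"
    " PRIMARY KEY (task_name, fire_time))"
)

first = claim_run(conn, "reports.generate_nightly", "2026-07-01T02:00:00Z")
second = claim_run(conn, "reports.generate_nightly", "2026-07-01T02:00:00Z")
```

The task does its work only when `claim_run` returns True, so a duplicate enqueue for the same fire time finds the row already present and exits without side effects.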

Poller stalls (sorted-set schedulers). If the poller that promotes due jobs from the sorted set crashes, delayed jobs silently never become ready. The promotion step must be atomic (so a crash mid-promotion does not lose or duplicate jobs) and the poller must be supervised and monitored for liveness — covered in the sorted-set build guide.

Interaction with redelivery. A delayed job that becomes due is an ordinary job once it reaches the ready queue, so it is subject to the usual visibility timeout and redelivery rules. The delay only governs when it becomes runnable, not what happens during execution.

Performance tuning

  • Match poll interval to required precision. A 1-second poll on a Redis sorted set gives roughly second-level accuracy at the cost of one cheap ZRANGEBYSCORE per second. Don't poll every 10ms for jobs that tolerate seconds of jitter.
  • Promote in batches. When many jobs come due at once, fetch and move them in one ranged operation rather than one round-trip per job.
  • Avoid thundering herds at round times. Jobs all scheduled for :00 create a synchronized spike. Add small random jitter to non-critical schedules to spread load.
  • Keep due times absolute. Storing "run in 3600s" forces re-computation and drifts; storing an absolute UTC instant is stable across restarts and clock corrections.
  • Separate the scheduler from workers. Run Beat (or the poller) as its own supervised process so scaling workers up and down never accidentally spawns or kills the scheduler.
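The jitter advice from the list above can be sketched in a few lines. The function name and the 30-second window are illustrative assumptions; the point is only that the offset is applied to an absolute due time before the job is stored.

```python
import random
from datetime import datetime, timedelta, timezone


def with_jitter(due_at, max_jitter_s, rng=random):
    """Push due_at later by a uniform random offset in [0, max_jitter_s] seconds."""
    return due_at + timedelta(seconds=rng.uniform(0, max_jitter_s))


top_of_hour = datetime(2026, 7, 1, 9, 0, tzinfo=timezone.utc)
spread = [with_jitter(top_of_hour, 30, random.Random(seed)) for seed in range(5)]
```

Jitter only ever delays a job in this sketch, so nothing fires before its nominal time. Passing a seeded `random.Random` keeps tests repeatable.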

FAQ

What is the difference between a delayed job and a scheduled job? A delayed job runs once at a future time (a relative countdown or an absolute due time). A scheduled (recurring) job fires repeatedly on a cron-like pattern and requires a scheduler process to emit each occurrence. Delay is a property of one message; recurrence is a property of a schedule.

Why can't I just sleep inside a worker until the job is due? Sleeping holds a worker thread idle for the entire delay, which wastes capacity and breaks redelivery — if that worker dies, the broker may redeliver and the timing is lost. The correct approach stores the job with a due time and makes it runnable only when due, leaving the worker free in the meantime.

How do I delay a job for longer than SQS's 15-minute limit? SQS DelaySeconds caps at 900 seconds. For longer waits, store the absolute due time and use a sorted-set scheduler, EventBridge Scheduler, or Step Functions Wait states, rather than chaining 15-minute delays which compounds drift.

How do I stop a recurring job from running twice? Run exactly one scheduler instance (using a lock or a database-backed scheduler) and make the task idempotent by keying its work to the intended fire time, so a duplicate enqueue becomes a no-op.
