Rate Limiting and Throttling Async Jobs

A queue smooths bursty producers into a steady stream of work, but the consumers on the other side often hit something that has its own limits: a third-party API capped at 100 requests per second, a database that degrades past a certain write rate, or a billing system that charges per call. Without an explicit rate limit, a fleet of workers will happily drain the backlog as fast as it can and overwhelm whatever sits downstream. This guide is part of the broader Queue Fundamentals & Architecture collection and focuses on how to deliberately slow consumers down to a sustainable, fair, and cost-controlled pace.

Throttling job execution is not about throwing away work — the queue still holds every message — it is about controlling the rate at which work leaves the queue and reaches a constrained resource. Three forces usually drive the requirement:

  • Protecting downstream dependencies. External APIs return 429 Too Many Requests and may ban your credentials if you ignore their published limits. Databases and search clusters have a throughput ceiling beyond which latency climbs sharply.
  • Fairness across tenants. In a multi-tenant system, one customer enqueuing a million jobs should not starve everyone else. Per-tenant limits keep the shared worker pool equitable.
  • Cost. Many APIs and cloud services bill per request. A runaway retry loop or a backfill job can generate a surprise invoice unless the rate is bounded.

The reasoning here pairs naturally with the producer-consumer pattern: rate limiting is a control applied at the consumer edge, deciding when a worker is allowed to pull and execute the next job.

Token bucket vs leaky bucket Side-by-side diagram. Token bucket holds tokens refilled at a steady rate; a job consumes a token to run, allowing bursts up to bucket capacity. Leaky bucket holds queued jobs and releases them at a fixed drip rate, smoothing output with no bursts. Token Bucket Leaky Bucket refill: N tokens / sec capacity = burst limit job takes 1 token → runs no token → wait allows bursts inflow: variable / bursty drips out at fixed rate overflow → dropped/queued smooths output

Choosing an algorithm

The behavior you get depends entirely on the algorithm you pick. Four are common, and they differ in how they treat bursts and how much state they require.

A token bucket holds a number of tokens up to a fixed capacity and refills them at a steady rate (say 100 tokens per second). Each job must take one token to run; if the bucket is empty the worker waits. Because unused tokens accumulate up to the capacity, the token bucket permits bursts — an idle period banks tokens that a sudden surge can spend. This is the right model when the downstream system tolerates short spikes but enforces an average rate, which describes most public APIs.

A leaky bucket inverts the picture: jobs flow into a buffer and leak out at a constant rate, like water through a hole. Output is perfectly smooth and never bursts, which is ideal when the downstream resource cannot absorb spikes at all (a legacy system, a fragile database). The cost is that a burst either queues behind the leak or overflows and is rejected.

A fixed window counter simply counts jobs per calendar interval — 1000 per minute, reset at the top of each minute. It is trivial to implement with a single counter and TTL, but it allows a double-rate burst at the boundary: 1000 jobs at 11:00:59 and another 1000 at 11:01:00 means 2000 in two seconds. A sliding window fixes that by tracking timestamps (or a weighted blend of the current and previous window) so the limit holds across any rolling interval, at the price of more state.

| Algorithm | Bursts | State required | Boundary accuracy | Best for | ||---|---|---|---| | Token bucket | Allowed up to capacity | Token count + last-refill timestamp | High | API limits that allow short spikes | | Leaky bucket | Never (smoothed) | Queue + last-leak timestamp | High | Fragile downstreams needing constant rate | | Fixed window | Allowed at boundaries | Single counter + TTL | Low (2x boundary spike) | Coarse, cheap limits | | Sliding window | Controlled | Per-request timestamps or weighted counters | High | Strict rolling-rate enforcement |

For most queue workloads, the token bucket is the default recommendation: it maps cleanly onto "X requests per second on average, with a burst of Y," which is exactly how rate-limited APIs publish their contracts.

Where the limit lives: scope

An algorithm only describes how you count. You also have to decide what you count against — the scope of the limiter. The same downstream limit demands different scoping depending on your topology.

A per-worker limit lives entirely inside one process. It is the simplest possible setup (no shared state, no network round-trip) but it does not compose: if each of 10 workers limits itself to 10 requests per second, the downstream sees 100 per second. Per-worker limits are only correct when there is a single worker, or when the downstream limit is genuinely per-connection.

A per-queue limit caps the rate at which a particular queue is drained regardless of how many workers serve it. Frameworks like BullMQ implement this natively, and it is the cleanest abstraction when one queue maps to one downstream dependency.

A global (or per-tenant) limit is enforced across the entire fleet, which requires shared coordination state — almost always Redis. This is the only scope that correctly protects a downstream API behind a horizontally scaled worker pool, and it is the scope that enables fair multi-tenant sharing.

   per-worker            per-queue                global / per-tenant
  ┌─────────┐          ┌─────────┐               ┌─────────┐  ┌─────────┐
  │ worker  │          │ worker  │               │ worker  │  │ worker  │
  │ [limit] │          └────┬────┘               └────┬────┘  └────┬────┘
  └────┬────┘               │                         └─────┬──────┘
       │              ┌─────┴─────┐                  ┌───────┴────────┐
       ▼              │ queue rate│                  │  Redis limiter │
  downstream          │  limiter  │                  │ (shared count) │
                      └─────┬─────┘                  └───────┬────────┘
                            ▼                                ▼
                       downstream                       downstream

Because workers scale horizontally, any limit that must hold in aggregate has to be distributed. That pushes the counting logic into a shared store.

Distributed rate limiting with Redis

Redis is the standard substrate for distributed limiting because it is fast, already present in most queue stacks, and supports atomic server-side scripts. The critical requirement is atomicity: checking the remaining allowance and decrementing it must be a single indivisible operation, or two workers will both read "1 token left," both decrement, and both proceed — a classic check-then-act race.

A Lua script gives you that atomicity for free, because Redis executes a script as a single blocking unit. The token-bucket script below refills the bucket lazily based on elapsed time, then consumes a token if one is available, returning whether the caller may proceed.

-- token_bucket.lua
-- KEYS[1] = bucket key
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = now (unix seconds, fractional)
-- ARGV[4] = requested tokens (usually 1)
local capacity   = tonumber(ARGV[1])
local refill     = tonumber(ARGV[2])
local now        = tonumber(ARGV[3])
local requested  = tonumber(ARGV[4])

local data   = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(data[1])
local ts     = tonumber(data[2])

if tokens == nil then          -- first use: start full
  tokens = capacity
  ts = now
end

-- lazily add the tokens accrued since the last call
local delta = math.max(0, now - ts)
tokens = math.min(capacity, tokens + delta * refill)

local allowed = tokens >= requested
if allowed then
  tokens = tokens - requested
end

redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- expire idle buckets so memory does not leak
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / refill) * 2)

return allowed and 1 or 0

Calling it from a worker is a single round-trip. The key is scoped to whatever granularity you need — one global key for a shared API, or one key per tenant for fairness.

import time
import redis

r = redis.Redis()
with open("token_bucket.lua") as f:
    consume = r.register_script(f.read())

def allow(bucket_key: str, capacity: int, refill_per_sec: float) -> bool:
    return bool(
        consume(
            keys=[bucket_key],
            args=[capacity, refill_per_sec, time.time(), 1],
        )
    )

# In the worker, before calling the downstream API:
while not allow("ratelimit:stripe-api", capacity=100, refill_per_sec=100):
    time.sleep(0.05)   # back off and retry; the job stays held
call_stripe()

This "lazy refill" technique avoids any background process topping up buckets — the math happens on read, so a bucket that is never touched costs nothing. The EXPIRE line keeps idle per-tenant keys from accumulating forever.

For a leaky bucket you would instead store the queue length and a last-leak timestamp, subtracting elapsed * leak_rate on each call; for a sliding window you would use a sorted set keyed by timestamp and ZREMRANGEBYSCORE to drop entries older than the window before counting. The token-bucket variant is shown in full because it is the most broadly applicable. A complete, productionized version applied to Python tasks is covered in token-bucket rate limiting for Celery tasks.

Framework support

You rarely have to build this from scratch — every major queue framework ships some form of rate limiting, though the scope and algorithm differ.

Celery exposes a per-task rate_limit string such as "100/m" or "10/s". It is enforced per worker process, using a token-bucket-like mechanism internally, which means it does not give you a true fleet-wide limit on its own. For coordinated limits across many workers you layer a Redis limiter on top, as shown above. Celery's configuration lives alongside the rest of your Celery architecture and configuration.

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task(rate_limit="100/m")   # per-worker, not global
def call_partner_api(payload):
    ...

BullMQ offers a per-queue limiter with max jobs per duration window, enforced across all workers consuming that queue through Redis — so it is a true shared limit. It is the cleanest framework-native option for fleet-wide throttling. See BullMQ for Node.js ecosystems for the surrounding setup.

import { Worker } from "bullmq";

const worker = new Worker("emails", processor, {
  connection,
  limiter: {
    max: 100,        // at most 100 jobs
    duration: 1000,  // per 1000 ms, shared across all workers on this queue
  },
});

Sidekiq does not include throttling in the open-source core; the commercial Sidekiq Enterprise adds concurrent and bucket-based limiters, and the community sidekiq-throttled gem provides concurrency and threshold limits keyed by job class or a custom key, backed by Redis.

require "sidekiq/throttled"

class PartnerSyncJob
  include Sidekiq::Job
  include Sidekiq::Throttled::Job

  sidekiq_throttle(
    threshold: { limit: 100, period: 1.minute, key_suffix: ->(account_id) { account_id } }
  )

  def perform(account_id)
    # ...
  end
end

Failure modes and recovery

Limiter outage. If the Redis backing your limiter is unreachable, you must decide fail-open or fail-closed. Failing open (proceed without limiting) protects job throughput but risks hammering the downstream; failing closed (block all jobs) protects the downstream but stalls the queue. For limits that exist to prevent bans or runaway cost, fail closed with a short retry; for soft fairness limits, fail open with an alert.

Clock skew. The lazy-refill math depends on a timestamp. If you pass the worker's clock to the script, skew between machines makes the rate drift. Use redis.call('TIME') inside the Lua script instead of an ARGV timestamp so every refill is computed against the single Redis clock.

Hot-key contention. A single global limiter key serializes every worker through one Redis slot. At very high rates this becomes a bottleneck. Shard the limit (e.g. give each of 4 shards 25 tokens/sec and hash workers to shards) to spread the load, accepting slightly looser aggregate enforcement.

Starvation under tight limits. When the limit is far below demand, jobs that poll-and-sleep can livelock or pile up retries. Prefer delaying the job back into the queue with a computed delay (using scheduled and delayed jobs) over busy-waiting inside the worker, so the worker thread is freed for other work.

Performance tuning

  • Co-locate the limiter with the broker. If you already run Redis as your broker, run the limiter on the same cluster (a separate logical database) to avoid a second network hop.
  • Batch token consumption. A job that makes 5 downstream calls should request 5 tokens in one script call, not loop five times.
  • Right-size burst capacity. Set bucket capacity to the largest burst the downstream tolerates, not higher — excess capacity defeats the smoothing the limiter exists to provide.
  • Prefer delay-requeue over sleep. Sleeping holds a worker thread idle; requeuing with a delay returns it to the pool. Reserve in-process sleep for sub-second waits.
  • Monitor the wait, not just the rate. Track how long jobs spend waiting on a token. Rising wait time means demand has outgrown the limit and the backlog will grow.

FAQ

Should I rate limit at the producer or the consumer? Almost always the consumer. The whole point of a queue is to let producers enqueue freely and absorb bursts; limiting belongs at the consumer edge where work meets the constrained resource. Limiting the producer throws away the queue's buffering benefit.

What is the difference between rate limiting and concurrency limiting? A rate limit caps how many jobs run per unit of time; a concurrency limit caps how many run simultaneously. They solve different problems — rate protects a per-second API quota, concurrency protects a connection pool or memory budget — and you often apply both. Concurrency tuning is covered under worker scaling rather than here.

Why use a Lua script instead of INCR with EXPIRE? INCR+EXPIRE implements a fixed window and suffers the 2x boundary burst. More importantly, any multi-step check (read remaining, decide, decrement) across separate commands is not atomic and races between workers. A Lua script runs as one indivisible operation, eliminating the race and letting you express token-bucket or sliding-window logic correctly.

Does Celery's rate_limit give me a global limit? No. Celery's rate_limit is enforced per worker process. Ten workers with "10/s" produce up to 100 per second in aggregate. For a true fleet-wide limit, add a Redis-backed limiter as described above.

Related