Observability & Monitoring for Job Queues

Async job systems fail silently. A web request that errors returns a 500 to a user who retries; a task that stalls in a queue produces no response at all, so the only evidence is a slowly growing backlog and a customer wondering why their export never arrived. Observability is what converts that invisible degradation into a signal you can alert on, page on, and debug before it becomes an incident.

This guide covers how to instrument and monitor distributed worker fleets running Celery, BullMQ, and Sidekiq. It builds on the delivery semantics and broker topology described in Queue Fundamentals & Architecture and the scaling models in Backend Frameworks & Worker Scaling. The focus is the four signals that actually predict queue health, the three telemetry types that capture them, and the tooling (Prometheus exporters, Grafana dashboards, Flower, and OpenTelemetry tracing) that turns raw counters into operational confidence.

The Four Signals of Queue Health

Google's golden signals (latency, traffic, errors, saturation) translate cleanly onto task queues, but the queue-specific framing is sharper. Four measurements explain almost every production incident a worker fleet will experience.

  • Queue depth / backlog: the number of jobs waiting to be processed. A backlog that grows monotonically means arrival rate exceeds completion rate; the system is falling behind and will never catch up without intervention. This is the single most important queue signal.
  • Worker saturation: the fraction of worker capacity actively executing jobs. At 100% saturation with a growing backlog, you are throughput-bound and need more workers. At low saturation with a growing backlog, you have a bottleneck downstream (a slow database, an exhausted connection pool, a poison message blocking a partition).
  • Task latency: measured as a distribution, not an average. You need p50 (typical experience), p95 (the tail most users hit occasionally), and p99 (the worst case that drives timeouts and SLO burn). Latency must be split into queue wait time (enqueue → dequeue) and execution time (dequeue → ack), because they have completely different remediations.
  • Failure / retry rate: the proportion of jobs that error and the proportion that are retried. A retry storm consumes throughput silently: workers are busy, saturation looks healthy, but they are re-running the same failing jobs, masking a real backlog.

These four signals interact. The diagnostic value comes from reading them together: high backlog plus high saturation is a scaling problem; high backlog plus low saturation plus high failure rate is a poison-message or downstream-dependency problem. No single metric tells the story.
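
The combined reading described above can be sketched as a small decision function. This is an illustration only: the function name, the 0.9 and 0.05 cut-offs, and the returned strings are assumptions made for the example, not values taken from any framework.

```python
def diagnose(backlog_growing: bool, saturation: float, failure_rate: float) -> str:
    """Classify queue health by reading the signals together.

    saturation and failure_rate are fractions in [0, 1]. The 0.9 and 0.05
    cut-offs are illustrative assumptions, not recommended thresholds.
    """
    if not backlog_growing:
        return "backlog stable"
    if saturation >= 0.9 and failure_rate > 0.05:
        # Workers are busy, but much of the work is failing and being re-run.
        return "retry storm"
    if saturation >= 0.9:
        # Workers are fully busy and still falling behind.
        return "scaling problem: add workers"
    if failure_rate > 0.05:
        # Idle capacity, growing backlog, and many failures.
        return "poison message or downstream dependency"
    return "bottleneck outside the workers"


print(diagnose(True, 0.98, 0.01))
print(diagnose(True, 0.30, 0.20))
```

In practice the inputs would come from the backlog trend, the saturation gauge, and the failure-rate query rather than hand-typed numbers.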

Figure: Queue observability instrumentation data flow. Workers emit metrics to a Prometheus exporter. Prometheus scrapes the exporter on an interval, stores time series, evaluates recording and alerting rules, then serves Grafana dashboards and pushes firing alerts to Alertmanager, which pages on-call.

Metrics vs. Logs vs. Traces

The three pillars of telemetry answer different questions, and queue observability needs all three. Treating them as interchangeable is a common and expensive mistake.

Metrics are pre-aggregated numeric time series: counters, gauges, and histograms. They are cheap to store and fast to query, which makes them the foundation for dashboards and alerting. A metric answers how many and how fast: how many jobs are queued, how long the p99 task takes, what the failure rate is over the last five minutes. Metrics cannot tell you which job failed or why; they have bounded cardinality by design.

Logs are timestamped, structured records of discrete events. A log line answers what happened to this specific job: the job ID, the arguments, the exception traceback, the worker hostname. Logs are high-cardinality and expensive at volume, but indispensable for forensic debugging. Structure them as JSON with consistent field names (job_id, task_name, queue, attempt, duration_ms) so they are queryable.
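
A minimal sketch of such a schema using only the Python standard library follows. The class name JobJsonFormatter, the helper make_job_logger, and the logger name "jobs" are hypothetical names for this example; the field names are the ones listed above.

```python
import json
import logging
import sys


class JobJsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed job schema."""

    FIELDS = ("job_id", "task_name", "queue", "attempt", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def make_job_logger(stream=sys.stdout) -> logging.Logger:
    """Build a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger("jobs")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JobJsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = make_job_logger()
log.info("job finished", extra={"job_id": "a1b2", "task_name": "send_email",
                                "queue": "emails", "attempt": 1, "duration_ms": 412})
```

Emitting every field on every line, with null for missing values, keeps the schema stable so log queries do not need to special-case absent keys.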

Traces capture the causal path of a single request as it crosses service boundaries, including the asynchronous boundary into the queue and back out on a worker. A trace answers where did the time go across the whole lifecycle: the API span that enqueued the job, the time the job spent waiting, the worker span that executed it, and the database spans inside that execution. Tracing across the broker requires propagating trace context through the message, covered in Distributed Tracing for Async Jobs.

The practical division of labour: alert on metrics, debug with traces, and confirm root cause with logs. When backlog crosses a threshold, a metric fires the page. You open the dashboard, see p99 execution time spiking, pull a trace to find the slow database span, then read the worker logs for that job ID to see the exact query. Each tool hands off to the next.

Counters, Gauges, and Histograms for Jobs

Prometheus client libraries offer four metric types; three of them (counters, gauges, and histograms) cover the job instrumentation in this guide, and choosing correctly is the difference between a useful metric and a misleading one. The annotated example below shows the canonical instrumentation for a worker, using the Python prometheus_client.

from prometheus_client import Counter, Gauge, Histogram

# Counter: monotonically increasing total. Use rate() in queries, never the raw value.
JOBS_PROCESSED = Counter(
    "jobs_processed_total",
    "Total jobs that finished, by terminal outcome",
    ["queue", "task", "status"],   # status: success | failure | retry
)

# Gauge: a value that goes up and down. Backlog and in-flight counts are gauges.
QUEUE_DEPTH = Gauge(
    "queue_depth",
    "Jobs currently waiting in the queue (not yet picked up)",
    ["queue"],
)
WORKERS_ACTIVE = Gauge(
    "workers_active",
    "Workers currently executing a job (saturation numerator)",
    ["queue"],
)

# Histogram: bucketed distribution. The way to get latency percentiles that aggregate across workers.
TASK_DURATION = Histogram(
    "task_duration_seconds",
    "Wall-clock execution time per task",
    ["queue", "task"],
    # Buckets must span your real latency range; defaults assume sub-10s HTTP requests.
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300),
)

def run_job(queue, task, fn, *args):
    # WORKERS_ACTIVE is declared with a single "queue" label, so pass only the queue.
    WORKERS_ACTIVE.labels(queue).inc()
    try:
        with TASK_DURATION.labels(queue, task).time():
            fn(*args)
        JOBS_PROCESSED.labels(queue, task, "success").inc()
    except Exception:
        JOBS_PROCESSED.labels(queue, task, "failure").inc()
        raise
    finally:
        WORKERS_ACTIVE.labels(queue).dec()

The label cardinality rule is non-negotiable: never put a job ID, user ID, or any unbounded value in a label. Each unique label combination creates a separate time series, and high cardinality will crater Prometheus memory. Keep labels to bounded dimensions (queue name, task name, status) and push the high-cardinality detail (the actual job ID) into logs and traces instead. The deeper treatment of instrument design and scrape configuration lives in Prometheus Metrics for Workers.
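
The multiplication behind that rule can be made concrete with a rough upper-bound estimate. The helper name series_count and the label counts below are made up for illustration; real series counts are usually lower because not every combination occurs.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical fleet: 5 queues, 40 task names, 3 statuses.
bounded = series_count({"queue": 5, "task": 40, "status": 3})

# The same metric with a user_id label that takes 100,000 values.
unbounded = series_count({"queue": 5, "task": 40, "status": 3, "user_id": 100_000})

print(bounded)    # 600
print(unbounded)  # 60000000
```

A histogram multiplies the figure again, because each label combination carries one series per bucket plus the sum and count series.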

Instrumenting Celery, BullMQ, and Sidekiq

Each framework exposes its internal state differently, and the cleanest instrumentation strategy differs accordingly.

Celery does not expose Prometheus metrics natively. The community standard is celery-exporter, which connects to the broker, listens to Celery's task event stream (task-sent, task-started, task-succeeded, task-failed, task-retried), and translates those events into Prometheus metrics. You enable events with worker_send_task_events = True and run the exporter as a sidecar. The full setup is in Instrumenting Celery with a Prometheus exporter.

# celery_app.py: emit task events so an exporter can observe lifecycle transitions
from celery import Celery

app = Celery("tasks", broker="redis://redis:6379/0")
app.conf.update(
    worker_send_task_events=True,   # publishes task-* events to the event bus
    task_send_sent_event=True,      # also emit task-sent at enqueue time (queue-wait visibility)
    task_track_started=True,        # record the STARTED state in the result backend when execution begins
)
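
To show what an exporter can do with those events, here is a sketch that derives queue wait by pairing task-sent with task-started. It assumes event dictionaries carrying the type, uuid, and timestamp fields that Celery events include; the QueueWaitTracker class is hypothetical, and the events are fed in by hand here rather than read from a broker.

```python
from typing import Optional


class QueueWaitTracker:
    """Derive queue wait (task-sent to task-started) from Celery-style events."""

    def __init__(self) -> None:
        self._sent_at: dict[str, float] = {}   # task uuid -> task-sent timestamp

    def handle(self, event: dict) -> Optional[float]:
        """Return the wait in seconds when a task starts, otherwise None."""
        if event["type"] == "task-sent":
            self._sent_at[event["uuid"]] = event["timestamp"]
            return None
        if event["type"] == "task-started":
            sent = self._sent_at.pop(event["uuid"], None)
            if sent is None:
                return None   # enqueue was never seen, e.g. tracker started late
            return event["timestamp"] - sent
        return None


tracker = QueueWaitTracker()
tracker.handle({"type": "task-sent", "uuid": "abc", "timestamp": 100.0})
wait = tracker.handle({"type": "task-started", "uuid": "abc", "timestamp": 103.5})
print(wait)  # 3.5
```

In a real exporter the returned value would be observed into a histogram. The two timestamps come from different hosts, so clock skew between producer and worker adds error to short waits.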

BullMQ keeps queue state in Redis, so you can read it directly with the queue API and expose your own gauges. queue.getJobCounts() returns waiting, active, delayed, completed, and failed counts in one call β€” exactly the backlog and saturation inputs you need.

// bullmq-metrics.js: scrape BullMQ's Redis-backed counts into Prometheus gauges
import { Queue } from "bullmq";
import { Gauge, register } from "prom-client";

const queue = new Queue("emails", { connection: { host: "redis" } });
const depth = new Gauge({ name: "bull_queue_depth", help: "BullMQ jobs by state", labelNames: ["queue", "state"] });

setInterval(async () => {
  const counts = await queue.getJobCounts("waiting", "active", "delayed", "failed");
  for (const [state, n] of Object.entries(counts)) {
    depth.labels("emails", state).set(n);   // one gauge, separated by state label
  }
}, 15000);   // 15s matches a 15s Prometheus scrape interval; do not poll faster than you scrape

Sidekiq ships its own Web UI and exposes rich stats via Sidekiq::Stats and the Ruby API. For Prometheus, yabeda-sidekiq instruments jobs and queues with minimal configuration; the built-in Web UI gives instant operational visibility without any exporter at all. Sidekiq's queue latency (Sidekiq::Queue#latency, the age of the oldest job) is one of the best out-of-the-box backlog signals in any framework.

Monitoring Tooling Compared

No single tool covers every need. Most production setups combine a real-time operational view (Flower or Sidekiq Web UI) with a metrics-and-alerting backbone (Prometheus + Grafana), and a hosted platform if the team prefers managed infrastructure.

| Tool | Best for | Data model | Alerting | Retention | Cost |
|---|---|---|---|---|---|
| Prometheus + Grafana | Metrics, dashboards, alerting backbone | Pull-based time series | Alertmanager rules | Configurable (weeks to months) | Self-hosted / free |
| Flower | Real-time Celery task & worker inspection | Live event stream | Minimal | Ephemeral (in-memory) | Free, Celery-only |
| Sidekiq Web UI | Real-time Ruby/Sidekiq queue ops | Live Redis reads | None | Live only | Bundled with Sidekiq |
| Datadog | Turnkey APM + metrics + traces, managed | Hosted, push agent | Rich, built-in | Managed (15mo) | Per-host SaaS |

The pragmatic combination for a Celery shop is Prometheus + Grafana for trends and paging, plus Flower for the "what is happening right now, which task is stuck" question that aggregated metrics can never answer. Grafana is the canvas where the four signals come together; building those panels is covered in Grafana Dashboards for Queues.

OpenTelemetry Tracing Across the Async Boundary

Metrics tell you the p99 is slow; only a trace tells you which span inside the job is slow. The hard part of tracing a queue is that the synchronous trace ends when the producer enqueues, and a new one would normally start when the worker dequeues, breaking the causal chain at exactly the boundary you most want to see across.

OpenTelemetry solves this by serializing the active trace context (the traceparent header) into the job payload at enqueue time and restoring it as the parent context at dequeue time. The result is a single trace spanning the API request, the wait in the broker, and the worker execution, with the queue wait time visible as the gap between the enqueue span and the start of the execute span.

# Inject trace context at enqueue, extract at dequeue, to stitch one continuous trace
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def enqueue(payload: dict):
    carrier = {}
    inject(carrier)                 # writes traceparent into the carrier dict
    payload["_otel"] = carrier      # ride the context along inside the message body
    broker.send(payload)            # broker: placeholder for your producer client

def on_message(payload: dict):
    ctx = extract(payload.get("_otel", {}))   # restore producer's context as parent
    with tracer.start_as_current_span("job.execute", context=ctx):
        handle(payload)             # handle: placeholder for your job handler

Spans should bracket each lifecycle stage (job.enqueue, job.dequeue, job.execute) and carry the queue name and task as span attributes. Sampling keeps the cost bounded: tail-based sampling that always keeps traces containing an error is the high-value default. The full propagation mechanics are in Distributed Tracing for Async Jobs.
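
For orientation, the traceparent value that rides in the payload follows the W3C Trace Context layout: version, trace ID, parent span ID, and flags, as hyphen-separated hex fields. The sketch below parses the version 00 layout by hand purely to show what is inside; parse_traceparent and TraceParent are hypothetical names, and production code should rely on the propagator's extract rather than parsing manually.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace id, 16 hex parent id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Parse a traceparent string; return None if it is malformed."""
    match = TRACEPARENT_RE.match(value)
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # The lowest flag bit marks the trace as sampled.
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tp.trace_id, tp.sampled)
```

The trace ID is what ties the worker's span to the producer's, and it is the same value worth writing into the structured logs as trace_id.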

Production Instrumentation Checklist

  1. Emit the four signals from every worker. Backlog gauge, active-workers gauge, a task_duration_seconds histogram with buckets matched to your real latency range, and a jobs_processed_total counter labelled by status. Without all four you are flying with instruments missing.
  2. Expose a /metrics endpoint per worker process or sidecar. For Celery, run celery-exporter against the broker. For BullMQ, run a small Node service that polls getJobCounts(). For Sidekiq, add yabeda-sidekiq.
  3. Set the scrape interval to 15s and never poll faster than you scrape. Match any in-app polling loop (like the BullMQ gauge updater) to the scrape interval to avoid wasted Redis round-trips.
  4. Pin label cardinality. Audit every metric for unbounded labels. Job IDs, user IDs, and URLs belong in logs and traces, not metric labels.
  5. Separate queue-wait latency from execution latency. Instrument task-sent and task-started (Celery) or enqueue/dequeue timestamps so you can tell a scheduling problem from a slow handler.
  6. Add recording rules for expensive queries. Pre-compute job:failure_rate:5m and job:throughput:5m so dashboards and alerts read cheap derived series instead of recomputing rate() over raw counters.
  7. Structure logs as JSON with a stable schema. At minimum job_id, task, queue, attempt, status, duration_ms, and the trace_id so a metric alert links straight to the relevant logs and trace.
  8. Propagate trace context through the broker. Inject traceparent at enqueue and extract at dequeue so the enqueue, the wait in the broker, and the execution all appear in one continuous trace.
  9. Define SLOs before writing alerts. Decide the target (e.g. p99 execution under 5s, backlog drained within 10 minutes) first; alert thresholds derive from the SLO, not from a round number.
  10. Test the alert path end-to-end. Force a backlog in staging and confirm the page actually reaches on-call. An untested alert is not an alert.

Observability, SLOs & Alerting

Alerting on raw queue depth is a beginner's mistake: depth fluctuates with normal traffic, and a static threshold either pages constantly or never fires. Alert on trend and trajectory: a backlog that is growing faster than workers can drain it, or a backlog whose time-to-drain exceeds your SLO.

Define SLOs in terms users feel. A good queue SLO pair: "99% of jobs start execution within 60 seconds of enqueue" (a queue-wait SLO that catches scaling problems) and "99% of jobs complete within their per-task budget" (an execution SLO that catches handler regressions). Alert thresholds then derive from the error budget burn rate, not from arbitrary numbers.
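
The burn-rate arithmetic can be sketched as follows. The function name and the sample counts are illustrative assumptions; the definition used is the common one, where burn rate is the observed bad fraction divided by the error budget (1 minus the SLO target).

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Error-budget burn rate: observed bad fraction divided by the budget.

    1.0 means the budget is being spent exactly at the sustainable pace;
    values above 1.0 exhaust it before the SLO window ends.
    """
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


# Hypothetical window: 300 of 10,000 jobs waited longer than 60s, against a 99% SLO.
rate = burn_rate(bad_events=300, total_events=10_000, slo_target=0.99)
print(round(rate, 2))  # 3.0
```

A burn rate of 3 means the budget for the whole SLO window would be gone in a third of that window if the condition persisted, which is the kind of statement an alert threshold can be derived from.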

The PromQL below shows the core queue alerts. The first catches a backlog the fleet cannot drain; the second catches a retry storm masquerading as healthy saturation; the third catches a p99 execution-latency regression.

# Alert 1: backlog projected to remain after 10 minutes AND throughput too low to drain it.
# predict_linear projects the depth 600s ahead based on the last 30m trend.
# Both sides are aggregated to one series per queue so the label sets match.
predict_linear(queue_depth{queue="default"}[30m], 600) > 0
  and on (queue)
sum by (queue) (rate(jobs_processed_total{queue="default", status="success"}[5m]))
  <
sum by (queue) (rate(jobs_processed_total{queue="default", status="failure"}[5m]))
  + sum by (queue) (queue_depth{queue="default"}) / 600

# Alert 2: failure rate over 5% of completed jobs (a retry storm or downstream outage).
sum(rate(jobs_processed_total{status="failure"}[5m])) by (queue)
  /
sum(rate(jobs_processed_total{status=~"success|failure"}[5m])) by (queue)
  > 0.05

# Alert 3: p99 execution latency from the histogram, compared against the 5s SLO target.
histogram_quantile(
  0.99,
  sum(rate(task_duration_seconds_bucket{queue="default"}[5m])) by (le)
) > 5

Detailed alert tuning, including the for: durations that suppress flapping, and recording-rule design are covered in Alerting on queue backlog with Prometheus.
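
To make the predict_linear projection less opaque, the sketch below fits a least-squares line through depth samples and extends it forward. It is an approximation for intuition only: the helper name project_depth is hypothetical, it projects from the last sample rather than from the rule evaluation time, and the sample data is invented.

```python
def project_depth(samples: list[tuple[float, float]], horizon_s: float) -> float:
    """Fit a least-squares line through (timestamp, depth) samples and
    return the projected depth horizon_s seconds after the latest sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = cov / var                      # jobs per second
    last_t = max(t for t, _ in samples)
    return mean_v + slope * (last_t + horizon_s - mean_t)


# Depth rising by 10 jobs per 60s sample over 30 minutes: 1000, 1010, ..., 1300.
depth = [(60.0 * i, 1000.0 + 10.0 * i) for i in range(31)]
print(round(project_depth(depth, 600)))  # 1400
```

Because the fit uses the whole window, a short spike moves the projection far less than it moves the raw gauge, which is why a trend-based alert flaps less than a static depth threshold.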

FAQ

Why isn't average task latency good enough? Why do I need percentiles? Averages hide the tail that actually hurts users. A task that runs in 50ms for 95% of calls and 30s for the slowest 5% averages about 1.5s, which looks comfortable, but its p99 is a brutal 30s, and it is the p99 that drives timeouts, SLO burn, and the complaints you get paged for. Percentiles come from histograms; a single average gauge cannot reconstruct them after the fact, so you must instrument the histogram up front.
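
The gap between the average and the tail in that example can be checked directly. This sketch uses a simple nearest-rank percentile over raw samples, which is an assumption made for clarity; Prometheus estimates quantiles from histogram buckets instead.

```python
import math
from statistics import mean


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p percent
    of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 95 calls at 50ms and 5 calls at 30s, as in the example above.
durations = [0.05] * 95 + [30.0] * 5

print(round(mean(durations), 2))   # 1.55
print(percentile(durations, 50))   # 0.05
print(percentile(durations, 99))   # 30.0
```

The mean sits between two values that no request actually experienced, while the p50 and p99 each describe a real group of requests.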

Should I alert on absolute queue depth or on its rate of change? Rate of change and projected time-to-drain, not absolute depth. Absolute depth is meaningless without context: 10,000 queued jobs is fine if the fleet clears 50,000/minute (about 12 seconds of work) and catastrophic if it clears 100/minute (100 minutes of work). Use predict_linear over the depth gauge to project whether the backlog will breach your drain-time SLO, and alert on that projection.

Do I need both Flower and Prometheus, or is one enough? They answer different questions and most teams run both. Prometheus and Grafana give you trends, history, and alerting: the "is the system healthy over time" view. Flower gives you live, per-task inspection ("which task is stuck right now, what arguments did it get"), which aggregated metrics deliberately discard. Use Prometheus for paging and Flower for incident triage.

How do I trace a single request across the queue boundary? Propagate the OpenTelemetry trace context through the message itself: inject the traceparent into the payload when you enqueue, and extract it as the parent span context when the worker picks the job up. That stitches the producer's API span, the wait in the broker, and the worker's execution into one trace. The mechanics and Celery signal hooks are in Distributed Tracing for Async Jobs.

What scrape interval should I use for worker metrics? 15 seconds is a common choice and a good starting point (Prometheus's own default, if you configure nothing, is 1 minute): it balances resolution against storage and load. Drop to 10s only for very latency-sensitive alerting where 15s of detection delay matters. Critically, never poll your broker for counts faster than Prometheus scrapes them; align any in-app gauge-update loop to the scrape interval so you do not waste Redis round-trips producing data nobody reads.

How do I keep Prometheus from blowing up on cardinality? Treat every label as a cost multiplier and ban unbounded values from labels entirely. Queue name, task name, and status are bounded and safe. Job IDs, user IDs, customer names, and URLs are unbounded: each unique value spawns a new time series, and a few high-cardinality labels can multiply into millions of series and exhaust memory. Push that detail into structured logs and traces, which are built for high cardinality.
