Prometheus Metrics for Workers

This guide goes deep on the metrics layer of Observability & Monitoring for Job Queues: how to design, expose, scrape, and alert on worker metrics with Prometheus so that queue depth, saturation, latency, and failure rate become first-class time series you can graph and page on.

Prometheus is a pull-based system. Your workers do not push anywhere; instead each worker (or a sidecar exporter beside it) exposes a plain-text /metrics HTTP endpoint, and the Prometheus server scrapes that endpoint on a fixed interval, storing every sample in its local time-series database. This model is robust (a worker that crashes simply stops being scraped, which is itself a signal), but it imposes two design constraints that shape everything else: metrics must be cheap to expose on every scrape, and label cardinality must stay bounded because every unique label set is a separate stored series.
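
To make the "every unique label set is a separate stored series" point concrete, here is a minimal sketch that renders the plain-text payload a scrape would receive. It uses prometheus_client with a private registry; the metric name demo_jobs_total and its label values are hypothetical, chosen only for illustration.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A private registry keeps this sketch isolated from any global metrics.
registry = CollectorRegistry()

# Hypothetical metric and label names, used only for this illustration.
jobs = Counter(
    "demo_jobs_total",
    "Jobs that finished, by queue and status",
    ["queue", "status"],
    registry=registry,
)

# Each distinct label combination becomes its own stored series.
jobs.labels(queue="email", status="success").inc(3)
jobs.labels(queue="email", status="failure").inc()
jobs.labels(queue="reports", status="success").inc(2)

# This is the plain-text payload a scrape of /metrics would return.
payload = generate_latest(registry).decode("utf-8")
print(payload)
```

Three label combinations produce three demo_jobs_total lines in the output, which is why adding a label multiplies the series count rather than adding to it.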

Problem Framing

The naive instinct is to log everything and grep later. That fails for queue health because the questions you need answered in an incident ("is the backlog draining", "what is the p99 right now", "has the failure rate doubled") are aggregate, time-windowed, and need to be answered in milliseconds while you are paging. Logs cannot do that at scale; pre-aggregated metrics can. The job here is to pick the right instrument type for each of the four queue signals, expose them efficiently, and write recording and alerting rules that turn raw counters into operational answers.

Figure: Choosing the right Prometheus instrument per queue signal. Each instrument type is shown with the queue signal it captures and the query function used to read it.
Counter (monotonic, only goes up): failure / retry rate, e.g. jobs_total{status=...}, read with rate() / increase().
Gauge (goes up and down): backlog and saturation, e.g. queue_depth, inflight, read with predict_linear().
Histogram (bucketed distribution): task latency p50 / p95 / p99, read with histogram_quantile().
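
The instrument summary above pairs the backlog gauge with predict_linear(). As a rough illustration of what that function does, the sketch below fits a least-squares line through recent gauge samples and extrapolates it forward. It is a simplification: it measures the horizon from the newest sample rather than from the query evaluation time, and the backlog samples are made up.

```python
def predict_linear(samples, horizon_s):
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it horizon_s seconds past the newest sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    t_last = samples[-1][0]
    xs = [t - t_last for t, _ in samples]  # times relative to newest sample
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var
    intercept = mean_y - slope * mean_x  # fitted value at the newest sample
    return intercept + slope * horizon_s


# Hypothetical backlog gauge scraped every 15 s, growing by 2 jobs per second.
backlog = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 190.0)]
projected = predict_linear(backlog, 600)
print(projected)
```

A projection like this is what lets a backlog alert fire on "the queue will be too deep in ten minutes" instead of waiting for it to already be too deep.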

Architecture: Pull, Scrape, Store

The data path is: worker process → exporter /metrics endpoint → Prometheus scrape → TSDB → recording/alert rule evaluation → Grafana queries and Alertmanager pages. For a Celery fleet the topology looks like this:

 ┌──────────────┐   listens to    ┌─────────────────┐  scrape /metrics  ┌────────────┐
 │ Celery worker│── task events ─▶│ celery-exporter │◀───── 15s ────────│ Prometheus │
 │ worker N ... │   (broker bus)  │  :9808/metrics  │                   │   TSDB     │
 └──────────────┘                 └─────────────────┘                   └─────┬──────┘
                                                                              │ rules
                                                                  ┌───────────┴───────────┐
                                                                  ▼                       ▼
                                                           Grafana panels           Alertmanager

The exporter exists because Celery has no native /metrics endpoint; it instead emits a stream of lifecycle events on the broker. The exporter subscribes to that stream and translates events into metrics. BullMQ and Sidekiq differ: their state lives in Redis and is read directly, so the "exporter" is a thin polling service you write or a gem you add. The scrape config below targets all three.

# prometheus.yml: scrape config for a mixed worker fleet
global:
  scrape_interval: 15s          # default cadence; do not poll brokers faster than this
  evaluation_interval: 15s      # how often recording/alert rules run

rule_files:
  - "rules/queue_recording.yml"
  - "rules/queue_alerts.yml"

scrape_configs:
  - job_name: "celery"
    static_configs:
      - targets: ["celery-exporter:9808"]   # one exporter per broker/cluster
  - job_name: "bullmq"
    static_configs:
      - targets: ["bull-metrics:3000"]       # your getJobCounts() polling service
  - job_name: "sidekiq"
    static_configs:
      - targets: ["sidekiq:9394"]            # yabeda-sidekiq default port
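
For the "thin polling service" case described above, one possible shape is a custom collector that reads counts from the broker at scrape time and emits them as gauges. The sketch below is an assumption-laden illustration in Python: fake_read_counts stands in for whatever broker call your service makes, and the metric name queue_jobs and its labels are hypothetical.

```python
from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import GaugeMetricFamily


class QueueDepthCollector:
    """Builds gauges from broker state each time Prometheus scrapes."""

    def __init__(self, read_counts):
        # read_counts: callable returning {queue: {state: count}}
        self._read_counts = read_counts

    def collect(self):
        family = GaugeMetricFamily(
            "queue_jobs",
            "Jobs per queue and state, read from the broker",
            labels=["queue", "state"],
        )
        for queue, counts in self._read_counts().items():
            for state, value in counts.items():
                family.add_metric([queue, state], value)
        yield family


def fake_read_counts():
    # Stand-in for a real broker query; the numbers are invented.
    return {"email": {"waiting": 42, "active": 3}}


registry = CollectorRegistry()
registry.register(QueueDepthCollector(fake_read_counts))
payload = generate_latest(registry).decode("utf-8")
print(payload)
```

Because the values are read when the scrape arrives, the broker is queried once per scrape interval and the gauge never outlives the state it describes.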

Implementation 1: prometheus_client Instrument Design

When you own the worker code, instrument it directly with the official prometheus_client. The art is choosing the right instrument per signal. The rule: counters for things that only go up (totals you take a rate() of), gauges for things that go up and down (current backlog, in-flight count), and histograms for distributions (the only way to get percentiles).

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# --- Failure / retry rate: a counter, sliced by terminal status ---
JOBS_TOTAL = Counter(
    "worker_jobs_total",
    "Jobs that reached a terminal state",
    ["queue", "task", "status"],     # status in {success, failure, retry}
)

# --- Queue depth & saturation: gauges (instantaneous values) ---
QUEUE_DEPTH = Gauge("worker_queue_depth", "Jobs waiting, not yet picked up", ["queue"])
INFLIGHT = Gauge("worker_inflight_jobs", "Jobs currently executing", ["queue"])

# --- Latency: histograms, split into wait and execution ---
QUEUE_WAIT = Histogram(
    "worker_queue_wait_seconds",
    "Time from enqueue to dequeue (scheduling latency)",
    ["queue"],
    buckets=(0.01, 0.05, 0.1, 0.5, 1, 5, 15, 60, 300),
)
EXEC_TIME = Histogram(
    "worker_exec_seconds",
    "Time from dequeue to ack (handler latency)",
    ["queue", "task"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300),
)

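class Retryable(Exception):
    """Placeholder for whatever exception your queue library raises to request a retry."""


def process(job):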
    QUEUE_WAIT.labels(job.queue).observe(job.dequeued_at - job.enqueued_at)
    INFLIGHT.labels(job.queue).inc()
    try:
        with EXEC_TIME.labels(job.queue, job.task).time():
            job.run()
        JOBS_TOTAL.labels(job.queue, job.task, "success").inc()
    except Retryable:
        JOBS_TOTAL.labels(job.queue, job.task, "retry").inc()
        raise
    except Exception:
        JOBS_TOTAL.labels(job.queue, job.task, "failure").inc()
        raise
    finally:
        INFLIGHT.labels(job.queue).dec()

if __name__ == "__main__":
    start_http_server(8000)   # serves /metrics on :8000 from a background thread; the worker loop must keep the process alive

Two subtleties bite people in production. First, histogram buckets must match your real latency range: the library defaults top out at 10 seconds, so a 5-minute ETL job lands entirely in the +Inf bucket and histogram_quantile cannot report anything above the highest finite bound (10s), however slow the job really is. Set buckets that bracket your actual p50 through p99. Second, in a forked worker model (Celery prefork, Gunicorn) each child process has its own metrics registry; use prometheus_client's multiprocess mode with a shared PROMETHEUS_MULTIPROC_DIR, or scrape an exporter that aggregates across processes instead.
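
A minimal sketch of the multiprocess wiring follows, assuming it runs as a standalone script so that the environment variable is set before prometheus_client is first imported. The temporary directory stands in for a real shared path, the metric name is hypothetical, and a single process plays the role of one forked child to keep the example short.

```python
import os
import tempfile

# Must be set before prometheus_client is imported: the library chooses its
# value storage backend at import time. A temp dir stands in for a real path.
os.environ["PROMETHEUS_MULTIPROC_DIR"] = tempfile.mkdtemp(prefix="prom_mp_")

from prometheus_client import CollectorRegistry, Counter, generate_latest, multiprocess

# Hypothetical metric; in multiprocess mode its value lives in a file
# under PROMETHEUS_MULTIPROC_DIR rather than only in process memory.
JOBS = Counter("demo_mp_jobs_total", "Jobs processed", ["queue"])


def child_work():
    # What each forked worker child would do while handling a job.
    JOBS.labels("email").inc()


def render_metrics():
    # The scrape handler builds a fresh registry and lets the collector
    # merge every child's files from the shared directory.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry).decode("utf-8")


child_work()
payload = render_metrics()
print(payload)
```

In a real prefork deployment every child writes its own files into the shared directory, and the directory should be emptied between worker restarts so stale files from old processes are not merged into new scrapes.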

Implementation 2: celery-exporter (No Code Changes)

When you cannot or do not want to edit task code, celery-exporter gives you the four signals from Celery's event stream alone. Enable events on the workers, run the exporter as a sidecar, and point Prometheus at it.

# celery_app.py: turn on the event stream the exporter consumes
app.conf.update(
    worker_send_task_events=True,   # publish task-started / -succeeded / -failed / -retried
    task_send_sent_event=True,      # publish task-sent at enqueue (enables queue-wait metrics)
)

# docker-compose.yml: exporter as a sidecar against the same broker
services:
  celery-exporter:
    image: danihodovic/celery-exporter:latest
    command: ["--broker-url=redis://redis:6379/0"]
    ports: ["9808:9808"]

The exporter publishes celery_task_sent_total, celery_task_succeeded_total, celery_task_failed_total, celery_task_retried_total (counters, sliced by name and queue) and celery_task_runtime_seconds (a histogram). It also exposes celery_queue_length by reading the broker directly, which is your backlog gauge. Exact metric and label names differ between exporter versions, so confirm them against your exporter's /metrics output before writing rules against them. This is the lowest-friction path to a fully instrumented Celery fleet; the end-to-end walkthrough is in Instrumenting Celery with a Prometheus exporter.

Recording Rules: Pre-Compute the Expensive Queries

Dashboards and alerts repeatedly evaluate the same rate() and histogram_quantile() expressions. Recording rules compute them once per evaluation interval and store the result as a new, cheap series, so every panel and alert reads a single value instead of re-aggregating thousands of raw samples.

# rules/queue_recording.yml
groups:
  - name: queue_derived
    interval: 15s
    rules:
      # Throughput: successful jobs per second, per queue
      - record: job:throughput:rate5m
        expr: sum(rate(celery_task_succeeded_total[5m])) by (queue)

      # Failure rate: failures as a fraction of completed jobs
      - record: job:failure_rate:5m
        expr: |
          sum(rate(celery_task_failed_total[5m])) by (queue)
          /
          (sum(rate(celery_task_succeeded_total[5m])) by (queue)
           + sum(rate(celery_task_failed_total[5m])) by (queue))

      # p99 execution latency, pre-computed from the runtime histogram
      - record: job:exec_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(celery_task_runtime_seconds_bucket[5m])) by (le, queue))

Trade-off Analysis: Instrument Strategies

Strategy                     | Code changes              | Cardinality control  | Covers queue wait?         | Best when
-----------------------------|---------------------------|----------------------|----------------------------|-----------------------------------------------
Direct prometheus_client     | Yes, in task code         | Full, you own labels | Yes, if you instrument it  | You control the workers and want custom labels
celery-exporter sidecar      | None (just enable events) | Fixed by exporter    | Yes (task-sent → started)  | Celery, fast rollout, no code edits
BullMQ getJobCounts() poller | Small service             | Full                 | Approximate (Redis counts) | Node/BullMQ fleets
yabeda-sidekiq               | Add gem                   | Good defaults        | Via queue latency          | Ruby/Sidekiq fleets

Failure Modes & Recovery

Cardinality explosion. Someone adds job_id (or user_id, or a URL) as a label. Each unique value becomes a new series that Prometheus holds in memory and on disk until retention expires; memory climbs until Prometheus OOMs. Recovery: find the offending metric with topk(10, count by (__name__)({__name__=~".+"})), drop the label at the exporter or with metric_relabel_configs, and never reintroduce unbounded labels.
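
The arithmetic behind a cardinality explosion is simple multiplication, which a small sketch makes visible. The label cardinalities below are hypothetical, and the result is an upper bound, since not every label combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities, buckets=None):
    """Upper bound on stored series for one metric.

    label_cardinalities: {label_name: number_of_distinct_values}
    buckets: finite histogram bounds, or None for a counter or gauge
    """
    combos = prod(label_cardinalities.values())
    if buckets is None:
        return combos  # one series per label combination
    # Histogram: one _bucket series per finite bound, plus +Inf, _sum, _count.
    return combos * (len(buckets) + 3)


# Bucket bounds taken from the worker_exec_seconds histogram in this guide.
EXEC_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300)

# Hypothetical fleet: 5 queues and 40 task names.
bounded = series_count({"queue": 5, "task": 40}, EXEC_BUCKETS)

# Same histogram after someone adds a job_id label with 100,000 values.
exploded = series_count({"queue": 5, "task": 40, "job_id": 100_000}, EXEC_BUCKETS)

print(bounded, exploded)
```

Under these assumptions one extra unbounded label turns a few thousand series into hundreds of millions, which is why the recovery step is to drop the label rather than to add memory.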

Histogram percentiles look wrong (flat at the top bucket). Buckets do not cover the real latency range, so observations pile into +Inf and histogram_quantile cannot report anything above the highest finite bucket bound. Recovery: widen the bucket boundaries to bracket your observed p50 to p99 and redeploy; historical data keeps the old buckets, so the fix is forward-looking.
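
To see why the percentile goes flat, the following sketch reimplements a simplified version of the classic-histogram quantile estimate in plain Python and runs it against two bucket layouts. The workload is invented, and the function is an approximation of the behavior for illustration, not Prometheus's own code.

```python
import math

# prometheus_client's default histogram bounds in seconds; +Inf is implicit.
DEFAULT_BOUNDS = (0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25,
                  0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0)
# Wider bounds, matching the worker_exec_seconds histogram in this guide.
WIDE_BOUNDS = (0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300)


def cumulative_buckets(bounds, observations):
    """Build (upper_bound, cumulative_count) pairs, ending with +Inf."""
    edges = list(bounds) + [math.inf]
    return [(le, sum(1 for o in observations if o <= le)) for le in edges]


def histogram_quantile(q, buckets):
    """Estimate a quantile by linear interpolation inside one bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in +Inf: the estimate is capped at the top finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0) if idx == 0 else buckets[idx - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical workload: 100 ETL jobs, 40 taking 100 s and 60 taking 290 s.
durations = [100.0] * 40 + [290.0] * 60

p99_default = histogram_quantile(0.99, cumulative_buckets(DEFAULT_BOUNDS, durations))
p99_wide = histogram_quantile(0.99, cumulative_buckets(WIDE_BOUNDS, durations))
print(p99_default, p99_wide)
```

With the default bounds every observation sits in +Inf, so the estimate is pinned at 10 seconds; with the wider bounds it lands close to the real slow-job duration.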

Counter resets on deploy read as negative spikes. A worker restart resets counters to zero. Recovery: this is expected; always wrap counters in rate() or increase(), which are reset-aware. Never alert on a raw counter value.
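
What "reset-aware" means can be shown in a few lines. The sketch below sums the growth of a counter across a restart by treating any drop as a reset to zero; it deliberately omits the window-boundary extrapolation that Prometheus's increase() also performs, and the sample values are made up.

```python
def increase(samples):
    """Total counter growth across samples, treating any drop as a reset."""
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        if cur >= prev:
            total += cur - prev  # normal growth between scrapes
        else:
            total += cur  # counter restarted from zero, then grew to cur
        prev = cur
    return total


# Hypothetical scrapes of a jobs counter; the worker restarts after 160.
scrapes = [100.0, 130.0, 160.0, 5.0, 35.0]

naive_delta = scrapes[-1] - scrapes[0]
reset_aware = increase(scrapes)
print(naive_delta, reset_aware)
```

The naive last-minus-first delta goes negative across the restart, while the reset-aware sum keeps counting, which is the reason alerts should be written against rate() or increase() and never the raw value.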

Stale gauge after a worker dies. A gauge like worker_inflight_jobs keeps its last value if the worker vanishes mid-job, hiding the loss. Recovery: prefer reading backlog from the broker (where the truth lives) over in-process gauges, and add an up == 0 alert so a missing scrape target itself pages.

Performance Tuning

Keep the scrape cheap: a /metrics response should render in single-digit milliseconds. The dominant cost is total series count (cardinality), not request volume, so the highest-leverage tuning is pruning labels. Use a 15s scrape_interval as the default and align any in-app polling (the BullMQ getJobCounts() loop) to it; polling Redis every second when you scrape every 15s just burns broker round-trips. For long retention, push to remote-write storage (Thanos, Mimir, Cortex) rather than growing local TSDB unbounded. Put expensive aggregations in recording rules so the read path stays fast under dashboard load. These derived series feed directly into the panels described in Grafana Dashboards for Queues and the backlog alerts in Alerting on queue backlog with Prometheus.

FAQ

Counter, gauge, or histogram: how do I choose for a job metric? Match the instrument to the question. "How many jobs failed since start" is a total that only increases, so it is a counter and you query it with rate(). "How many jobs are waiting right now" goes up and down, so it is a gauge. "What is the p99 execution time" is a distribution, which only a histogram can answer; no combination of counters and gauges reconstructs percentiles after the fact.

Why use the celery-exporter instead of instrumenting tasks directly? The exporter requires zero changes to task code: you flip on Celery's event stream and run a sidecar, and you immediately get throughput, failure, retry, and runtime metrics across every task. Direct instrumentation gives you finer control and custom labels but means editing and redeploying every worker. Most teams start with the exporter and add direct instrumentation only for the handful of tasks that need bespoke business metrics.

How do I expose metrics from a prefork Celery worker without losing per-child data? Each forked child has its own in-process registry, so a plain start_http_server only ever shows one child's numbers. Use prometheus_client's multiprocess mode with a shared PROMETHEUS_MULTIPROC_DIR so children write to a common directory that a single collector aggregates, or sidestep the problem entirely by using the celery-exporter, which reads the broker-level event stream rather than per-process state.

What scrape interval and retention should I run? Start at a 15s scrape interval and tune only with evidence: 10s for latency-critical alerting, slower for cheap background fleets. For retention, local TSDB is fine for weeks; beyond that, ship to remote-write storage like Thanos or Mimir rather than letting the local database grow without bound, and lean on recording rules so long-range queries stay affordable.
