Building Grafana Dashboards for Job Queues

Grafana is where the four signals of queue health become a picture an engineer can read at a glance, and this guide covers it as the visualisation layer of Observability & Monitoring for Job Queues. Prometheus stores the time series; Grafana turns them into the dashboard you stare at during an incident and the panels that drive your alerts.

A good queue dashboard is not a wall of every metric you collect — it is a deliberate, top-to-bottom narrative: backlog and trajectory at the top (is the system keeping up), throughput and saturation in the middle (why), and latency percentiles and failure rate below (what users feel). This guide builds that layout against Prometheus data, makes it reusable across every queue with template variables, and wires panel-level alerts so the dashboard and the paging come from one source of truth.

Problem Framing: From Raw Series to a Readable Story

The metrics from Prometheus Metrics for Workers are correct but unreadable in raw form — a hundred rate() queries across dozens of queues is noise. The dashboard's job is to compress that into the few panels that answer the operator's real questions in order: Are we falling behind? How fast? Is it a throughput problem or a downstream problem? Which queue? Get the panel choice and ordering wrong and the dashboard becomes decoration nobody opens during an incident.

[Figure: Recommended Grafana queue dashboard layout. A $queue template variable at the top filters every panel below it. Top row: backlog depth (queue_depth, "are we keeping up?") and projected time-to-drain (predict_linear, "how fast?"). Middle row: throughput (rate(jobs), jobs/sec by status) and worker saturation (inflight / capacity, "why?"). Bottom row: p50 / p95 / p99 latency (histogram_quantile, "what users feel") and failure rate (failures / completed, alerting above 5%).]

Data Sources

Grafana reads from Prometheus over HTTP. Provision the data source as code so dashboards are reproducible and never depend on someone clicking through the UI. For BullMQ fleets where you scrape your own getJobCounts() poller, the same Prometheus data source serves those series too.

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus           # stable UID so alert rules can reference it via datasourceUid
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"      # match your scrape_interval so rate() ranges line up
      httpMethod: POST          # POST handles long PromQL queries that overflow GET

Panels for Depth, Throughput, and Latency

Each row of the dashboard is one or two panels backed by a PromQL query. Lean on the recording rules from Prometheus Metrics for Workers so panels read cheap pre-computed series.

Backlog and trajectory (top row). A time-series panel of current depth, plus a stat panel projecting time-to-drain.

# Backlog depth for the selected queue (time-series panel)
queue_depth{queue="$queue"}

# Projected seconds to drain at the current net completion rate (stat panel)
queue_depth{queue="$queue"}
  / clamp_min(job:throughput:rate5m{queue="$queue"}, 0.001)
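
To make the stat panel's arithmetic concrete, here is a minimal Python sketch of the same division. The function name `seconds_to_drain` and the sample numbers are illustrative assumptions, not part of the dashboard; `floor` mirrors the 0.001 passed to clamp_min in the query.

```python
def seconds_to_drain(depth: float, throughput_per_sec: float, floor: float = 0.001) -> float:
    """Mirror queue_depth / clamp_min(throughput, floor) for a single queue."""
    # Like clamp_min, max() keeps the divisor positive, so a stalled queue
    # yields a very large finite number instead of a division by zero.
    return depth / max(throughput_per_sec, floor)


if __name__ == "__main__":
    # Hypothetical sample: 12,000 queued jobs completing at 40 jobs/sec.
    print(seconds_to_drain(12_000, 40.0))  # 300.0, i.e. five minutes
    # Same backlog with workers stalled: the clamp turns 0 jobs/sec into 0.001.
    print(seconds_to_drain(12_000, 0.0))
```

A stalled queue therefore shows up on the stat panel as an enormous drain time rather than a gap, which is usually the more useful signal during an incident.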

Throughput and saturation (middle row). Stack throughput by status so a retry storm is visually obvious — a growing red band of failures under a flat green band of successes.

# Jobs/sec by terminal status — stacked series panel
sum(rate(worker_jobs_total{queue="$queue"}[5m])) by (status)

# Worker saturation as a fraction of capacity (0–1 gauge panel)
sum(worker_inflight_jobs{queue="$queue"})
  / sum(worker_pool_size{queue="$queue"})
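
The layout also calls for a failure-rate panel, computed as failures over completed jobs. The ratio can be sketched in Python as below, assuming the per-status rates come from the stacked throughput query and that the failing status label value is `failed`. Both that label value and the function name are assumptions made for illustration, as is treating "completed" as the sum of all terminal statuses.

```python
def failure_ratio(rates_by_status: dict[str, float], failed_status: str = "failed") -> float:
    """Fraction of finished jobs per second that ended in failure."""
    total = sum(rates_by_status.values())
    if total <= 0:
        return 0.0  # no finished jobs in the window, so nothing to report
    return rates_by_status.get(failed_status, 0.0) / total


if __name__ == "__main__":
    # Hypothetical jobs/sec per terminal status for one queue.
    healthy = {"succeeded": 19.8, "failed": 0.2}
    retry_storm = {"succeeded": 18.0, "failed": 2.0}
    for name, rates in (("healthy", healthy), ("retry storm", retry_storm)):
        ratio = failure_ratio(rates)
        # 0.05 mirrors the 5% alert threshold named in the layout.
        print(f"{name}: {ratio:.1%} failed, alert={ratio > 0.05}")
```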

Latency percentiles (bottom row). Plot p50, p95, and p99 on one panel so the gap between typical and tail experience is visible at a glance.

# p50 / p95 / p99 execution latency from the histogram (one query per series)
histogram_quantile(0.50, sum(rate(worker_exec_seconds_bucket{queue="$queue"}[5m])) by (le))
histogram_quantile(0.95, sum(rate(worker_exec_seconds_bucket{queue="$queue"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(worker_exec_seconds_bucket{queue="$queue"}[5m])) by (le))
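
To see why bucket boundaries matter for these panels, the following Python sketch approximates what histogram_quantile does with classic cumulative buckets: it finds the bucket that contains the target rank and interpolates linearly inside it. This is a simplified illustration rather than Prometheus's implementation, and the bucket counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond every finite bucket, so the best
                # available answer is the highest finite upper bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical buckets: 10% of jobs took longer than the top finite bound (1s).
    buckets = [(0.1, 50), (0.5, 80), (1.0, 90), (math.inf, 100)]
    for q in (0.50, 0.75, 0.95, 0.99):
        print(q, histogram_quantile(q, buckets))
```

With these sample buckets, p95 and p99 both come back as 1.0, the highest finite bound, because more than 5% of observations sit in the +Inf bucket. On a latency panel that appears as a flat line pinned at the top bucket.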

Template Variables for Per-Queue Drill-Down

Hardcoding queue="default" into every panel means a new dashboard per queue — unmaintainable. A template variable turns one dashboard into a reusable view across every queue, with a dropdown that rewrites all panels at once. Define the variable as a label_values query so it auto-populates from the metrics themselves.

{
  "templating": {
    "list": [
      {
        "name": "queue",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(queue_depth, queue)",
        "refresh": 2,
        "includeAll": true,
        "multi": true,
        "sort": 1
      }
    ]
  }
}

With multi and includeAll enabled, panels should aggregate over the selection — use =~"$queue" (regex match) rather than ="$queue" in the PromQL so selecting multiple queues or "All" works correctly. This single change is what makes a queue dashboard scale to a fleet without per-queue duplication.
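
The reason the regex matcher is needed can be sketched in Python. For Prometheus data sources, Grafana renders a multi-value selection as a pipe-separated group such as `(a|b)`; the helper below only approximates that rendering (its escaping is Python's, not Grafana's), and the function and queue names are hypothetical.

```python
import re


def render_multi_value(selected: list[str]) -> str:
    """Approximate how a multi-value selection is rendered for Prometheus."""
    escaped = [re.escape(value) for value in selected]
    return escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"


def label_matches(operator: str, rendered: str, label_value: str) -> bool:
    """Evaluate a PromQL-style label matcher; regex matchers are fully anchored."""
    if operator == "=":
        return label_value == rendered
    if operator == "=~":
        return re.fullmatch(rendered, label_value) is not None
    raise ValueError(f"unsupported operator: {operator}")


if __name__ == "__main__":
    rendered = render_multi_value(["default", "emails"])  # hypothetical queue names
    print(rendered)
    for queue in ("default", "emails", "reports"):
        # Exact match compares against the literal group text and never matches.
        print(queue, label_matches("=", rendered, queue), label_matches("=~", rendered, queue))
```

With exact match, the label is compared against the literal text of the group, so no queue ever matches and the panel goes empty; with regex match, each selected queue matches and unselected ones are excluded.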

Dashboard-Managed Alerting

Grafana can own alert rules alongside the panels they visualise, so the threshold you see drawn on a graph is the threshold that pages. Define the rule against the same query and provision it as code.

# grafana/provisioning/alerting/queue.yml — backlog growth alert tied to the depth panel
apiVersion: 1
groups:
  - name: queue_health
    folder: Queues
    interval: 1m
    rules:
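      - uid: backlog_drain_slo        # any stable unique ID; this value is illustrative
        title: BacklogWillBreachDrainSLO
        condition: C
        data:
          - refId: A
            relativeTimeRange: { from: 600, to: 0 }
            datasourceUid: prometheus   # must match the UID of the Prometheus data source
            model:
              refId: A
              instant: true
              expr: |
                queue_depth{queue=~".+"}
                  / clamp_min(job:throughput:rate5m, 0.001) > 600
          - refId: C
            datasourceUid: __expr__     # Grafana's built-in expression engine
            model:
              refId: C
              type: threshold
              expression: A             # apply the threshold to query A
              conditions: [ { evaluator: { type: gt, params: [0] } } ]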
        for: 5m                 # require 5m sustained before paging — suppresses flapping
        labels: { severity: page }
        annotations:
          summary: "Queue {{ $labels.queue }} will not drain within the 10-minute SLO"

Whether Grafana or Alertmanager owns alerting is a team choice — Alertmanager centralises routing across many sources, while Grafana-managed alerts keep the rule visually next to its panel. The deeper backlog-alert design and Alertmanager routing live in Alerting on queue backlog with Prometheus.
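
The effect of `for: 5m` on a rule evaluated every minute can be sketched in Python. This is a simplified model of a pending period, not Grafana's evaluation engine; the function name and the sample series are hypothetical.

```python
from typing import Optional


def first_firing_index(
    drain_seconds: list[float],
    threshold: float = 600.0,   # the 10-minute drain SLO from the rule
    interval_s: int = 60,       # matches interval: 1m
    for_s: int = 300,           # matches for: 5m
) -> Optional[int]:
    """Return the index of the evaluation at which the alert would fire."""
    pending_since: Optional[int] = None
    for i, value in enumerate(drain_seconds):
        if value > threshold:
            if pending_since is None:
                pending_since = i  # first breach starts the pending period
            if (i - pending_since) * interval_s >= for_s:
                return i           # breach sustained for the whole period
        else:
            pending_since = None   # any recovery resets the pending period
    return None


if __name__ == "__main__":
    brief_spike = [700, 700, 700, 500, 500]
    sustained = [700, 700, 700, 500, 700, 700, 700, 700, 700, 700]
    print(first_firing_index(brief_spike))  # None: three bad minutes never page
    print(first_firing_index(sustained))    # 9: five minutes after the breach at index 4
```

In this model a short spike that recovers within the pending period never pages, which is the flapping suppression the rule's comment refers to.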

Trade-off Analysis: Alerting Location and Panel Choices

| Decision | Option A | Option B | Guidance |
|---|---|---|---|
| Alert ownership | Grafana-managed | Alertmanager rules | Alertmanager for fleet-wide routing; Grafana when rule-next-to-panel matters |
| Latency panel | Average gauge | Percentile time-series | Always percentiles; averages hide the tail |
| Backlog panel | Raw depth | Depth + time-to-drain | Add time-to-drain; raw depth lacks trajectory |
| Per-queue views | One dashboard each | One templated dashboard | Templating; per-queue dashboards rot |
| Throughput panel | Single total line | Stacked by status | Stacked; exposes retry storms instantly |

Failure Modes & Recovery

Latency panel shows a flat line at the top bucket. The histogram buckets do not cover real latencies, so the requested quantile falls in the +Inf bucket and histogram_quantile returns the upper bound of the highest finite bucket instead of an interpolated value. Recovery: fix the buckets in the worker instrumentation (see Prometheus Metrics for Workers); Grafana can only render what the histogram captured.

Template variable dropdown is empty. The label_values query targets a metric or label that does not exist, or the data source UID is wrong. Recovery: run the label_values(...) query in Explore to confirm it returns values, and verify the variable's datasource matches the provisioned UID.

Multi-select breaks panels. Panels use ="$queue" (exact match) while the variable is multi-value. Recovery: switch every query to =~"$queue" so the regex match handles multiple selections and the "All" option.

Dashboard edits lost on redeploy. UI edits are not in the provisioned JSON. Recovery: treat provisioned dashboards as read-only, export changes back to the JSON in source control, and redeploy — never hand-edit production dashboards as the source of truth.

Performance Tuning

Heavy dashboards stress Prometheus, not Grafana. The fix is to push aggregation into recording rules so panels read a single pre-computed series instead of recomputing rate() and histogram_quantile() over millions of raw samples on every refresh. Set the dashboard's minimum interval to your scrape interval (15s) so panels never request finer resolution than the data supports. Cap auto-refresh — a 10s refresh on a 15s scrape just hammers Prometheus for data that has not changed; 30s–1m is plenty for a wall display. For very wide fleets, use one templated dashboard with multi-select rather than dozens of static ones, and put the busiest panels behind $queue so an operator loads only what they are looking at. A BullMQ-specific build of this dashboard, panel by panel, is in Building a BullMQ Grafana dashboard.
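
The refresh advice can be checked with rough arithmetic. The Python sketch below is a back-of-the-envelope model under stated assumptions: every panel query is re-run on each refresh, and a refresh is counted as redundant when no new scrape can have landed since the previous one. The function names and the query count are illustrative.

```python
def queries_per_minute(total_queries: int, refresh_s: float, viewers: int = 1) -> float:
    """Queries sent to Prometheus per minute by one open dashboard per viewer."""
    return total_queries * viewers * (60.0 / refresh_s)


def redundant_refresh_fraction(refresh_s: float, scrape_s: float) -> float:
    """Rough share of refreshes that cannot contain a newly scraped sample."""
    if refresh_s >= scrape_s:
        return 0.0
    return 1.0 - refresh_s / scrape_s


if __name__ == "__main__":
    # Assumed: 8 queries in total (six panels, with three series on the latency panel).
    for refresh in (10, 30, 60):
        load = queries_per_minute(8, refresh)
        redundant = redundant_refresh_fraction(refresh, scrape_s=15)
        print(f"refresh={refresh}s: {load:.0f} queries/min, {redundant:.0%} redundant")
```

Under these assumptions, moving from a 10s to a 30s refresh cuts the query load to a third and removes the refreshes that could only re-read unchanged data.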

FAQ

Should alerts live in Grafana or in Prometheus Alertmanager? Both work; pick by how your team routes pages. Alertmanager centralises routing, silencing, and deduplication across every alert source, which is the right home once you have more than a couple of services. Grafana-managed alerts keep the rule visually attached to the panel it watches, which some teams prefer for queue dashboards. The threshold logic is the same either way — what differs is where routing and on-call configuration live.

How do I make one dashboard work for every queue? Use a template variable defined as label_values(queue_depth, queue) so the dropdown auto-populates from your metrics, then reference it as =~"$queue" (regex match, not exact match) in every panel query. With multi-select and "Include All" enabled, one dashboard then covers the whole fleet, and a new queue appears in the dropdown automatically the moment it emits a metric.

Why are my p99 latency panels flat or obviously wrong? Almost always the histogram buckets in the worker instrumentation do not span the real latency range, so every slow observation lands in the +Inf bucket and histogram_quantile cannot interpolate a meaningful value. Grafana renders faithfully what the histogram captured — the fix is in the instrumentation, by setting bucket boundaries that bracket your actual p50 through p99.

Can Grafana read BullMQ or Sidekiq metrics, not just Celery? Yes — Grafana is framework-agnostic because it only reads Prometheus. As long as something exposes the series (the celery-exporter, a BullMQ getJobCounts() poller, or yabeda-sidekiq), the same panels and template variables work. You typically just adjust the metric names in the panel queries to match each exporter's naming.
