Flower for Real-Time Celery Monitoring

Flower is the web-based, real-time monitoring tool for Celery, and this guide covers it as the live-inspection layer of Observability & Monitoring for Job Queues. Where Prometheus answers "is the system healthy over the last hour", Flower answers the question you actually have during an incident: "which task is stuck right now, what arguments did it receive, and which worker is running it".

Flower attaches to Celery's event stream (the same task-sent, task-started, task-succeeded, task-failed, and task-retried events that exporters consume) and renders them as a live dashboard. It shows every active, scheduled, reserved, and recently completed task and the state and concurrency of every worker, and it exposes a REST API to inspect and control the fleet programmatically. It is the fastest way to get visibility into a Celery deployment, and it installs with a single pip install.
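To make "attaches to the event stream" concrete, here is a toy model of what a consumer like Flower does with those events: it folds each one into a per-task record keyed by UUID. This is a minimal sketch, not Flower's code. The event dicts are simplified stand-ins carrying only the type, uuid, hostname, timestamp, and name fields that Celery task events include, and the sample events are hypothetical.

```python
# Toy model of how an event consumer such as Flower folds Celery task events
# into live per-task state. Simplified: real events carry more fields, and
# Celery's own celery.events.state.State does more (e.g. out-of-order handling).
from typing import Any, Dict

# Task event type -> the task state it implies (Celery's state names).
EVENT_TO_STATE = {
    "task-sent": "PENDING",
    "task-received": "RECEIVED",
    "task-started": "STARTED",
    "task-succeeded": "SUCCESS",
    "task-failed": "FAILURE",
    "task-retried": "RETRY",
    "task-revoked": "REVOKED",
}


def apply_event(tasks: Dict[str, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Fold one event into the in-memory task table, keyed by task UUID."""
    state = EVENT_TO_STATE.get(event["type"])
    if state is None:
        return  # not a task event (e.g. worker-heartbeat): nothing to record
    record = tasks.setdefault(event["uuid"], {"uuid": event["uuid"]})
    record["state"] = state
    if "name" in event:
        record["name"] = event["name"]
    if event["type"] != "task-sent":
        record["worker"] = event["hostname"]  # task-sent comes from the producer, not a worker
    record[state.lower()] = event["timestamp"]  # one timestamp per transition


# Hypothetical event sequence for a single task.
tasks: Dict[str, Dict[str, Any]] = {}
for ev in [
    {"type": "task-sent", "uuid": "a1", "name": "app.tasks.report",
     "hostname": "web-1", "timestamp": 100.0},
    {"type": "task-received", "uuid": "a1", "name": "app.tasks.report",
     "hostname": "worker1@host", "timestamp": 100.5},
    {"type": "task-started", "uuid": "a1", "hostname": "worker1@host", "timestamp": 101.0},
    {"type": "worker-heartbeat", "hostname": "worker1@host", "timestamp": 101.2},
]:
    apply_event(tasks, ev)

print(tasks["a1"]["state"], tasks["a1"]["received"] - tasks["a1"]["pending"])
```

Keeping one timestamp per transition is what lets a dashboard show queue wait (received minus pending) and time on the worker. The PENDING entry only exists when task-sent events are enabled.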

Problem Framing: The Live-State Gap

Aggregated metrics deliberately discard per-task detail to stay cheap and bounded. That is exactly why they cannot tell you which of the 4,000 in-flight jobs is the one hanging on a deadlocked row. During an incident you need the opposite of aggregation: the specific job ID, its exact arguments, its current runtime, and the worker hosting it, so you can decide whether to revoke it, restart the worker, or let it finish. Flower fills that live-state gap. It is not a replacement for metrics (it has almost no history and no real alerting), but it is the right tool for the triage minute.

Figure: Flower event-stream architecture. Celery workers publish lifecycle events (task-sent, task-started, task-succeeded) to the broker's event bus. Flower consumes them, holds recent state in memory, and serves a web UI and REST API. Through these, an operator views the fleet and sends control commands (triage, revoke) back to the workers.

Architecture & Running Flower

Flower runs as a standalone process that connects to the same broker as your workers. It needs task events enabled β€” without them it sees worker presence but no task lifecycle detail. Enable events on the worker side and start Flower against the broker URL.

# celery_app.py: Flower depends on the event stream being on
from celery import Celery

app = Celery("celery_app", broker="redis://redis:6379/0")
app.conf.update(
    worker_send_task_events=True,   # required for the Tasks view to populate
    task_send_sent_event=True,      # producers emit task-sent, so you can see queue wait
)

# Run Flower against the broker. --broker is a global celery option, so it goes
# before the "flower" subcommand; --persistent keeps state across restarts.
celery -A celery_app --broker=redis://redis:6379/0 flower \
  --port=5555 \
  --persistent=True \
  --db=/var/lib/flower/flower.db \
  --max_tasks=10000        # cap retained task history to bound memory

In containers, run Flower as its own service so it scales and restarts independently of the workers:

# docker-compose.yml
services:
  flower:
    image: mher/flower:latest
    command: ["celery", "--broker=redis://redis:6379/0", "flower", "--port=5555"]
    ports: ["5555:5555"]
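    environment:
      - FLOWER_PERSISTENT=True
      - FLOWER_DB=/data/flower.db     # keep the DB on the mounted volume
      - FLOWER_MAX_TASKS=10000
    volumes:
      - flower-data:/data             # named volume so history survives container recreation
volumes:
  flower-data: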

The Tasks and Workers Views

The Tasks view is the core of Flower. It lists every task Flower has observed, filterable by state (active, succeeded, failed, retried, revoked), by name, and by worker. Each row drills into the task UUID, the arguments and keyword arguments it was called with, its runtime, the result or exception traceback, and the retry count. This is the view you live in during an incident: filter to FAILURE, read the traceback, copy the args, reproduce locally.
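The same triage loop can be scripted against Flower's REST API. Below is a minimal standard-library sketch. It assumes an unauthenticated Flower at the hypothetical FLOWER_URL, and that /api/tasks returns a mapping of task UUID to a record with state, name, worker, args, kwargs, exception, traceback, and failed fields. fetch_tasks and summarize_failures are illustrative helpers, not part of Flower.

```python
# Sketch: the Tasks-view triage loop (filter to FAILURE, read the traceback,
# copy the args) as a script. Assumptions: FLOWER_URL is hypothetical, Flower is
# reachable without auth, and /api/tasks returns {uuid: record}.
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

FLOWER_URL = "http://flower:5555"  # hypothetical; point at your instance


def fetch_tasks(state: str, limit: int = 100) -> Dict[str, Dict[str, Any]]:
    """GET /api/tasks filtered by state (network call; not run below)."""
    query = urllib.parse.urlencode({"state": state, "limit": limit})
    with urllib.request.urlopen(f"{FLOWER_URL}/api/tasks?{query}", timeout=10) as resp:
        return json.load(resp)


def summarize_failures(tasks: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep what you need to reproduce a failure locally, newest failure first."""
    rows = [
        {
            "uuid": uuid,
            "name": t.get("name"),
            "worker": t.get("worker"),
            "args": t.get("args"),        # usually repr strings, e.g. "[42]"
            "kwargs": t.get("kwargs"),
            "exception": t.get("exception"),
            "traceback": t.get("traceback"),
            "failed": t.get("failed") or 0.0,
        }
        for uuid, t in tasks.items()
        if t.get("state") == "FAILURE"
    ]
    return sorted(rows, key=lambda r: r["failed"], reverse=True)


# Offline example with hypothetical records; live: summarize_failures(fetch_tasks("FAILURE"))
sample = {
    "u1": {"state": "FAILURE", "name": "app.tasks.broken_report", "worker": "worker1@host",
           "args": "[42]", "kwargs": "{}", "exception": "KeyError('region')",
           "traceback": "Traceback (most recent call last): ...", "failed": 200.0},
    "u2": {"state": "SUCCESS", "name": "app.tasks.ok", "worker": "worker1@host"},
}
for row in summarize_failures(sample):
    print(row["uuid"], row["name"], row["args"], row["exception"])
```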

The Workers view shows each worker's status, the pool implementation and concurrency, the number of active and processed tasks, and per-worker load average. From here you can issue control commands without touching a shell: grow or shrink a worker's pool to adjust its concurrency, rate-limit a task, or revoke a runaway job. The Broker view (Redis/RabbitMQ) shows queue lengths, which is your live read on backlog.
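Behind the Workers view, each control action is an HTTP POST to Flower. The sketch below only builds the requests for the pool-grow and rate-limit actions, without sending them. The endpoint paths and form fields (n, workername, ratelimit) follow Flower's API documentation as we understand it. FLOWER_URL, control_request, and the worker and task names are hypothetical.

```python
# Sketch: Workers-view control actions as Flower REST requests (built, not sent).
# Assumptions: endpoint paths/fields as in Flower's API docs; names are hypothetical.
import urllib.parse
import urllib.request

FLOWER_URL = "http://flower:5555"  # hypothetical


def control_request(path: str, **params: str) -> urllib.request.Request:
    """Build a form-encoded POST to a Flower control endpoint."""
    body = urllib.parse.urlencode(params).encode()
    return urllib.request.Request(f"{FLOWER_URL}{path}", data=body, method="POST")


# Add 4 processes to one worker's pool (the UI's "pool grow").
grow = control_request("/api/worker/pool/grow/worker1@host", n="4")
# Cap a noisy task at 10 executions per minute on that worker.
limit = control_request("/api/task/rate-limit/app.tasks.broken_report",
                        workername="worker1@host", ratelimit="10/m")

# To send one: urllib.request.urlopen(grow, timeout=10)
```

Keeping construction separate from sending makes a runbook step easy to review or dry-run before it touches production workers.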

The REST API and Programmatic Control

Everything in the UI is also a JSON endpoint, which makes Flower a control plane, not just a viewer. This is useful for runbooks and automation, for example revoking all instances of a misbehaving task during an incident or scripting a concurrency change.

# List started (running) tasks as JSON; quote the URL so the shell leaves "?" alone
curl -s 'http://flower:5555/api/tasks?state=STARTED' | jq '.[] | {uuid, name, worker, started}'

# Revoke and terminate a stuck task by UUID (quoted so "<", ">" and "?" are not shell syntax)
curl -X POST 'http://flower:5555/api/task/revoke/<task-uuid>?terminate=true'

# Grow a worker's pool by 4 processes during a backlog spike
curl -X POST http://flower:5555/api/worker/pool/grow/worker1@host \
  -d 'n=4'

# A runbook step: revoke every instance of a known-bad task that workers have
# received but not yet started (Flower's RECEIVED state)
import requests

FLOWER = "http://flower:5555"
resp = requests.get(f"{FLOWER}/api/tasks", params={"state": "RECEIVED"}, timeout=10)
resp.raise_for_status()
for uuid, t in resp.json().items():
    if t.get("name") == "app.tasks.broken_report":
        requests.post(f"{FLOWER}/api/task/revoke/{uuid}",
                      params={"terminate": "true"}, timeout=10).raise_for_status()
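For the other incident question, "which task is stuck right now", you can rank the records from /api/tasks?state=STARTED by how long they have been running. This is a minimal sketch. It assumes each record carries a Unix started timestamp, since Flower's runtime field is typically only filled in once a task finishes. stuck_tasks, the sample data, and the 300-second threshold are hypothetical.

```python
# Sketch: rank running tasks by elapsed time to find the stuck ones.
# Assumptions: records shaped like Flower's /api/tasks output with a Unix
# "started" timestamp; the threshold is an arbitrary example value.
import time
from typing import Any, Dict, List, Optional, Tuple


def stuck_tasks(tasks: Dict[str, Dict[str, Any]], threshold_s: float = 300.0,
                now: Optional[float] = None) -> List[Tuple[str, str, str, float]]:
    """Return (uuid, name, worker, elapsed_s) for STARTED tasks past the threshold, longest first."""
    now = time.time() if now is None else now
    out = []
    for uuid, t in tasks.items():
        started = t.get("started")
        if t.get("state") != "STARTED" or started is None:
            continue
        elapsed = now - started
        if elapsed >= threshold_s:
            out.append((uuid, t.get("name", "?"), t.get("worker", "?"), elapsed))
    return sorted(out, key=lambda row: row[3], reverse=True)


# Hypothetical records; live: pass the parsed JSON from /api/tasks?state=STARTED.
sample = {
    "a": {"state": "STARTED", "name": "app.tasks.sync", "worker": "w1@host", "started": 1000.0},
    "b": {"state": "STARTED", "name": "app.tasks.report", "worker": "w2@host", "started": 1500.0},
    "c": {"state": "STARTED", "name": "app.tasks.quick", "worker": "w1@host", "started": 1990.0},
}
for row in stuck_tasks(sample, threshold_s=300.0, now=2000.0):
    print(row)
```

The started timestamp comes from the worker's clock, so large clock skew between worker hosts and the machine running the script will distort the elapsed times.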

Persistence and Its Limits

By default Flower holds all state in memory: restart it and the task history is gone. The --persistent=True --db=... flags write a local shelve database so history survives restarts, and --max_tasks caps how many task records are retained, bounding memory and disk use. This is enough for short-lived operational history, but it is emphatically not a metrics store. There is no downsampling, no long retention, and no efficient time-range query, and the local DB does not survive the pod being rescheduled to a new node unless you mount durable storage.
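The effect of --max_tasks is a bounded store: once the cap is reached, older records are dropped to make room. The toy class below illustrates that behaviour with an OrderedDict that evicts the least recently updated record first. It is a sketch, not Flower's implementation (to our understanding Flower passes the cap to Celery's in-memory event state), and BoundedTaskStore is a hypothetical name.

```python
# Toy model of a --max_tasks-style cap: keep at most max_tasks records and
# evict the least recently updated one first. Illustrative only, not Flower's code.
from collections import OrderedDict
from typing import Any, Dict


class BoundedTaskStore:
    def __init__(self, max_tasks: int) -> None:
        self.max_tasks = max_tasks
        self._tasks: "OrderedDict[str, Dict[str, Any]]" = OrderedDict()

    def update(self, uuid: str, **fields: Any) -> None:
        record = self._tasks.pop(uuid, {})  # re-insert below to mark as most recent
        record.update(fields)
        self._tasks[uuid] = record
        while len(self._tasks) > self.max_tasks:
            self._tasks.popitem(last=False)  # drop the oldest-touched record

    def __contains__(self, uuid: object) -> bool:
        return uuid in self._tasks

    def __len__(self) -> int:
        return len(self._tasks)


store = BoundedTaskStore(max_tasks=3)
for i in range(5):
    store.update(f"t{i}", state="SUCCESS")
print(len(store), "t0" in store, "t4" in store)
```

This is also why persistence alone does not shrink memory: the cap, not the DB file, decides how many records Flower holds.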

# Persistent Flower with a mounted volume so history survives container rescheduling.
# /data must be durable storage; tune --max_tasks against available memory.
# (A comment after a trailing "\" would break the line continuation, so comments sit up here.)
celery -A celery_app flower \
  --persistent=True \
  --db=/data/flower.db \
  --max_tasks=50000

Trade-off Analysis: Flower vs. Prometheus

| Capability | Flower | Prometheus + Grafana |
|---|---|---|
| Per-task detail (args, traceback) | Yes, core strength | No, intentionally aggregated |
| Real-time worker control (revoke, grow) | Yes, via UI/API | No |
| Long-term history & trends | Minimal, ephemeral | Yes, weeks–months |
| Percentile latency (p99) | No | Yes, via histograms |
| Robust alerting | No | Yes, Alertmanager |
| Multi-cluster / multi-broker | One broker per instance | Federated |
| Setup cost | Trivial (pip install) | Moderate |

The two are complementary, not competing. Run Prometheus and Grafana for trends, percentiles, and paging, as covered in Prometheus Metrics for Workers; run Flower for the live, per-task triage that aggregated metrics structurally cannot provide.

Failure Modes & Recovery

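Tasks view is empty. Task events are not enabled. Recovery: set worker_send_task_events=True and restart the workers; confirm the workers, not just Flower, were redeployed with the setting. On a running fleet, celery -A celery_app control enable_events turns events on immediately, but that change is lost when the workers restart.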

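Flower memory keeps growing. The --max_tasks cap is sized higher than the available memory can hold (Flower's default is 10,000 tasks). Recovery: set --max_tasks to a value sized to available memory. Persistence does not reduce RAM use, because retained records still live in memory and are additionally written to the DB, so --max_tasks is the actual bound.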

History lost on every deploy. Persistence off, or the DB on ephemeral container storage. Recovery: enable --persistent and mount the --db path on durable storage so a reschedule does not wipe it.

Flower exposed to the internet without auth. A wide-open Flower lets anyone revoke tasks and read job arguments, which is a serious exposure. Recovery: never run it unauthenticated in production; the full hardening procedure is in Securing the Flower dashboard in production.
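Once authentication is on, every API call in your runbooks has to carry credentials. Below is a minimal standard-library sketch. It assumes Flower was started with its HTTP Basic auth option (--basic-auth=user:password) and sits behind TLS. FLOWER_URL, authed_request, and the credential values are hypothetical, and OAuth-based setups need a different flow.

```python
# Sketch: calling an authenticated Flower API with HTTP Basic auth.
# Assumptions: Flower runs with --basic-auth behind TLS; URL and credentials
# below are hypothetical placeholders.
import base64
import os
import urllib.request

FLOWER_URL = os.environ.get("FLOWER_URL", "https://flower.internal:5555")


def authed_request(path: str, user: str, password: str,
                   method: str = "GET") -> urllib.request.Request:
    """Build a request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode("ascii")
    req = urllib.request.Request(f"{FLOWER_URL}{path}", method=method)
    req.add_header("Authorization", f"Basic {token}")
    return req


# Usage (credentials from the environment, never hard-coded):
#   req = authed_request("/api/workers", os.environ["FLOWER_USER"], os.environ["FLOWER_PASSWORD"])
#   urllib.request.urlopen(req, timeout=10)
```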

Performance Tuning

Flower's cost is dominated by event volume and retained task count. On a very high-throughput fleet the event stream itself can become heavy. Cap retention with --max_tasks, and give each broker its own dedicated Flower instance rather than trying to cover several brokers with one. Because the dashboard pushes live updates over WebSocket, keep the number of open browser tabs modest on busy fleets. For anything beyond live triage (capacity planning, SLO tracking, alerting), push that load onto Prometheus rather than asking Flower to be a time-series database it was never designed to be.
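A back-of-envelope estimate shows why event volume is the cost to watch. The sketch makes several assumptions. Each task emits task-received, task-started, and one terminal event, plus task-sent when task_send_sent_event is on. Each worker node sends one heartbeat per heartbeat_interval_s. Retries and other event types are ignored. The 2-second interval and the fleet figures are assumptions for illustration, not measured values.

```python
# Back-of-envelope event volume for a Celery fleet. Assumptions: 3 task events per
# task (received, started, terminal) plus task-sent if enabled; one heartbeat per
# worker node per interval; retries and other event types ignored.


def events_per_second(tasks_per_s: float, workers: int, sent_events: bool = True,
                      heartbeat_interval_s: float = 2.0) -> float:
    """Approximate events/s that Flower must consume and hold state for."""
    per_task = 3 + (1 if sent_events else 0)
    task_events = tasks_per_s * per_task
    heartbeat_events = workers / heartbeat_interval_s  # workers = worker nodes, not pool processes
    return task_events + heartbeat_events


print(events_per_second(2000, 40))                     # hypothetical busy fleet
print(events_per_second(2000, 40, sent_events=False))
```

Under these assumptions, events outnumber tasks roughly four to one, and that stream is what both the broker and Flower have to carry.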

FAQ

Is Flower a replacement for Prometheus and Grafana? No, and trying to use it as one leads to pain. Flower excels at live, per-task inspection and worker control but has only ephemeral history, no percentile latency, and no real alerting. Prometheus and Grafana give you trends, p99 latency, long retention, and paging. Run both: Flower for the incident triage minute, Prometheus for everything time-based.

Does Flower add load to my Celery workers? Indirectly and usually negligibly. Flower itself runs as a separate process, but it requires worker_send_task_events=True, which makes workers publish lifecycle events to the broker. On most fleets that overhead is tiny; on extreme-throughput fleets the event volume is worth measuring, and you can cap Flower's retained history with --max_tasks to keep its own footprint bounded.

How do I keep task history across restarts? Run Flower with --persistent=True and point --db at a path on durable storage, then bound it with --max_tasks. Without persistence all state is in memory and vanishes on restart; with persistence on ephemeral container storage it vanishes when the pod is rescheduled, so the volume must be durable.

Can I control workers from Flower or only watch them? You can control them. The Workers view and the REST API let you revoke and terminate tasks, grow or shrink a worker's pool, set rate limits, and shut workers down. That power is exactly why Flower must be authenticated in production: an open instance is a remote control for your entire job fleet.

Related