Distributed Tracing for Async Jobs

Distributed tracing is the only telemetry that follows a single request across the asynchronous boundary into a queue and back out on a worker, and this guide covers it as the causal-path layer of Observability & Monitoring for Job Queues. Metrics tell you the p99 is slow; a trace tells you which span inside that one slow job consumed the time.

The defining challenge of tracing a queue is the broken causal chain. A synchronous trace runs continuously through in-process calls, but when a producer enqueues a job the request returns immediately, and a fresh, unrelated trace would normally begin when a worker picks the job up — often seconds later, on a different host, in a different process. Without intervention you get two disconnected traces and lose exactly the handoff you most want to see. OpenTelemetry solves this by carrying the trace context inside the message, so the worker's execution span becomes a child of the producer's enqueue span and the whole lifecycle reads as one trace.

Problem Framing: The Broken Causal Chain

Consider a user clicking "export". The API handler validates the request, enqueues a job, and returns 202 in 40ms — the user is happy. Twelve seconds later a worker runs the export, which takes 8 seconds because a database query is slow. Metrics show a healthy API and a slow worker as two unrelated facts. Logs show two unrelated log streams. Only a trace that spans both — API request, the 12-second wait in the broker, the 8-second execution with its slow DB span — reveals the full story and points at the actual culprit. Reconstructing that requires propagating context across the broker.

Architecture: Context Propagation Through the Broker

OpenTelemetry represents the active trace as a context and serialises it with a propagator into a small set of string headers. The default format is the W3C Trace Context standard, whose key field is traceparent (carrying a version, the trace ID, the parent span ID, and trace flags that include the sampled bit). For HTTP, instrumented clients and servers carry this in request headers automatically. A queue message has no HTTP headers, so you inject the context into the message itself (broker headers or the body) at enqueue and extract it at dequeue.

 Producer process                   Broker                 Worker process
 ┌────────────────┐  inject()  ┌──────────────┐ extract() ┌────────────────┐
 │ active context │───────────▶│ message body │──────────▶│ parent context │
 │ traceparent    │  into msg  │ {_otel: ...} │ from msg  │ → execute span │
 └────────────────┘            └──────────────┘           └────────────────┘

The two primitives are inject(carrier), which writes the current context into a dict-like carrier, and extract(carrier), which reconstructs a context from one. The carrier is just a string-keyed dict you stash in the message — Celery headers, a BullMQ job data field, a RabbitMQ message header. Configure a global propagator once so every service agrees on the format.

# otel_setup.py — run once at process start, producer and worker alike
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.propagate import set_global_textmap
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

set_global_textmap(TraceContextTextMapPropagator())   # W3C traceparent, the interop default
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
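
To make the traceparent field concrete, here is a minimal sketch of a parser for the version 00 layout defined by W3C Trace Context: a 2-hex version, a 32-hex trace ID, a 16-hex parent span ID, and 2-hex flags, joined by hyphens. The names TraceParent and parse_traceparent are illustrative and not part of OpenTelemetry; in practice the propagator does this parsing for you.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str    # shared by every span in the trace
    parent_id: str   # span ID of the producer's enqueue span
    sampled: bool    # lowest bit of the trace flags


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the value is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

Calling parse_traceparent on the example value from the W3C specification, 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01, yields that trace ID, that parent span ID, and sampled set to True. A worker that receives a value this function rejects will start a new, disconnected trace.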

Implementation 1: Manual Inject/Extract

The explicit version makes the mechanism obvious and works with any broker. Inject at enqueue, extract at dequeue, and bracket each side in a span.

from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("jobs")

def enqueue(broker, payload: dict):
    # PRODUCER (span kind = PRODUCER marks the async send)
    with tracer.start_as_current_span("job.enqueue", kind=SpanKind.PRODUCER) as span:
        span.set_attribute("messaging.system", "celery")
        span.set_attribute("messaging.destination.name", "exports")  # current semconv key
        carrier = {}
        inject(carrier)               # serialise current context into the carrier
        payload["_otel"] = carrier    # ride along inside the message body
        broker.send(payload)

def on_message(payload: dict):
    # CONSUMER — rebuild the producer context as the parent of the execute span
    ctx = extract(payload.get("_otel", {}))
    with tracer.start_as_current_span("job.execute", context=ctx, kind=SpanKind.CONSUMER) as span:
        span.set_attribute("messaging.operation", "process")
        handle(payload)               # your job function; instrumented DB/HTTP calls inside become child spans

Setting SpanKind.PRODUCER on enqueue and SpanKind.CONSUMER on execute lets tracing backends render the async link correctly and compute the queue-wait gap between them — the visible space where your job sat idle in the broker.
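
Backends derive the queue wait from the gap between the two spans. If you also want the number as an explicit value on the worker side, one option is to stamp the enqueue time into the message and subtract at dequeue. This is a sketch rather than something the setup above requires: the _enqueued_at_ns field and both helper names are assumptions, and because producer and worker read different wall clocks the result is clamped at zero to absorb clock skew.

```python
import time
from typing import Optional


def stamp_enqueue(payload: dict, now_ns: Optional[int] = None) -> dict:
    """Return a copy of the payload carrying the producer's wall-clock time."""
    message = dict(payload)
    # _enqueued_at_ns is a hypothetical field name, not an OpenTelemetry convention.
    message["_enqueued_at_ns"] = time.time_ns() if now_ns is None else now_ns
    return message


def queue_wait_seconds(payload: dict, now_ns: Optional[int] = None) -> Optional[float]:
    """Seconds the message sat in the broker, or None if it was never stamped."""
    enqueued_at = payload.get("_enqueued_at_ns")
    if enqueued_at is None:
        return None
    now = time.time_ns() if now_ns is None else now_ns
    # Clamp at zero: clocks on two hosts can disagree by a few milliseconds.
    return max(0.0, (now - enqueued_at) / 1e9)
```

Inside on_message you could then record the value on the execute span with span.set_attribute when it is not None; any attribute name you choose for it, such as job.queue_wait_s, is likewise illustrative.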

Implementation 2: Auto-Instrumentation

For Celery, the OpenTelemetry instrumentation does inject/extract for you via the task signals, so you do not hand-edit every task. Enable it once per process, in the producer that publishes tasks as well as in each worker process, and existing tasks become traced.

# Auto-instrument Celery: hooks task signals to inject at send, extract at run
from opentelemetry.instrumentation.celery import CeleryInstrumentor
from celery.signals import worker_process_init

@worker_process_init.connect(weak=False)
def init_tracing(*args, **kwargs):
    CeleryInstrumentor().instrument()   # must run per worker process (prefork)

The same exists for many ecosystems — instrument the broker client and downstream libraries (database driver, HTTP client) so child spans attach automatically inside the execute span. The exact Celery signal wiring, including the prefork gotcha, is in Propagating trace context through Celery tasks.

Sampling: Keeping Cost Bounded

Tracing every job at full volume is expensive to export and store. Sampling decides which traces to keep. Head-based sampling decides at the root span using a fixed probability — cheap, but it may drop the rare error trace you actually needed. Tail-based sampling buffers complete traces in the OpenTelemetry Collector and decides after the fact, which lets you always keep traces containing an error or exceeding a latency threshold while sampling the boring successes at a low rate. For job queues, tail-based sampling with an error-and-slow policy is the high-value default.

# otel-collector tail-sampling: keep all errors + slow jobs, 5% of the rest
processors:
  tail_sampling:
    decision_wait: 10s   # counted from a trace's first span; raise it if queue wait + execution can exceed 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 5000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
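
The decision those three policies express can be sketched in plain Python. This is a simplified model for reasoning about which traces survive, not the Collector's implementation: the Span dataclass, the function name, and the bucket-by-trace-ID baseline are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    error: bool = False


def tail_decision(trace_id: int, spans: List[Span],
                  threshold_ms: int = 5000, baseline_pct: float = 5.0) -> str:
    """Return the name of the first policy that keeps the trace, or 'drop'."""
    if not spans:
        return "drop"
    # Policy 1: any span with an error status keeps the whole trace.
    if any(span.error for span in spans):
        return "errors"
    # Policy 2: trace duration, assumed here to run from the earliest start
    # to the latest end, so time spent waiting in the broker counts too.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration_ms >= threshold_ms:
        return "slow"
    # Policy 3: keep a fixed share of the rest, bucketed by trace ID so the
    # same trace always gets the same answer.
    if trace_id % 10_000 < round(baseline_pct * 100):
        return "baseline"
    return "drop"
```

Fed the export scenario from the problem framing (a 40 ms enqueue span, then an 8-second execute span starting 12 seconds later), the function returns "slow" because the whole trace covers about 20 seconds. That only holds if the buffer still has the trace open when the worker's spans arrive.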

Trade-off Analysis: Propagation and Backends

| Decision | Option A | Option B | Guidance |
|---|---|---|---|
| Context transport | In message body (_otel) | Broker headers | Headers if the broker supports them cleanly; body works everywhere |
| Instrumentation | Manual inject/extract | Auto-instrumentation | Auto for coverage; manual for custom attributes and odd brokers |
| Sampling | Head-based (at root) | Tail-based (Collector) | Tail-based: keep all errors and slow jobs, sample the rest |
| Backend | Jaeger | Grafana Tempo | Tempo if you live in Grafana; Jaeger for a standalone trace UI |

Both Jaeger and Tempo ingest OTLP, so the choice is operational, not protocol. Tempo stores traces cheaply in object storage and links directly from Grafana panels, which pairs naturally with the dashboards in Grafana Dashboards for Queues.
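
On the context-transport row: a worker can tolerate both placements, for example while migrating from body to headers, by checking headers first and falling back to the body. The sketch below assumes a hypothetical Message envelope standing in for whatever your broker client exposes; only the traceparent and tracestate key names come from W3C Trace Context.

```python
from dataclasses import dataclass, field

# Keys the W3C Trace Context propagator reads and writes.
_TRACE_KEYS = ("traceparent", "tracestate")


@dataclass
class Message:
    """Hypothetical broker envelope: a business body plus optional headers."""
    body: dict
    headers: dict = field(default_factory=dict)


def attach_context(msg: Message, carrier: dict, use_headers: bool) -> Message:
    """Return a new message with the carrier stored in headers or in the body."""
    if use_headers:
        return Message(body=dict(msg.body), headers={**msg.headers, **carrier})
    return Message(body={**msg.body, "_otel": dict(carrier)},
                   headers=dict(msg.headers))


def read_carrier(msg: Message) -> dict:
    """Prefer broker headers; fall back to the _otel field in the body."""
    from_headers = {k: msg.headers[k] for k in _TRACE_KEYS if k in msg.headers}
    if "traceparent" in from_headers:
        return from_headers
    return dict(msg.body.get("_otel", {}))
```

The dict returned by read_carrier is what you would hand to extract() on the worker. An empty dict simply yields a fresh root trace, so a message from an older producer still runs.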

Failure Modes & Recovery

Two disconnected traces instead of one. Context is not being propagated — inject ran but extract did not, or the propagator differs between services. Recovery: confirm both processes call set_global_textmap with the same propagator, and that _otel is actually present in the received payload.

Context lost in prefork Celery. CeleryInstrumentor().instrument() ran in the parent, not the forked children. Recovery: instrument inside worker_process_init so each child process is wired (detailed in the Celery propagation guide).

Trace volume overwhelms the backend. Full sampling at high throughput. Recovery: move to tail-based sampling that keeps errors and slow traces while sampling successes at a low percentage.

No child spans inside execute. Downstream libraries (DB driver, HTTP client) are not instrumented, so the execute span is opaque. Recovery: enable the relevant auto-instrumentations so DB and HTTP calls attach as child spans — without them the trace shows duration but not where it went.

Performance Tuning

The dominant costs are span export and storage, both governed by sampling and batching. Use BatchSpanProcessor (not the simple processor) so spans export in batches rather than one network call per span. Tune the tail-sampling decision_wait to comfortably exceed your p99 trace duration, which for a queued job includes the time spent waiting in the broker, so complete traces are buffered before the keep/drop decision. Keep span attributes lean: a job ID and queue name are useful; dumping the whole payload bloats every trace. Run the Collector as a gateway so individual workers do not each hold sampling buffers, and make sure every span of a trace reaches the same Collector instance, since tail sampling can only judge the spans it has seen. Finally, have the worker write the propagated trace_id into its structured logs so a metric alert links straight to the relevant trace and its log lines, closing the loop across all three telemetry types.
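
One way to put the propagated trace ID on every log line a job emits is to bind it to a logger when the message is received. The sketch below uses only the standard library and reads the ID straight from the traceparent value in the _otel carrier; with the OpenTelemetry API available you would more commonly read it from the active span. The formatter, the helper names, and the job_id field are assumptions for illustration.

```python
import json
import logging


def trace_id_from_carrier(carrier: dict) -> str:
    """Pull the 32-hex trace ID out of a traceparent value, or '' if absent."""
    parts = carrier.get("traceparent", "").split("-")
    return parts[1] if len(parts) == 4 and len(parts[1]) == 32 else ""


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, including the bound trace fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", ""),
            "job_id": getattr(record, "job_id", ""),
        })


def job_logger(base: logging.Logger, payload: dict) -> logging.LoggerAdapter:
    """Return a logger that stamps this job's trace ID on every record."""
    extra = {
        "trace_id": trace_id_from_carrier(payload.get("_otel", {})),
        "job_id": str(payload.get("job_id", "")),
    }
    return logging.LoggerAdapter(base, extra)
```

Creating the adapter at the top of the job handler and passing it down means each log line can be searched by the same trace_id the tracing backend shows, without adding the payload itself to spans or logs.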

FAQ

How does a trace survive crossing the queue when the request has already returned? By carrying the trace context inside the message itself. At enqueue you inject() the active context — the traceparent with its trace ID and parent span ID — into the payload; at dequeue the worker extract()s it and starts the execute span with that context as its parent. The execute span therefore shares the trace ID of the original request, so the backend stitches them into one trace even though the producer finished seconds earlier.

Should I propagate context in the message body or in broker headers? Broker headers are cleaner when the broker exposes them as a first-class, queryable concept (RabbitMQ message headers, for instance), because the trace context stays out of your business payload. Putting it in the body — a small _otel field — works with every broker uniformly, which is why it is the safe default. Either way you are moving the same handful of traceparent bytes; pick the one that fits your broker.

Won't tracing every job be too expensive? Tracing every job at full retention is expensive, which is what sampling is for. Tail-based sampling in the Collector is the queue-friendly choice: it buffers complete traces and keeps the ones that matter — anything with an error or above a latency threshold — while retaining only a small percentage of routine successes. You get full fidelity on the traces you would actually open during an incident and pay little for the rest.

Jaeger or Tempo for storing traces? Both speak OTLP, so it is an operational choice rather than a compatibility one. Choose Grafana Tempo if your team already lives in Grafana — it stores traces cheaply in object storage and links directly from dashboard panels, so a latency spike on a graph is one click from the trace. Choose Jaeger if you want a dedicated, standalone trace-exploration UI independent of Grafana.

Propagating trace context through Celery tasks — the exact Celery signal wiring and the prefork gotcha.
Observability & Monitoring for Job Queues — where tracing fits alongside metrics and logs.
Grafana Dashboards for Queues — linking from a latency panel straight to the trace.
Prometheus Metrics for Workers — the metrics that tell you when to go pull a trace.
Celery Architecture & Configuration — the task lifecycle the spans bracket.