Handling Async Boundaries in Node.js and Python

Problem Framing

An order service finishes processing a request and hands work off to an async queue consumer. In Jaeger, the resulting spans appear as two disconnected root traces rather than a single waterfall. The parent_id field on the consumer spans is empty. The application code looks correct — the OpenTelemetry SDK is initialised, headers are present in the message, but the consumer span has no parent. This is an async boundary context-loss failure, one of the most common sources of broken traces in production Node.js and Python services. The active span context lives in execution-local storage that is not automatically transferred when execution shifts to a new event-loop iteration, a thread-pool worker, or a message-queue consumer loop.


Prerequisites

Before working through the implementation steps below, ensure the following are in place:

  • Node.js 16.4+ (for stable AsyncLocalStorage); Node.js 18.16+ if you use the static AsyncLocalStorage.bind()
  • Python 3.7+ with contextvars available (standard library)
  • @opentelemetry/api ≥ 1.4 and a matching @opentelemetry/sdk-node release for Node.js (sdk-node follows the experimental 0.x version line)
  • opentelemetry-api ≥ 1.4 and opentelemetry-sdk ≥ 1.4 for Python
  • A working OpenTelemetry SDK setup for backend services that initialises before any HTTP server or consumer starts
  • OTEL_PROPAGATORS=tracecontext,baggage set on every service in the call chain
  • Familiarity with SDK Implementation & Context Propagation concepts, in particular the inject/extract lifecycle

Concept Deep-Dive: How Async Execution Breaks Context

W3C Trace Context defines how span context travels between processes; inside a process, OpenTelemetry relies on the active context following the flow of execution. When execution is synchronous, this works implicitly. The moment execution crosses an async boundary (a setTimeout, a Promise continuation on a later microtask tick, an asyncio.create_task(), or a thread-pool submission), the runtime creates a new execution context. Whether that new context inherits the parent’s tracing data depends entirely on how the language’s concurrency primitive copies (or fails to copy) thread-local or context-local storage.

The diagram below shows the two diverging paths: a propagated context (spans linked) vs a lost context (orphaned root span).

[Figure: Async Boundary: Context Preserved vs Context Lost. Left path (context preserved): the request handler starts span A, captures ctx = context.active() before the async boundary (setTimeout / create_task / executor), and re-enters it with asyncLocalStorage.run(ctx, …); the consumer task creates span B with parent = span A, and Jaeger shows one linked trace waterfall. Right path (context lost): the context is not captured before the boundary, the new execution context has no parent reference, span B has parent = null, and Jaeger shows two disconnected root spans.]

Node.js: AsyncLocalStorage mechanics

AsyncLocalStorage (ALS) is built on the async_hooks API. When asyncLocalStorage.run(store, callback) is called, Node.js creates an async context that all child async operations — promises, timers, I/O callbacks — inherit automatically, as long as they are initiated within that run call. The critical failure mode is creating an async operation before entering run(), or storing a reference to context.active() but never passing it through asyncLocalStorage.run() when the deferred work executes.

Python: contextvars mechanics

Python’s contextvars.ContextVar stores data per execution context. When asyncio.create_task() spawns a coroutine, Python copies a snapshot of the current context at that exact moment, so changes made inside the coroutine do not affect the parent. Crucially, if the parent context has no active OpenTelemetry span yet, the snapshot contains no span either. ThreadPoolExecutor submissions receive no context copy at all by default, which is why the worker must be entered through a snapshot: take ctx = contextvars.copy_context() in the submitting thread and submit ctx.run together with the function and its arguments, for example executor.submit(ctx.run, fn, *args).
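The snapshot behaviour described above can be observed with the standard library alone. The sketch below is an illustration, not OpenTelemetry code: current_span is a hypothetical ContextVar standing in for the variable that holds the active span, and the helper names are made up for the demo.

```python
import asyncio
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the ContextVar that holds the active span.
current_span = contextvars.ContextVar("current_span", default=None)


def read_span():
    """Return whatever span name is visible in the calling context."""
    return current_span.get()


async def read_span_async():
    return current_span.get()


async def demo():
    results = {}

    # Task created BEFORE the span is set: its snapshot has no span.
    early = asyncio.create_task(read_span_async())
    current_span.set("span-A")
    # Task created AFTER the span is set: its snapshot includes span-A.
    late = asyncio.create_task(read_span_async())
    results["task_before_set"] = await early
    results["task_after_set"] = await late

    with ThreadPoolExecutor(max_workers=1) as pool:
        # Plain submit: on standard CPython builds the worker thread
        # starts with an empty context.
        results["thread_plain"] = pool.submit(read_span).result()
        # Submitting ctx.run carries the snapshot into the worker thread.
        ctx = contextvars.copy_context()
        results["thread_with_ctx"] = pool.submit(ctx.run, read_span).result()
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The task created before current_span.set() never sees the span, which is the same ordering problem that produces an orphaned consumer span.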


Step-by-Step Implementation

Step 1 — Node.js: Configure AsyncLocalStorage for the request lifecycle

Seed AsyncLocalStorage with the active OTel context at the start of every inbound request. This one wrapper ensures all middleware, route handlers, and deferred operations within the same request inherit the same context automatically.

const express = require('express');
const { AsyncLocalStorage } = require('async_hooks');
const { trace, context } = require('@opentelemetry/api');

const app = express();
const als = new AsyncLocalStorage();

// Express middleware — runs before all route handlers
app.use((req, res, next) => {
  const activeCtx = context.active(); // capture the propagated OTel context
  als.run(activeCtx, () => {
    // every async operation started inside next() inherits activeCtx
    next();
  });
});

// Any code anywhere in the request cycle can now read the context:
function getCurrentTraceId() {
  const ctx = als.getStore() ?? context.active();
  return trace.getSpan(ctx)?.spanContext().traceId;
}

Step 2 — Node.js: Restore context for worker threads and message queue consumers

Long-lived consumers — Kafka, RabbitMQ, BullMQ — run outside any request’s als.run() scope. Extract the W3C TraceContext headers from the message payload and re-enter als.run() with the restored context before creating any spans.

const { trace, propagation, context, SpanStatusCode } = require('@opentelemetry/api');
const { AsyncLocalStorage } = require('async_hooks');

const als = new AsyncLocalStorage();

async function processKafkaMessage(message) {
  // message.headers carries traceparent / tracestate / baggage from the producer
  const carrier = Object.fromEntries(
    Object.entries(message.headers ?? {}).map(([k, v]) => [k, v.toString()])
  );
  const extractedCtx = propagation.extract(context.active(), carrier);

  await als.run(extractedCtx, async () => {
    const tracer = trace.getTracer('kafka-consumer', '1.0.0');
    const span = tracer.startSpan('kafka.message.process', {}, extractedCtx);
    const spanCtx = trace.setSpan(extractedCtx, span);

    try {
      await context.with(spanCtx, () => handleBusinessLogic(message));
      span.setStatus({ code: SpanStatusCode.OK }); // numeric value 1; 0 is UNSET
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

Step 3 — Node.js: Bind deferred callbacks explicitly

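Timers and promises created inside als.run() inherit the store automatically, even though they fire after run() has returned. Context is lost when a callback is created in one scope but invoked from another: a job queue drained by a shared loop, an EventEmitter listener registered at startup, a connection-pool callback. Use the static AsyncLocalStorage.bind() to capture the current store at the moment the callback is created.

const { AsyncLocalStorage } = require('async_hooks');
const { context } = require('@opentelemetry/api');

const als = new AsyncLocalStorage();
const pending = []; // shared queue, drained outside any request scope

als.run(context.active(), () => {
  // OK: a timer created inside run() inherits the store automatically
  setTimeout(() => {
    console.log(als.getStore()); // active context
  }, 100);

  // WRONG: the callback is only queued here; whoever invokes it later
  // decides which store is active
  pending.push(() => {
    console.log(als.getStore()); // undefined: context lost
  });

  // CORRECT: bind the callback to the current store when it is created
  pending.push(
    AsyncLocalStorage.bind(() => {
      console.log(als.getStore()); // active context: propagated correctly
    })
  );
});

// Runs outside als.run(), e.g. from a shared worker loop
setTimeout(() => pending.forEach((job) => job()), 200);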

Step 4 — Python: Propagate context across asyncio tasks

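Before calling asyncio.create_task(), capture context.get_current(). Pass it into the coroutine as a parameter and call context.attach() as the first action inside the task. create_task() already copies the caller's context, so this explicit hand-off matters most when the task is created somewhere other than where the span is active (a shared scheduler, or a task created before the span started); it is harmless otherwise. Always pair every attach() with a detach() in a finally block to avoid context token leaks.

import asyncio
from opentelemetry import trace, context as otel_context
from opentelemetry.context import Context

async def process_item(item: dict, parent_ctx: Context) -> None: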
    # Restore the parent OTel context in this new asyncio task
    token = otel_context.attach(parent_ctx)
    try:
        tracer = trace.get_tracer("worker", "1.0.0")
        with tracer.start_as_current_span("worker.process_item") as span:
            span.set_attribute("item.id", item["id"])
            await asyncio.sleep(0)  # simulate async I/O
    finally:
        otel_context.detach(token)  # always detach to prevent token leakage

async def request_handler(items: list) -> None:
    # Capture context BEFORE creating tasks — the snapshot at this point
    # includes any span that was started by the inbound request middleware.
    current_ctx = otel_context.get_current()

    tasks = [
        asyncio.create_task(process_item(item, current_ctx))
        for item in items
    ]
    await asyncio.gather(*tasks)
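As an alternative to passing the context as a parameter, the task itself can be started from a saved snapshot. The following is a minimal stdlib-only sketch, assuming a hypothetical request_id variable in place of the OpenTelemetry context (which lives in a ContextVar by default); spawn_in_context is an illustrative helper, not part of any library.

```python
import asyncio
import contextvars
import sys

# Hypothetical variable standing in for the active trace context.
request_id = contextvars.ContextVar("request_id", default="unset")


async def worker():
    return request_id.get()


def spawn_in_context(coro, ctx):
    """Create a task whose context is derived from ctx, not the caller's."""
    if sys.version_info >= (3, 11):
        # Python 3.11+: the task runs directly in ctx.
        return asyncio.create_task(coro, context=ctx)
    # Older versions: create_task copies whatever context is current,
    # so make ctx current for the duration of the call.
    return ctx.run(asyncio.create_task, coro)


async def main():
    request_id.set("req-1")
    captured = contextvars.copy_context()   # snapshot holding "req-1"
    request_id.set("req-2")                 # the caller moves on
    return await spawn_in_context(worker(), captured)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The worker reads the value from the captured snapshot rather than the caller's later state, so no attach() and detach() pair is needed inside the coroutine.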

Step 5 — Python: Copy context into ThreadPoolExecutor workers

ThreadPoolExecutor workers receive no context copy by default. Wrap the submitted callable with contextvars.copy_context().run() to carry the full context snapshot — including the active span — into the worker thread.

import asyncio
import contextvars
from concurrent.futures import ThreadPoolExecutor
from opentelemetry import trace, context as otel_context

def cpu_bound_task(data: bytes, ctx_snapshot: contextvars.Context) -> bytes:
    # ctx_snapshot.run() sets all ContextVars from the snapshot for this thread
    def _inner():
        token = otel_context.attach(otel_context.get_current())
        try:
            tracer = trace.get_tracer("cpu-worker")
            with tracer.start_as_current_span("cpu.task"):
                return process_data(data)  # actual CPU work
        finally:
            otel_context.detach(token)

    return ctx_snapshot.run(_inner)

async def handle_upload(data: bytes) -> bytes:
    ctx_snapshot = contextvars.copy_context()  # copy BEFORE executor.submit
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(
            pool,
            cpu_bound_task,
            data,
            ctx_snapshot,
        )
    return result
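When many call sites submit work to an executor, the snapshot step can be folded into one helper. This is a sketch under stated assumptions: submit_with_context, whoami, and the tenant variable are hypothetical names, and only the standard library is used.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor

tenant = contextvars.ContextVar("tenant", default=None)  # hypothetical variable


def submit_with_context(pool, fn, *args, **kwargs) -> Future:
    """Submit fn so that it runs inside a snapshot of the caller's context."""
    ctx = contextvars.copy_context()          # taken in the submitting thread
    return pool.submit(ctx.run, fn, *args, **kwargs)


def whoami(suffix: str) -> str:
    return f"{tenant.get()}{suffix}"


if __name__ == "__main__":
    tenant.set("acme")
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Plain submit: the worker does not receive the caller's value.
        print(pool.submit(whoami, "!").result())
        # Helper: the worker runs inside the caller's snapshot.
        print(submit_with_context(pool, whoami, "!").result())
```

A fresh snapshot is taken per submission because a single Context object cannot be entered by two threads at the same time.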

Step 6 — Python: Serialize context for subprocess and multiprocessing boundaries

When spawning a child process with subprocess or a spawn-based multiprocessing start method, the parent's contextvars do not exist in the new interpreter. Encode the active W3C TraceContext headers into environment variables and read them back in the child.

import os
import subprocess
import sys
from opentelemetry.propagate import inject

def spawn_child_worker(payload_path: str) -> None:
    headers: dict = {}
    inject(headers)          # encodes traceparent, tracestate, baggage

    env = os.environ.copy()
    env.update(headers)      # child reads these as env vars

    subprocess.run(
        [sys.executable, "child_worker.py", payload_path],
        env=env,
        check=True,
    )

# --- In child_worker.py ---
import os
from opentelemetry.propagate import extract
from opentelemetry import context as otel_context

PROPAGATION_KEYS = ("traceparent", "tracestate", "baggage")

def restore_parent_context() -> None:
    # Re-assemble a carrier dict from environment variables. Keys are
    # lower-cased because some platforms upper-case env var names, and
    # baggage is included so it is not dropped at the process boundary.
    carrier = {
        k.lower(): v for k, v in os.environ.items()
        if k.lower() in PROPAGATION_KEYS
    }
    ctx = extract(carrier)
    # Attached for the lifetime of the child process, so no detach is needed.
    otel_context.attach(ctx)
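Before wiring the environment hand-off into a real child process, it can help to check what the child would actually see. The sketch below uses only the standard library; carrier_from_env and parse_traceparent are hypothetical helpers, and the parser is a simplified validator for the version 00 traceparent layout rather than a replacement for the SDK's propagator.

```python
import re

# W3C traceparent: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)
_CARRIER_KEYS = ("traceparent", "tracestate", "baggage")


def carrier_from_env(environ):
    """Collect propagation headers from an environment mapping.

    Keys are matched case-insensitively because some platforms
    upper-case environment variable names.
    """
    return {k.lower(): v for k, v in environ.items() if k.lower() in _CARRIER_KEYS}


def parse_traceparent(value):
    """Return (trace_id, parent_span_id, sampled) or None if malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the Trace Context spec.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)
```

Calling parse_traceparent(carrier_from_env(os.environ).get("traceparent", "")) at the top of the child gives a quick yes-or-no answer on whether the parent's context arrived intact.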

Verification

After deploying the fixes, verify that context propagation is intact across every boundary you modified.

Query in Jaeger UI: Open the trace for a request that crosses the async boundary. Every span except the root should show a populated parentSpanId. If any span appears as a root when it should be a child, the context was not restored at that boundary: check the extract() or attach() call there.

CLI verification with console output: export spans to stdout and inspect them:

# Set the exporter to console output temporarily
OTEL_TRACES_EXPORTER=console node server.js

Look for the parent field on every consumer span. Its name depends on the exporter: OTLP JSON uses parentSpanId, while the console exporters print their own format (for example parentId or parent_id). An empty, null, or undefined value indicates context loss.

Python context introspection — add this at any async boundary to confirm the context is active:

import contextvars
from opentelemetry import trace

def debug_context(label: str) -> None:
    ctx_items = list(contextvars.copy_context().items())
    span = trace.get_current_span()
    span_ctx = span.get_span_context()
    # trace_id is a 128-bit integer: print it as 32 hex digits, as Jaeger does
    print(f"[{label}] trace_id={span_ctx.trace_id:032x} span_valid={span_ctx.is_valid}")
    print(f"[{label}] context vars: {[k.name for k, _ in ctx_items]}")

Expected trace output for a healthy async boundary crossing:

{
  "name": "kafka.message.process",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "parentSpanId": "a3ce929d0e0e4736",
  "startTimeUnixNano": "1702300800000000000",
  "endTimeUnixNano": "1702300800045000000",
  "status": { "code": "STATUS_CODE_OK" }
}
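Checking exported spans by eye does not scale past a handful of traces. The sketch below scans a batch of span records shaped like the JSON above and flags spans that are roots but should not be. The function name, the allowed_root_names parameter, and the assumption that each record carries name, traceId, spanId, and an optional parentSpanId are all illustrative.

```python
from typing import Dict, Iterable, List, Set


def unexpected_roots(spans: Iterable[Dict[str, str]],
                     allowed_root_names: Set[str]) -> List[str]:
    """Return span IDs of root spans whose name should never be a root.

    A span counts as a root when parentSpanId is missing or empty, or
    when it points at a span that is not part of the same trace in
    this batch.
    """
    spans = list(spans)
    known = {(s["traceId"], s["spanId"]) for s in spans}
    flagged = []
    for s in spans:
        parent = s.get("parentSpanId") or ""
        is_root = not parent or (s["traceId"], parent) not in known
        if is_root and s["name"] not in allowed_root_names:
            flagged.append(s["spanId"])
    return flagged
```

For the order-service scenario, passing the inbound HTTP span names as allowed_root_names would flag any kafka.message.process span that started its own trace.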

Edge Cases and Gotchas

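  1. Unawaited promises in Node.js. A fire-and-forget call such as someAsyncFn() still inherits the store that was active when it was invoked, so its spans keep their parent. What it escapes is the caller’s lifetime: the parent span may end, and the response may be sent, before the child span finishes, and rejections go unobserved. Either await the call or keep the parent span open until the detached work settles. Context is lost only when the work is handed to code that runs outside the scope, which is the case AsyncLocalStorage.bind() covers.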

  2. Python asyncio.ensure_future(). When passed a coroutine, ensure_future() wraps it in a task via loop.create_task(), so it snapshots context at call time in the same way. Capture context.get_current() before calling it, not inside the coroutine.

  3. Django ORM in async views. Django’s ORM is synchronous. When called from an async view using sync_to_async, the sync wrapper runs in a thread pool. Apply the contextvars.copy_context().run() pattern inside the sync_to_async wrapper, or rely on asgiref ≥ 3.7 which copies context automatically. Pin asgiref to avoid silent regressions on downgrade.

  4. Starlette/FastAPI background tasks. BackgroundTasks.add_task() captures no context. Wrap the submitted callable with a closure that calls context.attach(parent_ctx) before the real work begins.

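  5. Context token double-detach. Each call to context.attach() in Python returns a unique token. Detaching the same token twice fails: the underlying ContextVar.reset() raises RuntimeError for a reused token, which the OpenTelemetry API catches and reports as a "Failed to detach context" error log. Use a single try / finally block per attach() call, and never store tokens in a list and detach them in bulk.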

  6. Long-lived AsyncLocalStorage stores. Storing large request objects (full req, full parsed body) inside ALS causes the entire object to be retained for the lifetime of all child async operations. Store only lightweight identifiers — traceId, spanId, userId — to avoid GC pressure.

  7. Promise.all with mixed contexts. When parallel promises inside Promise.all were each started in different als.run() scopes, their context references diverge. Ensure all parallel work is initiated inside a single als.run() block so they share the same store.

  8. gRPC streaming in Node.js. gRPC streaming calls persist across multiple event loop ticks. The initial als.run() scope from the first message frame will have exited by the time subsequent frames arrive. Extract and re-attach context for each individual streaming message, just as with Kafka.
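The token rules behind gotcha 5 can be reproduced at the contextvars level, which is where OpenTelemetry's Python context is stored by default. This is a stdlib-only sketch with a hypothetical variable named active; ContextVar.set() and ContextVar.reset() play the roles of attach() and detach().

```python
import contextvars

active = contextvars.ContextVar("active", default="root")  # hypothetical variable


def nested_attach():
    """Set twice, reset in reverse order, and record what was visible."""
    seen = []
    outer = active.set("outer")
    try:
        inner = active.set("inner")
        try:
            seen.append(active.get())      # innermost value
        finally:
            active.reset(inner)            # restores "outer"
        seen.append(active.get())
    finally:
        active.reset(outer)                # restores the default
    seen.append(active.get())
    return seen


def double_reset_fails():
    """Resetting the same token twice is rejected by contextvars."""
    token = active.set("once")
    active.reset(token)
    try:
        active.reset(token)
    except RuntimeError:
        return True
    return False
```

One try / finally per set() keeps the resets in strict last-in, first-out order, which is the property the bulk-detach anti-pattern breaks.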


Performance and Scale Notes

ALS memory overhead. Each active als.run() scope retains a reference to the stored object. In high-throughput Node.js services handling 10 000 req/s, keeping only primitive identifiers (strings, numbers) in the store rather than full request objects eliminates a significant class of GC pause. Benchmark with --expose-gc and call global.gc() between load-test iterations to measure retained heap.

Python context snapshot cost. contextvars.copy_context() does not copy entries one by one: a Context is backed by an immutable mapping (a hash array mapped trie, per PEP 567), so taking a snapshot is an O(1) operation regardless of how many ContextVar entries are set. The cost is paid once per task submission and is normally negligible next to the work being submitted. It is still worth auditing active ContextVar instances with contextvars.copy_context().items(), because every snapshot keeps the values it references alive until the task finishes.
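To measure the snapshot cost on your own interpreter rather than rely on a rule of thumb, a small timing harness is enough. The sketch below is illustrative: time_copy_context is a hypothetical helper, and the variables it creates exist only for the measurement.

```python
import contextvars
import timeit


def time_copy_context(num_vars: int, repeats: int = 10_000) -> float:
    """Return the mean seconds per copy_context() call with num_vars set."""
    def measure():
        # Keep references so the variables stay alive during the measurement.
        # Creating ContextVars in a function is acceptable for a benchmark,
        # but application code should declare them at module level.
        variables = [contextvars.ContextVar(f"var_{i}") for i in range(num_vars)]
        for i, var in enumerate(variables):
            var.set(i)
        return timeit.timeit(contextvars.copy_context, number=repeats) / repeats

    # Run inside a fresh, empty context so the test values do not leak out.
    return contextvars.Context().run(measure)


if __name__ == "__main__":
    for n in (1, 20, 200):
        print(f"{n:>4} vars: {time_copy_context(n) * 1e9:.0f} ns per copy")
```

Comparing the figures for a small and a large number of variables shows directly whether the snapshot cost depends on context size in your environment.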

Batch processor tuning. When spans from async workers are exported in batch, the BatchSpanProcessor defaults (maxQueueSize=2048, scheduledDelayMillis=5000) may cause delayed export for high-volume consumers. Tune for lower latency at the cost of slightly higher CPU:

# OTEL_BSP_* environment variables (applies to both Node.js and Python)
OTEL_BSP_MAX_QUEUE_SIZE=4096
OTEL_BSP_SCHEDULE_DELAY=1000
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512

Sampling interaction. When using head-based sampling, the sampling decision is made at the root span. All child spans, including those created across async boundaries, inherit the decision via the sampled bit in TraceFlags. Set OTEL_TRACES_SAMPLER=parentbased_traceidratio (with the ratio in OTEL_TRACES_SAMPLER_ARG) so consumers honour the producer’s decision. A consumer that re-samples independently does not change any IDs, but it drops spans the producer kept, which leaves holes in the trace that look like broken parent-child links.


Troubleshooting FAQ

Q: Why do spans lose their parent_id across an asyncio.create_task() call?

asyncio.create_task() snapshots the current context at call time. If the parent span was started after the create_task() call, or if context.attach() was called after the snapshot was taken, the task’s copy of the context does not include the parent span. Capture context.get_current() before the create_task() call, then call context.attach(parent_ctx) as the first line inside the coroutine.

Q: My Express middleware uses AsyncLocalStorage but context is missing inside a setTimeout callback. Why?

A timer created inside asyncLocalStorage.run() inherits the store even though it fires after run() has returned, so the usual cause is that the timer was created outside the scope: at module load, in a shared scheduler, or before the middleware ran. The same applies to EventEmitter listeners and pooled callbacks registered at startup. Create the timer inside the scope, call als.run(store, callback) at the point where the callback is invoked, or wrap the callback with AsyncLocalStorage.bind(callback) while the store is still active.

Q: How do I propagate trace context through a Python ThreadPoolExecutor?

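ThreadPoolExecutor workers receive no context by default. Take a snapshot with ctx = contextvars.copy_context() in the submitting thread and submit ctx.run as the callable, for example executor.submit(ctx.run, fn, *args). Calling ctx.run(fn, *args) directly would execute fn immediately in the current thread. The snapshot, including the active OTel span, becomes the worker thread’s context for the duration of the call. See Step 5 above for the complete pattern.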

Q: What is the safest way to pass baggage across a Kafka consumer boundary in Node.js?

Inject both trace context and baggage into the Kafka message headers at produce time using propagation.inject(context.active(), headers). At consume time, call propagation.extract(context.active(), headers) before creating any spans, then run all span work inside als.run(extractedCtx, ...). This ensures the full context — trace headers and baggage — is restored for the entire consumer processing chain.

Q: After fixing context propagation, some spans still appear as roots in Jaeger. What should I check?

Verify that OTEL_PROPAGATORS is set identically on every service — a mismatch between a producer emitting W3C traceparent and a consumer configured for B3 headers causes extract() to return an empty context, making the new span a root. Also confirm the SDK is initialised before any HTTP server or queue consumer binds; spans created before SDK init cannot attach to a propagated context.
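A propagator mismatch can be detected from the message headers themselves. The sketch below is a hypothetical diagnostic, not part of OpenTelemetry: it maps well-known header names to the corresponding OTEL_PROPAGATORS values and reports formats that are present on the wire but not configured on the consumer.

```python
from typing import Mapping, Set


def detect_propagation_formats(headers: Mapping[str, str]) -> Set[str]:
    """Return the propagation formats whose header names appear in headers."""
    names = {name.lower() for name in headers}
    formats = set()
    if "traceparent" in names:
        formats.add("tracecontext")
    if "baggage" in names:
        formats.add("baggage")
    if "b3" in names:
        formats.add("b3")                      # single-header B3
    if {"x-b3-traceid", "x-b3-spanid"} <= names:
        formats.add("b3multi")                 # multi-header B3
    return formats


def unreadable_formats(headers: Mapping[str, str], otel_propagators: str) -> Set[str]:
    """Formats present in the message that the consumer is not configured to read."""
    configured = {p.strip().lower() for p in otel_propagators.split(",") if p.strip()}
    return detect_propagation_formats(headers) - configured
```

Logging unreadable_formats(message_headers, os.environ.get("OTEL_PROPAGATORS", "")) once per consumer start-up is usually enough to expose a producer and consumer that disagree on the wire format.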

