Auto-Instrumentation vs Manual Span Creation

Problem Framing

An OpenTelemetry deployment is live, the exporter reaches the backend, and the dashboard shows spans — yet incident investigations repeatedly dead-end. HTTP entry spans exist but the database call three hops in carries no db.statement attribute. A payment workflow generates a root span with no children even though the processor ran. A Celery worker’s spans appear disconnected from the HTTP request that triggered them. The symptoms share one cause: the gap between what framework hooks observe automatically and what custom business logic actually executes. Choosing incorrectly between auto-instrumentation and manual span creation — or failing to connect the two — leaves exactly those gaps that matter most during an outage.

Prerequisites

OpenTelemetry SDK Setup for Backend Services completed: TracerProvider, resource attributes, and exporter configured.
OpenTelemetry SDK version 1.20+ (Python) or @opentelemetry/sdk-node 0.45+ (Node.js).
A running Jaeger or Tempo backend, or an OTLP-compatible collector endpoint.
Familiarity with span lifecycle and parent-child relationships — specifically how context objects scope parent references.

Concept Deep-Dive: The Span Lifecycle and Context Slot

Every span in OpenTelemetry follows a state machine: Started → Active → Ended → Exported. A span becomes “active” by being attached to the current context slot — a thread-local or async-local storage key that descendant code reads when it needs a parent reference. Auto-instrumentation hooks operate at framework entry points (HTTP handler dispatch, gRPC method execution, database driver calls) and manage this context slot automatically. Manual span creation reads the same slot to establish the parent, then writes a new span into it for the duration of its with block or explicit lifecycle.

The diagram below shows how the context slot connects both approaches within a single request.

The key insight: both approaches write to and read from the same context slot. Auto-instrumentation owns the entry and framework spans; manual spans attach as children by reading that inherited context. Disconnect them — for example by crossing a thread boundary without propagating the context — and child spans become orphaned root spans with new, unrelated trace IDs.

Step-by-Step Implementation

Step 1 — Attach the Auto-Instrumentation Agent

Python uses the opentelemetry-instrument entry-point wrapper. Install the SDK plus any framework-specific packages:

pip install opentelemetry-sdk \
            opentelemetry-exporter-otlp-proto-grpc \
            opentelemetry-instrumentation-fastapi \
            opentelemetry-instrumentation-sqlalchemy \
            opentelemetry-instrumentation-httpx

Launch with the agent:

OTEL_SERVICE_NAME=checkout-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
OTEL_PROPAGATORS=tracecontext,baggage \
opentelemetry-instrument uvicorn app.main:app --host 0.0.0.0 --port 8000

Node.js initialises the SDK programmatically before any require/import of application code:

// tracing.js — must be required FIRST via --require flag
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-proto');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

process.on('SIGTERM', () => sdk.shutdown());

node --require ./tracing.js server.js

Step 2 — Verify Baseline Coverage Before Adding Manual Spans

Query Jaeger or Tempo to confirm framework spans arrive with the correct structure:

# Jaeger HTTP API: find recent traces for the service
curl -s "http://localhost:16686/api/traces?service=checkout-service&limit=5" \
  | jq '.data[0].spans[] | {operationName, tags: (.tags | map(select(.key=="span.kind")))}'

Expected output confirms HTTP server spans with span.kind: server:

{ "operationName": "POST /checkout", "tags": [{ "key": "span.kind", "value": "server" }] }
{ "operationName": "SELECT", "tags": [{ "key": "span.kind", "value": "client" }] }

If no spans appear at all, check that OTEL_EXPORTER_OTLP_ENDPOINT resolves and the collector is listening. If spans appear but lack database children, verify that the SQLAlchemy or pg instrumentation package is installed and has not been switched off (in Python, via OTEL_PYTHON_DISABLED_INSTRUMENTATIONS). If the HTTP server span itself is missing for certain routes, check whether they match OTEL_PYTHON_EXCLUDED_URLS.

Step 3 — Add Manual Spans for Business Logic

Obtain a tracer from the global provider. The tracer name should identify the module, not the service (the service name is already in the resource).

Python — using context manager (preferred for synchronous code):

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer(__name__)

# StripeError and _call_stripe_api are application-specific names assumed to be defined elsewhere.

def process_payment(order_id: str, amount: float) -> dict:
    # start_as_current_span writes the new span into the context slot.
    # Automatic exception handling is switched off here because the except
    # branch records the exception itself; with both active, the same
    # exception would be recorded twice when it propagates out of the block.
    with tracer.start_as_current_span(
        "process_payment",
        record_exception=False,
        set_status_on_exception=False,
    ) as span:
        # Attach business attributes using semantic conventions where applicable
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_usd", amount)
        span.set_attribute("payment.processor", "stripe")

        try:
            result = _call_stripe_api(order_id, amount)
            span.set_attribute("payment.transaction_id", result["id"])
            # Set OK last: once OK is set, a later ERROR status is ignored
            span.set_status(StatusCode.OK)
            return result
        except StripeError as exc:
            # record_exception captures stack trace; set_status marks the span ERROR
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR, description=str(exc))
            raise

Python — explicit handle required for async hand-off across boundaries:

from opentelemetry import context as context_api

async def enqueue_fulfilment(order_id: str) -> None:
    # Capture the span explicitly so it can be ended after the await
    span = tracer.start_span("enqueue_fulfilment")
    # set_span_in_context returns a Context in which this span is active;
    # attach() makes that Context current and returns a token for detach()
    ctx = trace.set_span_in_context(span)
    token = context_api.attach(ctx)
    try:
        # message_queue is an application-specific client assumed to exist
        await message_queue.publish("fulfilment", {"order_id": order_id})
        span.set_status(StatusCode.OK)
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(StatusCode.ERROR)
        raise
    finally:
        context_api.detach(token)
        span.end()   # explicit end: no context manager calls it here

Node.js — startActiveSpan callback sets the span as active for the duration:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('checkout-module', '1.0.0');

async function processPayment(orderId, amount) {
  // startActiveSpan makes the span active in AsyncLocalStorage for any
  // instrumentation that runs inside the callback
  return tracer.startActiveSpan('process_payment', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'payment.amount_usd': amount,
      'payment.processor': 'stripe',
    });
    try {
      const result = await callStripeApi(orderId, amount);
      span.setAttribute('payment.transaction_id', result.id);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();   // always end — even on error paths
    }
  });
}

Step 4 — Bridge Async Boundaries

Thread pools and message consumers break the context slot. Propagate context explicitly.

Python thread pool — copy the active context before submitting:

import contextvars
from concurrent.futures import ThreadPoolExecutor

def submit_with_context(executor: ThreadPoolExecutor, fn, *args):
    # Snapshot the caller's context, which holds the active span;
    # ctx.run executes fn inside that snapshot on the worker thread
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)
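The difference between a plain submit and a context-carrying submit can be observed without the SDK. The sketch below assumes a standard CPython build and uses a hypothetical active_span variable in place of the real context slot; it models the behaviour only.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical slot holding the name of the active span.
active_span = contextvars.ContextVar("active_span", default=None)


def read_parent():
    # Runs on a worker thread; returns whatever the slot holds there.
    return active_span.get()


def run_both():
    active_span.set("POST /checkout")
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Plain submit: the worker thread does not see the caller's context.
        lost = pool.submit(read_parent).result()
        # ctx.run: the worker executes inside a snapshot of the caller's context.
        ctx = contextvars.copy_context()
        kept = pool.submit(ctx.run, read_parent).result()
    return lost, kept


# Run inside a fresh context so the set() above does not leak into the caller.
lost, kept = contextvars.copy_context().run(run_both)
```

Here lost comes back empty, which is the situation that produces an orphaned root span, while kept carries the parent reference across the thread boundary.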

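Node.js: AsyncLocalStorage propagates automatically within async/await chains. Problems typically arise with callbacks that are invoked from outside the originating async chain, such as event-emitter listeners or callbacks handed to non-instrumented native addons. Bind those callbacks to the active context with context.bind(). worker_threads do not share the parent's AsyncLocalStorage, so context has to be passed to a worker explicitly, for example by injecting it into the message and extracting it on the other side.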

const { context } = require('@opentelemetry/api');

// Bind a callback to the currently active context before handing it off
const boundCallback = context.bind(context.active(), myCallback);
someExternalEmitter.on('data', boundCallback);

Kafka consumer: extract the trace context from the message headers and start a child span of the producer's span:

from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)

def consume_message(msg):
    # Reconstruct the context from W3C traceparent header in message metadata.
    # msg is assumed to be a confluent-kafka Message, whose headers() returns
    # a list of (key, bytes) tuples or None.
    carrier = {k: v.decode() for k, v in (msg.headers() or []) if v is not None}
    remote_ctx = extract(carrier)

    with tracer.start_as_current_span(
        "consumer.process_fulfilment",
        context=remote_ctx,               # parent = producer's span
        kind=trace.SpanKind.CONSUMER,
    ) as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination", msg.topic())
        # _process is the application's own handler, assumed to exist
        _process(msg.value())
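When a consumer span still shows up as a root, it helps to inspect the header that was actually delivered. The following is a minimal sketch of a traceparent check for version 00 of the W3C format; parse_traceparent and ParentRef are hypothetical helper names, and the SDK's own propagator remains the authority on what it accepts.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context: version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class ParentRef(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[ParentRef]:
    """Return the parent reference carried by a traceparent header, or None if invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff is forbidden, and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return ParentRef(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parent = parse_traceparent(header)
```

Logging the parsed trace_id on the producer and consumer sides shows quickly whether the header was lost, rewritten, or never injected.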

Step 5 — Configure the Span Processor Pipeline

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# resource is the Resource built during SDK setup (see Prerequisites)
provider = TracerProvider(resource=resource)

provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317"),
        max_queue_size=4096,          # raise from default 2048 for burst traffic
        max_export_batch_size=512,
        schedule_delay_millis=3000,   # flush every 3 s; lower = less data loss on crash
        export_timeout_millis=10000,
    )
)

Verification

After deploying both auto and manual spans, run a representative workflow and confirm the full trace appears in Jaeger or Tempo.

Jaeger UI query — search by service name and operation:

Service: checkout-service
Operation: process_payment
Min Duration: 0ms

Expect a waterfall showing POST /checkout (auto, root) → process_payment (manual, child) → call_fraud_api (manual, grandchild) and a sibling SELECT orders (auto, child of root). If process_payment appears as a root span with a different traceID, the context slot was empty when the manual span was created — see Edge Cases below.
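The same check can be scripted against the JSON returned by the Jaeger HTTP API used in Step 2. This is a sketch under assumptions: the helper names are hypothetical, and the sample data is hand-written in the shape of the response's data array (spans carrying traceID and operationName).

```python
from typing import Dict, List, Set


def trace_ids_by_operation(traces: List[dict]) -> Dict[str, Set[str]]:
    """Map each operationName to the set of trace IDs in which it appears."""
    seen: Dict[str, Set[str]] = {}
    for trace in traces:
        for span in trace.get("spans", []):
            seen.setdefault(span["operationName"], set()).add(span["traceID"])
    return seen


def orphaned(traces: List[dict], manual_op: str, root_op: str) -> Set[str]:
    """Trace IDs that contain the manual span but not the expected root span."""
    ids = trace_ids_by_operation(traces)
    return ids.get(manual_op, set()) - ids.get(root_op, set())


# Hand-written sample shaped like the "data" array of the Jaeger response.
sample = [
    {"traceID": "aaa", "spans": [
        {"traceID": "aaa", "spanID": "1", "operationName": "POST /checkout"},
        {"traceID": "aaa", "spanID": "2", "operationName": "process_payment"},
    ]},
    {"traceID": "bbb", "spans": [
        {"traceID": "bbb", "spanID": "3", "operationName": "process_payment"},
    ]},
]
broken = orphaned(sample, "process_payment", "POST /checkout")
```

Any trace ID returned by orphaned points at a request where the manual span was created without an active parent.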

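CLI smoke test that writes spans to a local file (useful in CI). The file exporter name and path variable shown here depend on the SDK distribution in use; where no file sink is available, OTEL_TRACES_EXPORTER=console with stdout redirected to a file is a common substitute, though the filter below would need adapting to its output format: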

# Export spans to a local file, then grep for the manual span name
OTEL_TRACES_EXPORTER=file \
OTEL_EXPORTER_FILE_PATH=/tmp/spans.json \
python -m pytest tests/integration/test_checkout.py -k test_payment_flow

grep '"process_payment"' /tmp/spans.json | jq '.traceId' | sort -u
# Should return exactly one trace ID — the same as the HTTP root span

Edge Cases and Gotchas

Auto-instrumentation agent loaded after app code: require/import order in Node.js is irreversible; if express or pg is imported before tracing.js, the instrumentation patch never applies. Always use --require ./tracing.js or ensure tracing.js is the first import in the entry module.
Thread pool context loss in Python: concurrent.futures.ThreadPoolExecutor does not copy contextvars state into worker threads automatically (this differs from asyncio tasks, which do inherit context). Failing to use ctx.run() as shown in Step 4 produces orphaned spans every time.
span.end() called twice: using both a with context manager and an explicit span.end() call ends the same span twice. The SDK ignores the second call (the Python SDK logs a warning), so the span is exported only once, but the duplicate call adds noise and hides who owns the span's lifecycle. Pick one pattern per span.
Exception swallowed before record_exception: with an explicit span handle, catching an exception without calling span.record_exception() leaves the span with status UNSET and no exception event, whether or not the exception is re-raised. The same happens inside a with block when the exception is caught and not re-raised, because the context manager never sees it. Call record_exception and set_status in the except branch.
High-cardinality attribute values: storing unbounded values such as raw SQL queries or full user-agent strings in span attributes bloats Jaeger/Tempo indexes and can cause backends to reject oversized spans. Truncate such values or replace them with canonical identifiers (e.g., db.operation instead of db.statement).
OTEL_PROPAGATORS mismatch across services: if service A injects tracecontext headers but service B is configured with b3 only, service B ignores the traceparent header and its server span starts a new root trace. Standardise OTEL_PROPAGATORS=tracecontext,baggage across the entire fleet.
Sidecar proxy strips propagation headers: Envoy and Linkerd sidecars pass through traceparent by default, but some WAF or API gateway configurations strip unknown headers. If spans from downstream services appear as roots, capture the raw request headers on both sides and compare the traceparent value before and after the proxy hop.
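For the high-cardinality gotcha above, bounding values before they reach set_attribute can be as simple as the sketch below. The helper names and the 256-character limit are assumptions chosen for illustration, not SDK defaults.

```python
from typing import Union

AttrValue = Union[str, bool, int, float]

MAX_ATTR_LEN = 256  # assumed limit; tune to what your backend tolerates


def bounded(value: AttrValue, limit: int = MAX_ATTR_LEN) -> AttrValue:
    """Truncate long string values; leave other attribute types untouched."""
    if isinstance(value, str) and len(value) > limit:
        return value[:limit]
    return value


def sql_operation(statement: str) -> str:
    """Reduce a SQL statement to its leading keyword, e.g. 'SELECT'."""
    words = statement.strip().split(None, 1)
    return words[0].upper() if words else ""


attrs = {
    "db.operation": sql_operation("select * from orders where id = 42"),
    "http.user_agent": bounded("Mozilla/5.0 " + "x" * 1000),
}
```

The OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT environment variable offers a similar cap at the SDK level where it is supported; an application-side helper like this one keeps the decision visible next to the attribute it protects.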

Performance and Scale Notes

Auto-instrumentation agents are commonly reported to add roughly 2–5% CPU overhead on typical HTTP workloads; treat that as a starting estimate and measure your own service. Bytecode transformation at class-load time (Java) and module patching at import time (Python/Node.js) are one-time startup costs. The steady-state cost is the per-span work, on the order of 2–10 µs per span on modern hardware, which stays negligible until spans-per-second exceeds roughly 50,000.

Manual spans add less overhead than auto spans because they skip attribute inference heuristics. The bottleneck shifts to the BatchSpanProcessor flush cycle. Default settings (max_queue_size=2048, schedule_delay_millis=5000) suit services under 500 RPS. Above that threshold:

Increase max_queue_size to 8192–16384.
Reduce schedule_delay_millis to 1000–2000 ms to prevent queue saturation.
Move the exporter target to a local OpenTelemetry Collector sidecar to eliminate network round-trip latency from the flush path.
Monitor otel_bsp_dropped_spans_total (exposed via the SDK’s metric exporter) — any non-zero value signals queue overflow.
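A back-of-the-envelope model shows why export latency matters at least as much as queue size. The sketch assumes a single export worker sending batches back to back, a constant span rate, and no scheduling delay; the workload figures are made up for illustration.

```python
def sustained_export_rate(max_export_batch_size: int, export_latency_s: float) -> float:
    """Spans per second one exporter can drain if batches are sent back to back."""
    return max_export_batch_size / export_latency_s


def seconds_until_drop(max_queue_size: int, span_rate: float, drain_rate: float) -> float:
    """Time for an empty queue to fill; infinite when the exporter keeps up."""
    surplus = span_rate - drain_rate
    return float("inf") if surplus <= 0 else max_queue_size / surplus


# Assumed workload: 800 requests/s producing 5 spans each, 250 ms per export call.
span_rate = 800 * 5
drain = sustained_export_rate(512, 0.25)
default_queue = seconds_until_drop(2048, span_rate, drain)
# Same workload with a local collector answering in an assumed 50 ms.
local_collector = seconds_until_drop(2048, span_rate, sustained_export_rate(512, 0.05))
```

Under these assumptions the default queue overflows in about a second, and a larger queue only postpones that, whereas cutting export latency removes the surplus altogether.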

Head-based sampling at the SDK level (probabilistic sampler, ratio 0.1–0.5) is the fastest way to cut volume before spans reach the processor queue. Apply it at the TracerProvider level, not inside business logic, so sampling decisions are consistent across the trace.
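Consistency across a trace follows from deriving the decision from the trace ID itself instead of a fresh random draw. The sketch below illustrates that idea and is not the SDK's code; in the Python SDK the corresponding sampler is TraceIdRatioBased, usually wrapped in ParentBased, and its details may differ.

```python
TRACE_ID_MASK = (1 << 64) - 1  # the decision uses the low 64 bits of the trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio decision: the same trace ID always yields the same answer."""
    bound = round(ratio * (TRACE_ID_MASK + 1))
    return (trace_id & TRACE_ID_MASK) < bound


trace_id = int("4bf92f3577b34da6a3ce929d0e0e4736", 16)
decision_in_service_a = should_sample(trace_id, 0.5)
decision_in_service_b = should_sample(trace_id, 0.5)
```

Two services configured with the same ratio reach the same verdict for a given trace without coordinating, which is why the ratio belongs in provider configuration and not in business logic.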

Troubleshooting FAQ

Q: Why do my manually created spans appear as root spans instead of children?

The active context is absent when span creation happens. This usually means the context was not propagated across a thread boundary, an async task, or a message queue hop. For threads and detached callbacks, capture the active context before the boundary and restore it on the other side using contextvars (Python) or context.bind() (Node.js). For message queues, inject the context into the message headers on the producer side and extract it in the consumer. Confirm the fix by checking that both spans share the same traceId in Jaeger.

Q: Can auto-instrumentation and manual instrumentation coexist in the same service?

Yes. The auto-instrumentation agent establishes the root or entry span; manual spans nest inside it as children by reading the active context via opentelemetry.trace.get_current_span() or context.active(). Both approaches share the same TracerProvider and propagate through the same context object.

Q: My spans show status OK but exceptions are not recorded — why?

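Once a span's status is set to StatusCode.OK it is final: a later set_status(StatusCode.ERROR) is ignored, so a span marked OK before a subsequent failure stays OK. Set OK only after the last operation that can fail. Setting the status also does not record the exception; call span.record_exception(e) alongside span.set_status(StatusCode.ERROR). With Python context managers, an exception that escapes the with block is recorded and sets error status automatically; explicit span handles require manual exception recording.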
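The precedence rule can be pictured with a toy model. ToySpan and its Status enum are hypothetical and model only the ordering described by the OpenTelemetry specification (OK is final, UNSET never overrides), not the real span class.

```python
from enum import Enum


class Status(Enum):
    UNSET = 0
    OK = 1
    ERROR = 2


class ToySpan:
    """Hypothetical span that models only the status precedence rules."""

    def __init__(self) -> None:
        self.status = Status.UNSET

    def set_status(self, new: Status) -> None:
        # OK is final, and setting UNSET never changes anything.
        if self.status is Status.OK or new is Status.UNSET:
            return
        self.status = new


early_ok = ToySpan()
early_ok.set_status(Status.OK)      # set before the risky call
early_ok.set_status(Status.ERROR)   # ignored: OK is already final

late_ok = ToySpan()
late_ok.set_status(Status.ERROR)    # the failure is recorded first
```

The first span ends up reporting OK despite the failure, which matches the symptom in the question.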

Q: How do I prevent the BatchSpanProcessor from dropping spans under load?

Raise max_queue_size (default 2048) in proportion to peak span volume, and raise max_export_batch_size (default 512) with it so that each export drains more of the queue. Monitor the otel_bsp_dropped_spans_total metric where your SDK exposes it. If export calls are slow enough that spans arrive faster than batches leave, the queue fills regardless of schedule_delay_millis (default 5000 ms). Shorten the delay so flushes start sooner, and move the exporter target to a local OpenTelemetry Collector sidecar to cut export latency.

Q: Does auto-instrumentation capture Kafka consumer spans automatically?

Only with an explicit Kafka instrumentation package such as opentelemetry-instrumentation-kafka-python or @opentelemetry/instrumentation-kafkajs. Without it, consumer processing runs outside any trace context; you must extract the W3C TraceContext header from the message record and manually create a child span linked to the producer’s trace, as shown in Step 4.

Manual Span Creation for Custom Business Logic — production patterns for span lifecycle, exception recording, and attribute enrichment
OpenTelemetry SDK Setup for Backend Services — TracerProvider initialisation, exporter configuration, and resource attributes
Handling Async Boundaries in Node.js and Python — context propagation across thread pools, event loops, and message queues
Context Propagation Across Service Meshes — sidecar proxy header forwarding and mesh-level trace continuity
Span Lifecycle and Parent-Child Relationships — how span context scoping determines trace topology
