Auto-Instrumentation vs Manual Span Creation
Problem Framing
An OpenTelemetry deployment is live, the exporter reaches the backend, and the dashboard shows spans — yet incident investigations repeatedly dead-end. HTTP entry spans exist but the database call three hops in carries no db.statement attribute. A payment workflow generates a root span with no children even though the processor ran. A Celery worker’s spans appear disconnected from the HTTP request that triggered them. The symptoms share one cause: the gap between what framework hooks observe automatically and what custom business logic actually executes. Choosing incorrectly between auto-instrumentation and manual span creation — or failing to connect the two — leaves exactly those gaps that matter most during an outage.
Prerequisites
- OpenTelemetry SDK Setup for Backend Services completed: TracerProvider, resource attributes, and exporter configured.
- OpenTelemetry SDK version 1.20+ (Python) or @opentelemetry/sdk-node 0.45+ (Node.js).
- A running Jaeger or Tempo backend, or an OTLP-compatible collector endpoint.
- Familiarity with span lifecycle and parent-child relationships — specifically how context objects scope parent references.
Concept Deep-Dive: The Span Lifecycle and Context Slot
Every span in OpenTelemetry follows a state machine: Started → Active → Ended → Exported. A span becomes “active” by being attached to the current context slot — a thread-local or async-local storage key that descendant code reads when it needs a parent reference. Auto-instrumentation hooks operate at framework entry points (HTTP handler dispatch, gRPC method execution, database driver calls) and manage this context slot automatically. Manual span creation reads the same slot to establish the parent, then writes a new span into it for the duration of its with block or explicit lifecycle.
The diagram below shows how the context slot connects both approaches within a single request.
The key insight: both approaches write to and read from the same context slot. Auto-instrumentation owns the entry and framework spans; manual spans attach as children by reading that inherited context. Disconnect them — for example by crossing a thread boundary without propagating the context — and child spans become orphaned root spans with new, unrelated trace IDs.
Step-by-Step Implementation
Step 1 — Attach the Auto-Instrumentation Agent
Python uses the opentelemetry-instrument entry-point wrapper. Install the SDK plus any framework-specific packages:
pip install opentelemetry-sdk \
opentelemetry-exporter-otlp-proto-grpc \
opentelemetry-instrumentation-fastapi \
opentelemetry-instrumentation-sqlalchemy \
opentelemetry-instrumentation-httpx
Launch with the agent:
OTEL_SERVICE_NAME=checkout-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
OTEL_PROPAGATORS=tracecontext,baggage \
opentelemetry-instrument uvicorn app.main:app --host 0.0.0.0 --port 8000
Node.js initialises the SDK programmatically before any require/import of application code:
// tracing.js — must be required FIRST via --require flag
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-proto');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
node --require ./tracing.js server.js
Step 2 — Verify Baseline Coverage Before Adding Manual Spans
Query Jaeger or Tempo to confirm framework spans arrive with the correct structure:
# Jaeger HTTP API: find recent traces for the service
curl -s "http://localhost:16686/api/traces?service=checkout-service&limit=5" \
| jq '.data[0].spans[] | {operationName, tags: (.tags | map(select(.key=="span.kind")))}'
Expected output confirms HTTP server spans with span.kind: server:
{ "operationName": "POST /checkout", "tags": [{ "key": "span.kind", "value": "server" }] }
{ "operationName": "SELECT", "tags": [{ "key": "span.kind", "value": "client" }] }
If no spans appear at all, check that OTEL_EXPORTER_OTLP_ENDPOINT resolves and the collector is listening. If spans appear but lack database children, verify the SQLAlchemy or pg instrumentation package is installed and that the instrumentation has not been switched off via OTEL_PYTHON_DISABLED_INSTRUMENTATIONS (OTEL_PYTHON_EXCLUDED_URLS only filters HTTP routes; it does not affect database spans).
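The same baseline check can be scripted. The sketch below assumes the JSON shape returned by the Jaeger query endpoint used above (spans carrying `operationName` and a `references` list with `refType` entries); `summarize_trace` and the sample payload are hypothetical and only illustrate how to separate root spans from children.

```python
from typing import Any, Dict, List


def summarize_trace(trace: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group a Jaeger trace's operation names into roots and children."""
    roots: List[str] = []
    children: List[str] = []
    for span in trace.get("spans", []):
        # A span with a CHILD_OF reference has a parent in the same trace
        has_parent = any(
            ref.get("refType") == "CHILD_OF" for ref in span.get("references", [])
        )
        (children if has_parent else roots).append(span["operationName"])
    return {"roots": roots, "children": children}


# Hypothetical payload shaped like one element of `.data[]` from the curl call
sample_trace = {
    "traceID": "abc123",
    "spans": [
        {"spanID": "1", "operationName": "POST /checkout", "references": []},
        {
            "spanID": "2",
            "operationName": "SELECT",
            "references": [
                {"refType": "CHILD_OF", "traceID": "abc123", "spanID": "1"}
            ],
        },
    ],
}
summary = summarize_trace(sample_trace)
```

A healthy baseline has exactly one root per trace. More than one entry under `roots` for a single request suggests that context was lost somewhere between the entry span and the downstream call.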
Step 3 — Add Manual Spans for Business Logic
Obtain a tracer from the global provider. The tracer name should identify the module, not the service (the service name is already in the resource).
Python — using context manager (preferred for synchronous code):
from opentelemetry import trace
from opentelemetry.trace import StatusCode

# _call_stripe_api and StripeError are assumed to come from the application's
# own payment client module (not shown here).
tracer = trace.get_tracer(__name__)


def process_payment(order_id: str, amount: float) -> dict:
    # start_as_current_span writes the new span into the context slot
    with tracer.start_as_current_span(
        "process_payment",
        # Exceptions are recorded explicitly below, so turn off the context
        # manager's automatic recording to avoid a duplicate exception event
        record_exception=False,
        set_status_on_exception=False,
    ) as span:
        # Attach business attributes using semantic conventions where applicable
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_usd", amount)
        span.set_attribute("payment.processor", "stripe")
        try:
            result = _call_stripe_api(order_id, amount)
            span.set_attribute("payment.transaction_id", result["id"])
            span.set_status(StatusCode.OK)
            return result
        except StripeError as exc:
            # record_exception captures stack trace; set_status marks the span ERROR
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR, description=str(exc))
            raise
Python — explicit handle required for async hand-off across boundaries:
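from opentelemetry import context as context_api
from opentelemetry import trace
from opentelemetry.trace import StatusCode

# message_queue is assumed to be the application's own async queue client.
tracer = trace.get_tracer(__name__)


async def enqueue_fulfilment(order_id: str) -> None:
    # Start the span without a context manager so it can be ended after the await
    span = tracer.start_span("enqueue_fulfilment")
    # set_span_in_context returns a Context with the span active; attach()
    # installs it in the context slot and returns a token for detach()
    ctx = trace.set_span_in_context(span)
    token = context_api.attach(ctx)
    try:
        await message_queue.publish("fulfilment", {"order_id": order_id})
        span.set_status(StatusCode.OK)
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(StatusCode.ERROR)
        raise
    finally:
        context_api.detach(token)
        span.end()  # explicit end: no context manager to call it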
Node.js — startActiveSpan callback sets the span as active for the duration:
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('checkout-module', '1.0.0');

async function processPayment(orderId, amount) {
  // startActiveSpan makes the span active in AsyncLocalStorage for any
  // instrumentation that runs inside the callback
  return tracer.startActiveSpan('process_payment', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'payment.amount_usd': amount,
      'payment.processor': 'stripe',
    });
    try {
      const result = await callStripeApi(orderId, amount);
      span.setAttribute('payment.transaction_id', result.id);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end(); // always end, even on error paths
    }
  });
}
Step 4 — Bridge Async Boundaries
Thread pools and message consumers break the context slot. Propagate context explicitly.
Python thread pool — copy the active context before submitting:
import contextvars
from concurrent.futures import ThreadPoolExecutor


def submit_with_context(executor: ThreadPoolExecutor, fn, *args):
    # Snapshot the caller's context (including the active span); ctx.run
    # executes fn inside that snapshot on the worker thread
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)
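The effect of the snapshot can be observed with the standard library alone. In this sketch, `active_span` is a hypothetical stand-in for the tracing context slot; it assumes a standard CPython build, where worker threads start from an empty context.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple

# Stand-in for the tracing context slot (hypothetical name)
active_span = contextvars.ContextVar("active_span", default="<none>")


def read_slot() -> str:
    return active_span.get()


def run_both() -> Tuple[str, str]:
    active_span.set("POST /checkout")
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Plain submit: the worker thread does not see the caller's context
        lost = pool.submit(read_slot).result()
        # Snapshot-and-run: the worker executes inside a copy of the caller's context
        ctx = contextvars.copy_context()
        kept = pool.submit(ctx.run, read_slot).result()
    return lost, kept
```

Under that assumption, `run_both()` returns the default value for the plain submit and the caller's value for the `ctx.run` submit. This corresponds to an orphaned root span versus a correctly parented child.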
Node.js — AsyncLocalStorage propagates automatically within async/await chains. Problems arise only with raw callbacks passed to non-instrumented C++ addons or worker_threads. Propagate manually using context.bind():
const { context } = require('@opentelemetry/api');
// Bind a callback to the currently active context before handing it off
const boundCallback = context.bind(context.active(), myCallback);
someExternalEmitter.on('data', boundCallback);
Kafka consumer: extract trace context from the message headers and create a child span of the producer's span (the example assumes confluent-kafka style message accessors):
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)


def consume_message(msg):
    # Rebuild the carrier from the message headers; extract() reads the
    # W3C traceparent entry from it
    carrier = {k: v.decode() for k, v in (msg.headers() or []) if v is not None}
    remote_ctx = extract(carrier)
    with tracer.start_as_current_span(
        "consumer.process_fulfilment",
        context=remote_ctx,  # parent = producer's span
        kind=trace.SpanKind.CONSUMER,
    ) as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination", msg.topic())
        _process(msg.value())
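To make concrete what the propagator reads from the message headers, the sketch below parses a version-00 W3C `traceparent` value by hand. It is for illustration and debugging only; production code should keep using `extract()`. `ParentRef` and `parse_traceparent` are hypothetical names.

```python
import re
from typing import Iterable, NamedTuple, Optional, Tuple

# version-traceid-spanid-flags, all lowercase hex (version 00 layout)
_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class ParentRef(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(
    headers: Iterable[Tuple[str, Optional[bytes]]]
) -> Optional[ParentRef]:
    """Return the parent reference carried in Kafka-style (key, bytes) headers."""
    for key, value in headers:
        if key.lower() != "traceparent" or value is None:
            continue
        match = _TRACEPARENT.match(value.decode("ascii", errors="replace").strip())
        if match is None or match["version"] == "ff":
            return None
        trace_id, span_id = match["trace_id"], match["span_id"]
        # All-zero IDs are invalid per the W3C Trace Context specification
        if set(trace_id) == {"0"} or set(span_id) == {"0"}:
            return None
        sampled = bool(int(match["flags"], 16) & 0x01)
        return ParentRef(trace_id, span_id, sampled)
    return None
```

Logging the parsed `trace_id` on the consumer side and comparing it with the producer's trace ID is a quick way to confirm whether the header survived the broker hop.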
Step 5 — Configure the Span Processor Pipeline
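from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource identifies the service; reuse the one from your SDK setup if it exists
resource = Resource.create({"service.name": "checkout-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317"),
        max_queue_size=4096,         # raise from default 2048 for burst traffic
        max_export_batch_size=512,
        schedule_delay_millis=3000,  # flush every 3 s (default 5 s); lower = less data loss on crash
        export_timeout_millis=10000,
    )
)
# Register the provider globally so trace.get_tracer() uses this pipeline.
# When launching with opentelemetry-instrument, supply the equivalent
# settings through the OTEL_BSP_* environment variables instead.
trace.set_tracer_provider(provider)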
Verification
After deploying both auto and manual spans, run a representative workflow and confirm the full trace appears in Jaeger or Tempo.
Jaeger UI query — search by service name and operation:
Service: checkout-service
Operation: process_payment
Min Duration: 0ms
Expect a waterfall showing POST /checkout (auto, root) → process_payment (manual, child) → call_fraud_api (manual, grandchild) and a sibling SELECT orders (auto, child of root). If process_payment appears as a root span with a different traceID, the context slot was empty when the manual span was created — see Edge Cases below.
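CLI smoke test using the console span exporter (useful in CI). The console exporter prints each finished span as JSON to stdout, so the run below captures stdout to a file; it assumes the tests run under opentelemetry-instrument so that OTEL_TRACES_EXPORTER takes effect:
# Print spans to stdout, capture them, then look up the manual span's trace ID
OTEL_TRACES_EXPORTER=console \
opentelemetry-instrument python -m pytest -s tests/integration/test_checkout.py -k test_payment_flow > /tmp/spans.log
grep -A 3 '"name": "process_payment"' /tmp/spans.log | grep '"trace_id"' | sort -u
# Should return exactly one trace ID: the same as the HTTP root span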
Edge Cases and Gotchas
- Auto-instrumentation agent loaded after app code: In Node.js, module load order cannot be corrected after the fact; if express or pg is imported before tracing.js, the instrumentation patch never applies. Always use --require ./tracing.js or ensure tracing.js is the first import in the entry module.
- Thread pool context loss in Python: concurrent.futures.ThreadPoolExecutor does not copy contextvars state into worker threads automatically (this differs from asyncio tasks, which do inherit context). Failing to use ctx.run() as shown in Step 4 produces orphaned spans every time.
- span.end() called twice: Using both a with context manager and an explicit span.end() call ends the same span twice. The SDK ignores the second call, so the span is still exported only once, but the redundant call points to a lifecycle bug and the Python SDK logs a warning for it. Pick one pattern per span.
- Exception swallowed before record_exception: Catching an exception inside the span and handling it without re-raising, or re-raising from code that uses an explicit span handle, leaves the span with an unset status and no exception event. The with-block context manager only records exceptions that propagate out of it. Call record_exception in the except branch whenever the context manager will not see the exception.
- High-cardinality attribute values: Storing unbounded values such as raw SQL queries or full user-agent strings in span attributes inflates span size and bloats Jaeger/Tempo indexes. Use truncation or replace raw values with canonical identifiers (e.g., db.operation instead of db.statement).
- OTEL_PROPAGATORS mismatch across services: If service A injects tracecontext headers but service B is configured with b3 only, service B ignores the W3C traceparent header and starts a new root trace. Standardise OTEL_PROPAGATORS=tracecontext,baggage across the entire fleet.
- Sidecar proxy strips propagation headers: Envoy and Linkerd sidecars pass through traceparent by default, but some WAF or API gateway configurations strip unknown headers. If spans from downstream services appear as roots, capture the raw request headers on both sides and compare the traceparent value before and after the proxy hop.
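The two mitigations for oversized or high-cardinality attributes can be sketched as small helpers applied before `set_attribute`. Both function names are hypothetical, and the 256-character default is an arbitrary assumption; the SDK-wide attribute length limit (for example via `OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT`) may be a better fit where it is available.

```python
def bounded_attribute(value: str, max_len: int = 256) -> str:
    """Cap an attribute value and mark the point where it was cut."""
    marker = "..."
    if max_len <= len(marker):
        raise ValueError("max_len must exceed the truncation marker length")
    if len(value) <= max_len:
        return value
    return value[: max_len - len(marker)] + marker


def sql_operation(statement: str) -> str:
    """Return the leading SQL keyword as a low-cardinality stand-in for the statement."""
    parts = statement.strip().split(None, 1)
    return parts[0].upper() if parts else "UNKNOWN"
```

For example, `span.set_attribute("db.operation", sql_operation(query))` records only the verb, while `bounded_attribute(user_agent)` keeps a recognisable prefix without letting one header dominate the span payload.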
Performance and Scale Notes
Auto-instrumentation agents typically impose 2–5% CPU overhead on HTTP workloads. Much of the cost comes from bytecode transformation at class-load time (Java) or module patching at import time (Python/Node.js) rather than per-request execution. Once patched, per-span overhead is roughly 2–10 µs on modern hardware, which is negligible until throughput exceeds roughly 50,000 spans per second.
Manual spans add less overhead than auto spans because they skip attribute inference heuristics. The bottleneck shifts to the BatchSpanProcessor flush cycle. Default settings (max_queue_size=2048, schedule_delay_millis=5000) suit services under 500 RPS. Above that threshold:
- Increase max_queue_size to 8192–16384.
- Reduce schedule_delay_millis to 1000–2000 ms to prevent queue saturation.
- Move the exporter target to a local OpenTelemetry Collector sidecar to eliminate network round-trip latency from the flush path.
- Monitor the SDK's dropped-span metric (referred to in this guide as otel_bsp_dropped_spans_total; the exact name depends on the SDK and metric exporter in use). Any non-zero value signals queue overflow.
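A back-of-envelope check can show whether a given queue configuration is plausible before load testing. The model below is a deliberate simplification and an assumption of this sketch: one exporter sends batches sequentially, and spans arrive at a steady rate. `QueueEstimate` and `estimate_queue` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueEstimate:
    sustainable: bool        # can the exporter drain as fast as spans arrive?
    backlog_per_export: int  # spans arriving while one export call is in flight
    fits_in_queue: bool      # does that backlog fit within max_queue_size?


def estimate_queue(
    spans_per_second: int,
    export_latency_ms: int,
    max_queue_size: int = 2048,
    max_export_batch_size: int = 512,
) -> QueueEstimate:
    # One batch leaves per export call, so this is the best-case drain rate
    drain_rate = max_export_batch_size * 1000 / export_latency_ms
    backlog = math.ceil(spans_per_second * export_latency_ms / 1000)
    return QueueEstimate(
        sustainable=drain_rate >= spans_per_second,
        backlog_per_export=backlog,
        fits_in_queue=backlog <= max_queue_size,
    )
```

With 5,000 spans per second and a 200 ms export round trip, the backlog per export fits the default queue, but the drain rate of 2,560 spans per second cannot keep up, so drops are a matter of time. Cutting export latency, for instance with a local collector, changes the estimate more than enlarging the queue does.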
Head-based sampling at the SDK level (probabilistic sampler, ratio 0.1–0.5) is the fastest way to cut volume before spans reach the processor queue. Apply it at the TracerProvider level, not inside business logic, so sampling decisions are consistent across the trace.
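Ratio-based head sampling stays consistent across services because the decision can be derived from the trace ID alone. The sketch below illustrates that idea by comparing the lower 64 bits of the trace ID against a threshold; it is a simplified illustration rather than the exact algorithm of any particular SDK, and `should_sample` is a hypothetical name.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the lower 64 bits of a 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic keep/drop decision for a trace at the given sampling ratio."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound
```

Because every service that sees the same trace ID computes the same answer, a trace is either kept whole or dropped whole. That is why the sampler belongs on the TracerProvider rather than in individual code paths.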
Troubleshooting FAQ
Q: Why do my manually created spans appear as root spans instead of children?
The active context is absent when span creation happens. This usually means the context was not propagated across a thread boundary, an async task, or a message queue hop. Capture the active context before the boundary and restore it on the other side using contextvars (Python) or AsyncLocalStorage (Node.js). Confirm the fix by checking that both spans share the same traceId in Jaeger.
Q: Can auto-instrumentation and manual instrumentation coexist in the same service?
Yes. The auto-instrumentation agent establishes the root or entry span; manual spans nest inside it as children by reading the active context via opentelemetry.trace.get_current_span() or context.active(). Both approaches share the same TracerProvider and propagate through the same context object.
Q: My spans show status OK but exceptions are not recorded — why?
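Once a span's status is set to StatusCode.OK it is final: the SDK ignores a later set_status(StatusCode.ERROR), so setting OK before the failure-prone work has finished hides the error. Set OK only after the operation has succeeded, and call span.record_exception(e) together with span.set_status(StatusCode.ERROR) in the except branch. With Python context managers, an unhandled exception inside the with block records and sets error status automatically; explicit span handles require manual exception recording.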
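The status precedence rule from the OpenTelemetry specification can be modelled in a few lines: OK is final, and an attempt to set UNSET is ignored. The `Status` enum and `StatusHolder` class below are a hypothetical toy model, not SDK code.

```python
from enum import Enum


class Status(Enum):
    UNSET = 0
    OK = 1
    ERROR = 2


class StatusHolder:
    """Toy model of how a span resolves successive set_status calls."""

    def __init__(self) -> None:
        self.status = Status.UNSET

    def set_status(self, new: Status) -> None:
        # OK is final, and UNSET can never replace an existing status
        if self.status is Status.OK or new is Status.UNSET:
            return
        self.status = new


premature_ok = StatusHolder()
premature_ok.set_status(Status.OK)     # set too early
premature_ok.set_status(Status.ERROR)  # ignored: OK is already final
```

In this model `premature_ok.status` remains `Status.OK`, which matches the symptom in the question: the span reports success even though the error branch ran.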
Q: How do I prevent the BatchSpanProcessor from dropping spans under load?
Tune max_queue_size (default 2048) and max_export_batch_size (default 512) upward in proportion to your peak RPS. Monitor the SDK's dropped-span metric (referred to in this guide as otel_bsp_dropped_spans_total; the exact name depends on the SDK and metric exporter in use). If a single export call takes longer than schedule_delay_millis (default 5000 ms), spans accumulate faster than they are flushed; lengthening the delay does not help in that case, so raise max_queue_size or move the exporter to a local OpenTelemetry Collector sidecar to cut export latency.
Q: Does auto-instrumentation capture Kafka consumer spans automatically?
Only with an explicit Kafka instrumentation package such as opentelemetry-instrumentation-kafka-python or @opentelemetry/instrumentation-kafkajs. Without it, consumer processing runs outside any trace context; you must extract the W3C TraceContext header from the message record and manually create a child span linked to the producer’s trace, as shown in Step 4.
Related
- Manual Span Creation for Custom Business Logic — production patterns for span lifecycle, exception recording, and attribute enrichment
- OpenTelemetry SDK Setup for Backend Services — TracerProvider initialisation, exporter configuration, and resource attributes
- Handling Async Boundaries in Node.js and Python — context propagation across thread pools, event loops, and message queues
- Context Propagation Across Service Meshes — sidecar proxy header forwarding and mesh-level trace continuity
- Span Lifecycle and Parent-Child Relationships — how span context scoping determines trace topology