Choosing Between Head-Based and Tail-Based Sampling
You deployed distributed tracing across your microservices and set a 5% sampling rate to control costs. Three weeks later, a P1 incident surfaces and there are no traces for the failing code path — the sampling dice never landed on the right requests. That is the defining failure mode of head-based sampling applied without error-aware policies. The opposite failure is equally real: you switch to tail-based sampling, your OpenTelemetry Collector runs out of memory, and the pipeline crashes under Tuesday’s traffic spike. Picking the right strategy means understanding where in the request lifecycle the sampling decision is made, what each architecture cannot see at that moment, and what your collector infrastructure can sustain.
Prerequisites
Before applying anything on this page, confirm:
- OpenTelemetry SDK installed and initialized (Python opentelemetry-sdk >= 1.20, Java opentelemetry-java >= 1.30, Go go.opentelemetry.io/otel >= 1.20)
- At least one OpenTelemetry SDK setup for backend services completed so spans are already flowing
- OpenTelemetry Collector deployed (version >= 0.90) if you plan to use tail-based sampling
- Familiarity with W3C TraceContext propagation: the traceparent header carries the sampling flag across service boundaries
How Sampling Decisions Are Made: Ingress vs. Egress
The architectural split is simple: head-based sampling decides before a span is recorded; tail-based sampling decides after every span in a trace has been received. Everything else — memory requirements, error visibility, propagation complexity — flows from that single difference.
The diagram below shows where each decision point sits in the pipeline:
Head-Based Sampling: Decision at the SDK
Head sampling executes synchronously inside the application process, before any span data is serialized or shipped. Two mechanisms cover nearly all production use cases:
Probabilistic (ratio-based): Hashes the traceId to produce a deterministic float in [0, 1). If the value falls below the configured ratio, the trace is sampled. Because the hash is deterministic, all services in a distributed call chain that receive the same traceId will independently arrive at the same decision — provided they use the same algorithm.
Parent-based: Inherits the sampling flag from the upstream caller’s traceparent header. If the upstream marked the trace as sampled (01), all downstream spans are recorded. If it marked it as not-sampled (00), downstream spans are created as non-recording and never exported. The trace context itself is still propagated, so services further down the chain reach the same decision.
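The two mechanisms above can be sketched in a few lines of plain Python. This is an illustration rather than the SDK's actual code: the function names are hypothetical, and it assumes the ratio check compares the low 64 bits of the trace ID against a threshold, which is one common way to get a deterministic decision from the ID.

```python
# Sketch only: function names are hypothetical, not OpenTelemetry APIs.
TRACE_ID_LIMIT = (1 << 64) - 1  # assumption: only the low 64 bits are compared


def ratio_decision(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace_id always yields the same answer."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def parent_based_decision(traceparent, trace_id: int, ratio: float) -> bool:
    """Inherit the upstream sampled flag; fall back to the ratio check for roots."""
    if traceparent is None:
        return ratio_decision(trace_id, ratio)
    # traceparent layout: version-traceid-parentid-flags
    flags = int(traceparent.split("-")[3], 16)
    return bool(flags & 0x01)
```

Because the ratio check depends only on the trace ID, two services running this logic with the same ratio agree without talking to each other, while the parent-based path never recomputes anything once a traceparent header is present.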
Production characteristics:
- CPU cost: ~10–50 µs per request for hash computation and context injection. Negligible for most workloads; measurable in sub-millisecond gRPC call paths at high fan-out.
- Memory cost: Stateless — no span buffering. Works inside memory-constrained sidecar containers and edge deployments.
- Blind spot: Errors and latency outliers in low-traffic services are statistically likely to be dropped before they are ever recorded.
# OpenTelemetry Python SDK — ParentBased with probabilistic root fallback
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# 10% sampling for root (entry-point) traces; child spans inherit the decision
root_sampler = TraceIdRatioBased(0.1)
sampler = ParentBased(root_sampler)
provider = TracerProvider(sampler=sampler)
Environment-variable override — set these per deployment without recompiling:
# .env or Kubernetes ConfigMap
OTEL_TRACES_SAMPLER: parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG: "0.10"
If OTEL_TRACES_SAMPLER_ARG is absent, the SDK defaults to 1.0 (100% sampling). Always validate this default in staging — it will saturate your storage in production.
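One way to avoid silently falling back to 100% is to fail fast at startup when the variable is missing. The sketch below is an assumption about how you might wire that check into your own bootstrap code; resolve_sampling_ratio is a hypothetical helper, not part of the OpenTelemetry SDK.

```python
import os


def resolve_sampling_ratio(env=os.environ) -> float:
    """Return the configured ratio, refusing to start on a missing or invalid value."""
    raw = env.get("OTEL_TRACES_SAMPLER_ARG")
    if raw is None:
        # Hypothetical policy: treat an absent ratio as a deployment error
        # instead of accepting the 100% default.
        raise RuntimeError("OTEL_TRACES_SAMPLER_ARG is not set")
    ratio = float(raw)
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"sampling ratio out of range: {ratio}")
    return ratio
```

Calling this before constructing the TracerProvider turns a misconfigured ConfigMap into a crash loop in staging rather than a storage bill in production.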
Tail-Based Sampling: Decision at the Collector
Tail sampling decouples the retention decision from span generation. The OpenTelemetry Collector receives every span from every service with always_on SDK sampling, groups spans by traceId in an in-memory buffer, and evaluates retention policies once the trace is deemed complete or a decision_wait timeout fires.
Production characteristics:
- Evaluation window: Typically 10–30 seconds. Asynchronous spans arriving after the window closes are either attached to a decided trace (if the trace ID is still cached) or dropped. Tune decision_wait to exceed your longest async leg.
- Memory cost: All spans for active traces are held in memory. At 10 k spans/s with 2 KB average payload and a 15-second window, that is ~300 MB baseline. Provision 2–4 GB total for GC headroom.
- Network cost: Every span transits the network to the Collector regardless of retention outcome, increasing egress costs and requiring robust TLS/mTLS termination at the Collector endpoint.
- Collector scalability: A single Collector instance cannot shard the buffer; a trace must be fully resident on one instance. Scaling requires a load-balancing tier routing by traceId (see Edge Cases below).
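To make the buffering model concrete, here is a toy in-memory version of the idea: group spans by trace ID, wait for the decision window, then keep or drop the whole trace. It is a sketch under simplifying assumptions, not the Collector's implementation. The Span and TailBuffer names are hypothetical, time is passed in explicitly, and the latency check uses the slowest single span rather than the full trace duration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    is_error: bool
    duration_ms: float


class TailBuffer:
    """Toy tail sampler: buffers spans per trace and decides after decision_wait_s."""

    def __init__(self, decision_wait_s: float, latency_threshold_ms: float):
        self.decision_wait_s = decision_wait_s
        self.latency_threshold_ms = latency_threshold_ms
        self._spans = defaultdict(list)
        self._first_seen = {}

    def add(self, span: Span, now: float) -> None:
        # The decision window starts when the first span of a trace arrives.
        self._first_seen.setdefault(span.trace_id, now)
        self._spans[span.trace_id].append(span)

    def flush(self, now: float) -> dict:
        """Decide every trace whose window has elapsed; return only the kept ones."""
        due = [t for t, ts in self._first_seen.items() if now - ts >= self.decision_wait_s]
        kept = {}
        for trace_id in due:
            spans = self._spans.pop(trace_id)
            del self._first_seen[trace_id]
            has_error = any(s.is_error for s in spans)
            is_slow = max(s.duration_ms for s in spans) >= self.latency_threshold_ms
            if has_error or is_slow:
                kept[trace_id] = spans
        return kept
```

Everything in _spans stays resident until flush decides it, which is why memory grows with both span throughput and the length of the window.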
Step-by-Step Implementation
Step 1. Configure SDK-Level Head Sampling
Set up ParentBased(TraceIdRatioBased) in every service that acts as a trace root. Services that only receive downstream calls should use ParentBased(AlwaysOff) as the root fallback to avoid creating accidental root spans.
import os
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    ParentBased, TraceIdRatioBased, ALWAYS_OFF
)
from opentelemetry.sdk.resources import Resource

ratio = float(os.environ.get("OTEL_TRACES_SAMPLER_ARG", "0.10"))

# Entry-point services (API gateways, background job runners):
provider = TracerProvider(
    resource=Resource.create({"service.name": "payment-api"}),
    sampler=ParentBased(root=TraceIdRatioBased(ratio)),
)

# Interior services that should never create root traces:
interior_provider = TracerProvider(
    resource=Resource.create({"service.name": "ledger-service"}),
    sampler=ParentBased(root=ALWAYS_OFF),
)
Propagate the sampling decision across service boundaries by extracting traceparent on every inbound request and injecting it on every outbound call. See W3C TraceContext propagation for the full inject/extract pattern.
Step 2. Deploy the tail_sampling Processor
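The tail_sampling processor evaluates its policies against each buffered trace and keeps the trace if any policy votes to sample it. A trace that matches no policy is dropped. Listing policies from most specific (error retention) to least specific (probabilistic baseline) keeps the file readable: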
# otel-collector-config.yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3000
    spike_limit_mib: 600
  tail_sampling:
    decision_wait: 15s
    expected_new_spans_per_trace: 12
    policies:
      - name: always-keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow-traces
        type: latency
        latency: { threshold_ms: 1500 }
      - name: keep-high-value-services
        type: string_attribute
        string_attribute:
          key: "service.name"
          values: ["payment-gateway", "auth-service", "fraud-detector"]
      - name: probabilistic-keep-5pct
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
      # Traces that match none of the policies above are dropped by default,
      # so no explicit catch-all drop policy is listed.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/storage]
Place memory_limiter before tail_sampling in the pipeline. When the Collector approaches its memory limit, the memory limiter starts refusing new spans — which is far preferable to an OOM crash that loses the entire buffer.
Step 3. Build a Hybrid Architecture (Recommended for High-Traffic Services)
A hybrid applies lightweight head sampling to reduce Collector memory pressure, then uses tail sampling to filter the pre-sampled stream for high-value traces. This gives you predictable infrastructure cost while preserving full error visibility within the sampled stream.
# SDK: head sample at 20%, which reduces Collector ingestion load by 80%
sampler = ParentBased(root=TraceIdRatioBased(0.2))

# Collector: tail sampling on the 20% stream
processors:
  tail_sampling:
    decision_wait: 10s
    expected_new_spans_per_trace: 8
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-p95-violations
        type: latency
        latency: { threshold_ms: 2000 }
      # Traces that match neither policy are dropped by default.
With a 20% head ratio, every error that enters the pre-sampled stream is retained. Collector memory requirements drop roughly in proportion. The trade-off: errors in the 80% of traces discarded by head sampling are permanently lost, so whatever your error rate, only about 20% of errors are captured. That is acceptable for low-criticality services but not for payment or auth flows, where always_on head sampling is appropriate.
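The trade-off is easy to quantify. The helper below is a hypothetical back-of-the-envelope calculator, assuming errors are spread evenly across traces and that the Collector keeps every error it receives; the traffic figures in the usage note are illustrative, not measurements.

```python
def hybrid_error_retention(requests_per_s: float, error_rate: float, head_ratio: float) -> dict:
    """Estimate how many error traces survive head sampling in a hybrid setup."""
    errors_per_s = requests_per_s * error_rate
    return {
        "errors_per_s": errors_per_s,
        # Only errors inside the head-sampled stream ever reach the Collector.
        "errors_retained_per_s": errors_per_s * head_ratio,
        "errors_lost_per_s": errors_per_s * (1.0 - head_ratio),
    }
```

For example, hybrid_error_retention(1000, 0.01, 0.2) describes a service with 10 errors per second, of which about 2 are retained and about 8 are never recorded.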
Step 4. Verify Trace Completeness
After deploying, confirm the pipeline end-to-end before relying on it in production:
- Inject a synthetic error. Issue a request that triggers a 500 response. Check your Jaeger or Tempo backend: the complete trace including all child spans must appear within decision_wait + export_latency seconds (typically under 30 seconds).
- Check the Collector metrics. The otelcol_processor_tail_sampling_sampling_decision_timer_bucket metric shows the distribution of decision latencies. The otelcol_processor_tail_sampling_count_traces_sampled and count_traces_not_sampled counters confirm policy hit rates.
- Validate cross-service continuity. For a multi-service trace, confirm that spans from all services appear under the same traceId in the UI. Fragmented traces (spans missing from downstream services) indicate a propagation bug; re-read the W3C TraceContext inject/extract implementation.
# Query Jaeger HTTP API to verify a specific trace exists
curl -s "http://localhost:16686/api/traces/<TRACE_ID>" \
| jq '.data[0].spans | length'
# Expect: number of spans equal to your service count × avg spans per service
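A raw span count does not tell you which service is missing. The sketch below takes the JSON already returned by the Jaeger query above and breaks it down per service. It assumes the usual Jaeger response shape, where each span carries a processID that indexes into a processes map; the helper names and the expected service list are hypothetical.

```python
from collections import Counter


def spans_per_service(trace_response: dict) -> Counter:
    """Count spans per service name in a Jaeger /api/traces/<id> response."""
    trace = trace_response["data"][0]
    processes = trace["processes"]
    return Counter(
        processes[span["processID"]]["serviceName"] for span in trace["spans"]
    )


def missing_services(trace_response: dict, expected: list) -> list:
    """Return the expected services that contributed no spans to the trace."""
    seen = spans_per_service(trace_response)
    return sorted(set(expected) - set(seen))
```

An empty list from missing_services means every expected service shows up in the trace; any name it returns is a good place to start looking for a dropped traceparent header.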
Step 5. Handle Async Jobs and Message Queue Traces
Context propagation across service meshes covers sidecar scenarios, but message queues require an additional pattern: attaching traceparent to message metadata so the consumer can restore the trace context.
# Producer: inject traceparent into Kafka message headers
from opentelemetry.propagate import inject
headers = {}
inject(headers) # adds traceparent, tracestate
producer.send(
"payment-events",
value=payload,
headers=[(k, v.encode()) for k, v in headers.items()],
)
# Consumer: extract context before creating the processing span
from opentelemetry.propagate import extract
from opentelemetry import trace

def process_message(msg):
    carrier = {k: v.decode() for k, v in msg.headers}
    ctx = extract(carrier)
    tracer = trace.get_tracer("payment-consumer")
    with tracer.start_as_current_span("process-payment", context=ctx):
        # span is now a child of the producer's trace
        handle(msg.value)
Close the span explicitly after job completion, not when the message is dequeued. Use the tail_sampling processor’s string_attribute policy to target high-volume background jobs for drop, while keeping the status_code: ERROR policy at higher priority to preserve failure traces.
Edge Cases & Gotchas
- Tail sampling across multiple Collector instances breaks trace grouping. Each instance sees only a partial span set for a given traceId, so the retention policy fires with incomplete data. Fix: deploy a loadbalancing exporter tier in front of your tail-sampling Collectors, routing by traceId so all spans for one trace land on one instance.
- decision_wait too short for async spans. If your async consumers take 20 seconds to process a message but decision_wait is 15 seconds, the consumer’s spans arrive after the decision and are dropped, leaving incomplete traces in storage. Set decision_wait to at least your 99th-percentile end-to-end trace duration plus a 20% buffer.
- Missing OTEL_TRACES_SAMPLER_ARG defaults to 100%. A misconfigured ConfigMap or missing environment variable causes all head samplers to default to always_on, saturating the Collector and storage overnight.
- Probabilistic head sampling starves low-traffic services. A service handling 50 req/hour at a 10% ratio averages one sampled trace every 12 minutes and may go 20–30 minutes between sampled traces. Use a minimum floor: max(ratio, 1 request per minute) by combining TraceIdRatioBased with a rate-limiting sampler.
- expected_new_spans_per_trace misconfigured. If this value is set too low (say 5 when your traces average 40 spans), the Collector pre-allocates too little memory and triggers excessive GC. Profile your actual average with the otelcol_processor_tail_sampling_new_trace_id_received metric and set the value to P90 span count per trace.
- Head sampling breaks cross-service A/B analysis. If Service A and Service B each make their own independent random 10% decision, the probability that a cross-service trace is fully captured falls to 1% rather than 10%. ParentBased sampling fixes this: child services must inherit the parent’s decision, not resample independently.
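The decision_wait rule of thumb from the list above (99th-percentile trace duration plus a 20% buffer) can be computed from a sample of observed trace durations. This is a minimal sketch using a nearest-rank percentile; recommended_decision_wait is a hypothetical helper, and where the duration samples come from is left to your backend.

```python
def recommended_decision_wait(durations_s: list, buffer: float = 0.20) -> float:
    """Return P99 of the observed trace durations plus a safety buffer, in seconds."""
    if not durations_s:
        raise ValueError("need at least one observed trace duration")
    ordered = sorted(durations_s)
    # Nearest-rank P99 in integer arithmetic: rank = ceil(0.99 * n).
    rank = (99 * len(ordered) + 99) // 100
    p99 = ordered[rank - 1]
    return p99 * (1.0 + buffer)
```

If the slowest consumers in the sample take 20 seconds, the function returns roughly 24 seconds, which is why a 15-second decision_wait in that scenario leaves traces incomplete.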
Performance & Scale Notes
Collector memory: The tail_sampling processor holds all spans for in-flight traces. Memory scales as spans_per_second × avg_span_bytes × decision_wait_seconds. At 10 k spans/s, 2 KB/span, and a 15 s window, that is ~300 MB. With baggage or large attribute sets on each span, payload sizes can reach 5–10 KB; multiply accordingly. Always deploy a memory_limiter processor upstream.
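The sizing formula is simple enough to keep in a capacity-planning script. The function below is a direct transcription of spans_per_second × avg_span_bytes × decision_wait_seconds; the function name is hypothetical, 1 KB is taken as 1,000 bytes, and the result is the raw payload baseline before GC headroom or per-trace bookkeeping overhead.

```python
def tail_buffer_baseline_mb(
    spans_per_second: float, avg_span_bytes: float, decision_wait_seconds: float
) -> float:
    """Baseline buffer size in MB for spans held during the decision window."""
    total_bytes = spans_per_second * avg_span_bytes * decision_wait_seconds
    return total_bytes / 1_000_000
```

With the figures from the text, tail_buffer_baseline_mb(10_000, 2_000, 15) gives 300 MB, and raising the average span size to 10 KB pushes the same workload to 1,500 MB.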
Batch processor tuning: The batch processor between the tail sampler and the storage exporter buffers decided traces before writing. Set send_batch_size to 512–1024 and timeout to 5 s for most backends. For Jaeger or Tempo, match the batch size to the backend’s write buffer. Tempo’s object-storage backend benefits from larger batches (2048+) to reduce PUT request overhead.
Cardinality at the storage layer: Tail sampling increases write amplification because every retained trace arrives as a complete burst when the decision_wait timer fires for multiple traces simultaneously. Backend storage engines with columnar compression (Tempo’s Parquet blocks) handle these bursts more efficiently than row-oriented stores.
CPU cost of policy evaluation: Each policy in the tail_sampling chain iterates over all spans in the trace. With 10 policies and 40 spans per trace at 1 k traces/s, that is 400 k span checks per second. Keep your policy list concise.
Troubleshooting FAQ
Why are error traces missing when I use head-based sampling at 10%?
Head-based sampling makes a stateless decision at trace ingress before the outcome is known. At a 10% rate, 90% of errors are discarded before any span is written. Switch to tail-based sampling with a status_code: ERROR policy at the Collector, or implement a hybrid where the Collector keeps all errors from the pre-sampled stream.
How much memory does tail-based sampling require in the Collector?
At 10 k spans/s with a 15-second decision_wait and an average span payload of 2 KB, the Collector must buffer roughly 300 MB baseline. Allow 2–4 GB total for headroom and GC pressure. Deploy the memory_limiter processor upstream of tail_sampling to cap allocation and back-pressure the pipeline rather than OOM-crashing.
Can I change the sampling rate at runtime without redeploying?
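For head sampling, the SDK reads OTEL_TRACES_SAMPLER_ARG once at initialization, so update the environment variable and restart the process; a changed variable does not reach a process that is already running. For tail sampling in the Collector, update the config file and send SIGHUP so the Collector reloads its configuration in-process. Expect traces buffered in tail_sampling at that moment to be lost if the reload rebuilds the pipeline, and schedule reloads accordingly.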
What happens when tail-based sampling spans are fragmented across multiple Collector instances?
Each Collector instance sees only some of the spans for a given traceId, causing incomplete trace evaluation and incorrect policy decisions (an error trace may appear error-free on the instance holding only non-error spans). Fix this by deploying the loadbalancing exporter as a routing tier, configured with routing_key: traceID, in front of your tail-sampling Collector pool.
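The property the routing tier must guarantee is that the same trace ID always maps to the same Collector instance. The sketch below illustrates that property with a plain hash-modulo scheme. It is not how the loadbalancing exporter is implemented (a production router would typically use consistent hashing to limit reshuffling when instances come and go), and pick_collector and the hostnames in the usage note are hypothetical.

```python
import hashlib


def pick_collector(trace_id: str, collectors: list) -> str:
    """Map a trace ID to one Collector so all of its spans land on the same instance."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]
```

Every service that emits spans for trace 4bf92f3577b34da6a3ce929d0e0e4736 gets the same answer from pick_collector(trace_id, ["otel-0:4317", "otel-1:4317", "otel-2:4317"]), so the chosen instance sees the whole trace before its policies run.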
Does tail-based sampling increase storage write amplification?
Yes. Every span transits the Collector before the retention decision, so you pay full ingestion cost. The Collector then writes only retained traces to storage, but in bursts as decision_wait windows close simultaneously. Tune the batch processor’s send_batch_size and timeout to smooth write bursts, and choose a storage backend with efficient bulk-write support.