Choosing Between Head-Based and Tail-Based Sampling

You deployed distributed tracing across your microservices and set a 5% sampling rate to control costs. Three weeks later, a P1 incident surfaces and there are no traces for the failing code path — the sampling dice never landed on the right requests. That is the defining failure mode of head-based sampling applied without error-aware policies. The opposite failure is equally real: you switch to tail-based sampling, your OpenTelemetry Collector runs out of memory, and the pipeline crashes under Tuesday’s traffic spike. Picking the right strategy means understanding where in the request lifecycle the sampling decision is made, what each architecture cannot see at that moment, and what your collector infrastructure can sustain.

Prerequisites

Before applying anything on this page, confirm:

OpenTelemetry SDK installed and initialized (Python opentelemetry-sdk >= 1.20, Java opentelemetry-java >= 1.30, Go go.opentelemetry.io/otel >= 1.20)
OpenTelemetry SDK setup completed for at least one backend service, so spans are already flowing
OpenTelemetry Collector deployed (version >= 0.90) if you plan to use tail-based sampling
Familiarity with W3C TraceContext propagation — the traceparent header carries the sampling flag across service boundaries

How Sampling Decisions Are Made: Ingress vs. Egress

The architectural split is simple: head-based sampling decides before a span is recorded; tail-based sampling decides after every span in a trace has been received. Everything else — memory requirements, error visibility, propagation complexity — flows from that single difference.

The diagram below shows where each decision point sits in the pipeline:

Head-Based Sampling: Decision at the SDK

Head sampling executes synchronously inside the application process, before any span data is serialized or shipped. Two mechanisms cover nearly all production use cases:

Probabilistic (ratio-based): Maps the traceId to a deterministic value in [0, 1). The Python SDK, for example, compares the low 64 bits of the ID against a threshold derived from the ratio. If the value falls below the configured ratio, the trace is sampled. Because the mapping is deterministic, all services in a distributed call chain that receive the same traceId will independently arrive at the same decision, provided they use the same algorithm and the same ratio.

Parent-based: Inherits the sampling flag from the upstream caller’s traceparent header. If the upstream marked the trace as sampled (trace-flags 01), all downstream spans are recorded. If it marked it as not sampled (00), downstream spans are created as non-recording spans and never exported. The context is still propagated in both cases, so every service further down the chain reaches the same decision.
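
The two mechanisms above can be sketched in plain Python. This is a minimal illustration rather than the SDK's own code: the function names are hypothetical, and the 64-bit threshold comparison is modeled on the Python SDK's TraceIdRatioBased sampler, which other language SDKs may implement differently.

```python
from typing import Optional

TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a 128-bit trace ID


def ratio_decision(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio decision: same trace ID and ratio give the same answer."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def parent_decision(traceparent: str) -> bool:
    """Read the sampled bit from a W3C traceparent header value."""
    _version, _trace_id, _parent_id, flags = traceparent.split("-")
    return bool(int(flags, 16) & 0x01)


def should_sample(trace_id: int, ratio: float, traceparent: Optional[str] = None) -> bool:
    """Parent-based wrapper: inherit when a parent exists, else apply the ratio."""
    if traceparent is not None:
        return parent_decision(traceparent)
    return ratio_decision(trace_id, ratio)
```

Note that should_sample ignores the local ratio whenever a traceparent is present, which is what keeps a call chain consistent even when services are configured with different ratios.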

Production characteristics:

CPU cost: ~10–50 µs per request for hash computation and context injection. Negligible for most workloads; measurable in sub-millisecond gRPC call paths at high fan-out.
Memory cost: Stateless — no span buffering. Works inside memory-constrained sidecar containers and edge deployments.
Blind spot: Errors and latency outliers in low-traffic services are statistically likely to be dropped before they are ever recorded.

# OpenTelemetry Python SDK — ParentBased with probabilistic root fallback
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# 10% sampling for root (entry-point) traces; child spans inherit the decision
root_sampler = TraceIdRatioBased(0.1)
sampler = ParentBased(root_sampler)

provider = TracerProvider(sampler=sampler)

Environment-variable override — set these per deployment without recompiling:

# Kubernetes ConfigMap data (in a .env file, write KEY=value instead)
OTEL_TRACES_SAMPLER: parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG: "0.10"

If OTEL_TRACES_SAMPLER_ARG is absent, the SDK defaults to 1.0 (100% sampling). Always validate this default in staging — it will saturate your storage in production.
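
One way to guard against that silent 100% default is to fail fast at startup when the variable is missing or out of range. The sketch below is only an illustration: resolve_sampling_ratio is a hypothetical helper, not part of the OpenTelemetry SDK, and raising an error is just one possible policy.

```python
import os
from typing import Mapping


def resolve_sampling_ratio(env: Mapping[str, str] = os.environ) -> float:
    """Return the configured ratio, refusing to fall back to a silent default."""
    raw = env.get("OTEL_TRACES_SAMPLER_ARG")
    if raw is None:
        raise RuntimeError(
            "OTEL_TRACES_SAMPLER_ARG is not set; refusing to default to 1.0"
        )
    ratio = float(raw)  # raises ValueError on non-numeric input
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"sampling ratio must be in [0, 1], got {ratio}")
    return ratio
```

Calling this helper before the TracerProvider is constructed turns a misconfigured deployment into a visible startup failure instead of an overnight storage spike.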

Tail-Based Sampling: Decision at the Collector

Tail sampling decouples the retention decision from span generation. The OpenTelemetry Collector receives every span from every service with always_on SDK sampling, groups spans by traceId in an in-memory buffer, and evaluates retention policies once the trace is deemed complete or a decision_wait timeout fires.
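
The buffer-then-decide flow can be illustrated with a toy model. This is a conceptual sketch, not the Collector's implementation: the class and field names are hypothetical, time is passed in explicitly to keep the example deterministic, and the latency check uses the slowest span as a stand-in for whole-trace duration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


class ToyTailSampler:
    """Buffers spans per trace and decides once decision_wait has elapsed."""

    def __init__(self, decision_wait_s: float, latency_threshold_ms: float) -> None:
        self.decision_wait_s = decision_wait_s
        self.latency_threshold_ms = latency_threshold_ms
        # trace_id -> (time the first span arrived, buffered spans)
        self._buffer: Dict[str, Tuple[float, List[Span]]] = {}

    def add(self, span: Span, now: float) -> None:
        _first_seen, spans = self._buffer.setdefault(span.trace_id, (now, []))
        spans.append(span)

    def flush(self, now: float) -> List[List[Span]]:
        """Decide every trace whose wait has expired and return the kept ones."""
        kept = []
        for trace_id in list(self._buffer):
            first_seen, spans = self._buffer[trace_id]
            if now - first_seen < self.decision_wait_s:
                continue  # still waiting for more spans
            del self._buffer[trace_id]
            has_error = any(s.is_error for s in spans)
            is_slow = max(s.duration_ms for s in spans) >= self.latency_threshold_ms
            if has_error or is_slow:
                kept.append(spans)
        return kept
```

Every span stays in the buffer until its trace is decided, which is why memory grows with both throughput and the length of the wait window.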

Production characteristics:

Evaluation window: Typically 10–30 seconds. Asynchronous spans arriving after the window closes are either attached to a decided trace (if the trace ID is still cached) or dropped. Tune decision_wait to exceed your longest async leg.
Memory cost: All spans for active traces are held in memory. At 10 k spans/s with 2 KB average payload and a 15-second window, that is ~300 MB baseline. Provision 2–4 GB total for GC headroom.
Network cost: Every span transits the network to the Collector regardless of retention outcome, increasing egress costs and requiring robust TLS/mTLS termination at the Collector endpoint.
Collector scalability: A single Collector instance cannot shard the buffer — a trace must be fully resident on one instance. Scaling requires a load-balancing tier routing by traceId (see Edge Cases below).

Step-by-Step Implementation

Step 1. Configure SDK-Level Head Sampling

Set up ParentBased(TraceIdRatioBased) in every service that acts as a trace root. Services that only receive downstream calls should use ParentBased(AlwaysOff) as the root fallback to avoid creating accidental root spans.

import os
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    ParentBased, TraceIdRatioBased, ALWAYS_OFF
)
from opentelemetry.sdk.resources import Resource

ratio = float(os.environ.get("OTEL_TRACES_SAMPLER_ARG", "0.10"))

# Entry-point services (API gateways, background job runners):
provider = TracerProvider(
    resource=Resource.create({"service.name": "payment-api"}),
    sampler=ParentBased(root=TraceIdRatioBased(ratio)),
)

# Interior services that should never create root traces:
interior_provider = TracerProvider(
    resource=Resource.create({"service.name": "ledger-service"}),
    sampler=ParentBased(root=ALWAYS_OFF),
)

Propagate the sampling decision across service boundaries by extracting traceparent on every inbound request and injecting it on every outbound call. See W3C TraceContext propagation for the full inject/extract pattern.

Step 2. Deploy the tail_sampling Processor

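The tail_sampling processor evaluates every policy against each buffered trace. A trace is kept if at least one policy votes to sample it, and a trace that matches no policy is dropped, so policy order does not change the outcome. Listing policies from most specific (error retention) to least specific (probabilistic baseline) still keeps the config easy to read: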

# otel-collector-config.yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3000
    spike_limit_mib: 600

  tail_sampling:
    decision_wait: 15s
    expected_new_spans_per_trace: 12
    policies:
      - name: always-keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }

      - name: keep-slow-traces
        type: latency
        latency: { threshold_ms: 1500 }

      - name: keep-high-value-services
        type: string_attribute
        string_attribute:
          key: "service.name"
          values: ["payment-gateway", "auth-service", "fraud-detector"]

      - name: probabilistic-keep-5pct
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

      # No explicit catch-all policy: traces that match none of the
      # policies above are dropped by the processor.

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/storage]

Place memory_limiter before tail_sampling in the pipeline. When the Collector approaches its memory limit, the memory limiter starts refusing new spans — which is far preferable to an OOM crash that loses the entire buffer.

Step 3. Build a Hybrid Architecture (Recommended for High-Traffic Services)

A hybrid applies lightweight head sampling to reduce Collector memory pressure, then uses tail sampling to filter the pre-sampled stream for high-value traces. This gives you predictable infrastructure cost while preserving full error visibility within the sampled stream.

# SDK: head sample at 20% — reduces Collector ingestion load by 80%
sampler = ParentBased(root=TraceIdRatioBased(0.2))

# Collector: tail sampling on the 20% stream
processors:
  tail_sampling:
    decision_wait: 10s
    expected_new_spans_per_trace: 8
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: { status_codes: [ERROR] }

      - name: keep-p95-violations
        type: latency
        latency: { threshold_ms: 2000 }

      # No explicit catch-all policy: traces that match none of the
      # policies above are dropped by the processor.

With a 20% head ratio, every error that enters the pre-sampled stream is retained. Collector memory requirements drop roughly in proportion to the head ratio. The trade-off: errors in the 80% of traces discarded by head sampling are permanently lost, so whatever your error rate, only about 20% of error traces survive. That is acceptable for low-criticality services, but payment or auth flows warrant always_on head sampling, with tail sampling doing all of the filtering.
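
The trade-off is simple arithmetic, sketched below with made-up traffic numbers. The function name and the example figures (1,000 traces per second, 1% error rate) are assumptions chosen for illustration, not measurements.

```python
def hybrid_error_capture(traces_per_s: float, error_rate: float, head_ratio: float) -> dict:
    """Estimate what a head ratio does to Collector load and error visibility."""
    error_traces = traces_per_s * error_rate
    retained = error_traces * head_ratio  # tail sampling keeps every error it sees
    return {
        "collector_traces_per_s": traces_per_s * head_ratio,
        "error_traces_per_s": error_traces,
        "errors_retained_per_s": retained,
        "errors_lost_per_s": error_traces - retained,
    }


# Hypothetical service: 1,000 traces/s, 1% of them errors, 20% head ratio.
estimate = hybrid_error_capture(1_000, 0.01, 0.2)
for key, value in estimate.items():
    print(f"{key}: {value:.1f}")
```

Under these assumptions the Collector sees a fifth of the traffic, and four out of every five error traces never reach it.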

Step 4. Verify Trace Completeness

After deploying, confirm the pipeline end-to-end before relying on it in production:

Inject a synthetic error. Issue a request that triggers a 500 response. Check your Jaeger or Tempo backend — the complete trace including all child spans must appear within decision_wait + export_latency seconds (typically under 30 seconds).
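Check the Collector metrics. The otelcol_processor_tail_sampling_sampling_decision_timer_bucket metric shows the distribution of decision latencies. The otelcol_processor_tail_sampling_count_traces_sampled and count_traces_not_sampled counters confirm policy hit rates. Metric names have shifted between Collector releases, so confirm the exact names on your instance's metrics endpoint before building alerts on them.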
Validate cross-service continuity. For a multi-service trace, confirm that spans from all services appear under the same traceId in the UI. Fragmented traces (spans missing from downstream services) indicate a propagation bug — re-read the W3C TraceContext inject/extract implementation.

# Query Jaeger HTTP API to verify a specific trace exists
curl -s "http://localhost:16686/api/traces/<TRACE_ID>" \
  | jq '.data[0].spans | length'
# Expect: number of spans equal to your service count × avg spans per service
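
The cross-service continuity check can be automated by parsing the same response. The sketch below assumes the JSON shape returned by the Jaeger query API used above, where each span carries a processID that indexes into a processes map holding the serviceName. The helper name and the sample payload are hypothetical.

```python
from typing import Iterable, Set


def missing_services(jaeger_response: dict, expected: Iterable[str]) -> Set[str]:
    """Return the expected services that contributed no span to the trace."""
    trace = jaeger_response["data"][0]
    processes = trace["processes"]
    seen = {processes[span["processID"]]["serviceName"] for span in trace["spans"]}
    return set(expected) - seen


# Hypothetical payload, trimmed to the fields the helper reads.
sample = {
    "data": [
        {
            "traceID": "abc123",
            "spans": [
                {"spanID": "s1", "processID": "p1"},
                {"spanID": "s2", "processID": "p2"},
            ],
            "processes": {
                "p1": {"serviceName": "payment-api"},
                "p2": {"serviceName": "ledger-service"},
            },
        }
    ]
}

gaps = missing_services(sample, ["payment-api", "ledger-service", "fraud-detector"])
print(gaps or "all expected services present")
```

A non-empty result points at the service whose inbound extract or outbound inject step is dropping the context.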

Step 5. Handle Async Jobs and Message Queue Traces

Context propagation across service meshes covers sidecar scenarios, but message queues require an additional pattern: attaching traceparent to message metadata so the consumer can restore the trace context.

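# Producer: inject traceparent into Kafka message headers.
# Assumes `producer` is a kafka-python KafkaProducer and `payload` is the serialized message body.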
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # adds traceparent, tracestate

producer.send(
    "payment-events",
    value=payload,
    headers=[(k, v.encode()) for k, v in headers.items()],
)

# Consumer: extract context before creating the processing span
from opentelemetry.propagate import extract
from opentelemetry import trace

def process_message(msg):
    carrier = {k: v.decode() for k, v in msg.headers}
    ctx = extract(carrier)

    tracer = trace.get_tracer("payment-consumer")
    with tracer.start_as_current_span("process-payment", context=ctx):
        # span is now a child of the producer's trace
        handle(msg.value)

Close the span explicitly after job completion, not when the message is dequeued. Use the tail_sampling processor’s string_attribute policy to target high-volume background jobs for drop, while keeping the status_code: ERROR policy at higher priority to preserve failure traces.

Edge Cases & Gotchas

Tail sampling across multiple Collector instances breaks trace grouping. Each instance sees only a partial span set for a given traceId — the retention policy fires with incomplete data. Fix: deploy a loadbalancing exporter tier in front of your tail-sampling Collectors, routing by traceId so all spans for one trace land on one instance.
decision_wait too short for async spans. If your async consumers take 20 seconds to process a message but decision_wait is 15 seconds, the consumer’s spans arrive after the decision and are dropped, leaving incomplete traces in storage. Set decision_wait to at least your 99th-percentile end-to-end trace duration plus a 20% buffer.
Missing OTEL_TRACES_SAMPLER_ARG defaults to 100%. A misconfigured ConfigMap or missing environment variable causes all head samplers to default to always_on, saturating the Collector and storage overnight.
Probabilistic head sampling leaves long gaps for low-traffic services. A service handling 50 requests per hour at a 10% ratio averages one sampled trace every 12 minutes and will regularly go 20–30 minutes without one. Use a minimum floor: max(ratio, 1 request per minute) by combining TraceIdRatioBased with a rate-limiting sampler (a custom one where your SDK does not ship it).
expected_new_spans_per_trace misconfigured. If this value is set too low (say 5 when your traces average 40 spans), the Collector pre-allocates too little memory and triggers excessive GC. Estimate your actual average by dividing the Collector's received-span rate by the rate of the otelcol_processor_tail_sampling_new_trace_id_received metric, then set the value near the P90 span count per trace if your backend can report it.
Independent per-service sampling fragments cross-service traces. If Service A and Service B each make an independent random 10% decision, the probability that a trace crossing both is fully captured falls to 1% rather than 10%. Deterministic trace-ID ratio sampling avoids this only while every service uses the same algorithm and the same ratio. ParentBased sampling fixes it in general: child services inherit the parent’s decision instead of resampling independently.
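
Three of the gotchas above reduce to short calculations. The helpers below are a sketch with hypothetical names: one sizes decision_wait from observed trace durations (P99 plus a 20% buffer), one gives the chance that a run of requests produces no sampled trace, and one gives the chance that services sampling independently all keep the same trace.

```python
from statistics import quantiles
from typing import Sequence


def recommended_decision_wait(trace_durations_s: Sequence[float], buffer: float = 0.20) -> float:
    """P99 end-to-end trace duration plus a safety buffer, in seconds."""
    p99 = quantiles(trace_durations_s, n=100)[98]  # 99 cut points; index 98 is P99
    return p99 * (1.0 + buffer)


def prob_no_sample(requests: int, ratio: float) -> float:
    """Chance that none of `requests` independently sampled root traces is kept."""
    return (1.0 - ratio) ** requests


def prob_full_capture(ratio: float, independent_services: int) -> float:
    """Chance that every independently sampling service keeps the same trace."""
    return ratio ** independent_services
```

For instance, prob_no_sample(25, 0.10) is about 7%, so a service that sees 25 requests in half an hour will fairly often record nothing in that window.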

Performance & Scale Notes

Collector memory: The tail_sampling processor holds all spans for in-flight traces. Memory scales as spans_per_second × avg_span_bytes × decision_wait_seconds. At 10 k spans/s, 2 KB/span, and a 15 s window, that is ~300 MB. With baggage or large attribute sets on each span, payload sizes can reach 5–10 KB, so multiply accordingly. Always deploy a memory_limiter processor upstream.
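
The sizing formula translates directly into a small estimator. This is a sketch with a hypothetical function name. It counts payload bytes only (taking 1 KB as 1,000 bytes) and ignores per-trace bookkeeping and GC overhead, which is why the headroom recommended elsewhere on this page is much larger than the baseline.

```python
def tail_buffer_bytes(spans_per_second: float, avg_span_bytes: float,
                      decision_wait_seconds: float) -> float:
    """Baseline bytes held in memory while traces wait for a decision."""
    return spans_per_second * avg_span_bytes * decision_wait_seconds


baseline = tail_buffer_bytes(10_000, 2_000, 15)
heavy = tail_buffer_bytes(10_000, 10_000, 15)  # spans carrying large attribute sets
print(f"baseline: {baseline / 1e6:.0f} MB")  # baseline: 300 MB
print(f"heavy:    {heavy / 1e6:.0f} MB")     # heavy:    1500 MB
```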

Batch exporter tuning: The batch processor between the tail sampler and the storage exporter buffers decided traces before writing. Set send_batch_size to 512–1024 and timeout to 5 s for most backends. For Jaeger or Tempo, match the batch size to the backend’s write buffer — Tempo’s object-storage backend benefits from larger batches (2048+) to reduce PUT request overhead.

Write burstiness at the storage layer: Tail sampling makes storage writes bursty, because retained traces arrive as complete bursts when the decision_wait timer fires for many traces at once. Backend storage engines with columnar compression (Tempo’s Parquet blocks) handle these bursts more efficiently than row-oriented stores.

CPU cost of policy evaluation: Each policy in the tail_sampling chain iterates over all spans in the trace. With 10 policies and 40 spans per trace at 1 k traces/s, that is 400 k span-level checks per second. Keep your policy list concise and remove redundant policies rather than relying on ordering, since each trace is checked against the configured policies regardless of their position.

Troubleshooting FAQ

Why are error traces missing when I use head-based sampling at 10%?

Head-based sampling makes a stateless decision at trace ingress before the outcome is known. At a 10% rate, 90% of errors are discarded before any span is written. Switch to tail-based sampling with a status_code: ERROR policy at the Collector, or implement a hybrid where the Collector keeps all errors from the pre-sampled stream.

How much memory does tail-based sampling require in the Collector?

At 10 k spans/s with a 15-second decision_wait and an average span payload of 2 KB, the Collector must buffer roughly 300 MB baseline. Allow 2–4 GB total for headroom and GC pressure. Deploy the memory_limiter processor upstream of tail_sampling to cap allocation and back-pressure the pipeline rather than OOM-crashing.

Can I change the sampling rate at runtime without redeploying?

For head sampling, update the OTEL_TRACES_SAMPLER_ARG environment variable and restart the process. The SDK reads the variable once at startup, so a signal alone cannot change the rate of a running process. For tail sampling in the Collector, update the config file and send SIGHUP. The Collector reloads its configuration without a redeploy, but it rebuilds the pipeline while doing so, so traces still buffered for a decision may be lost during the reload.

What happens when tail-based sampling spans are fragmented across multiple Collector instances?

Each Collector instance only sees partial spans for a given traceId, causing incomplete trace evaluation and incorrect policy decisions (an error trace may appear error-free on the instance holding only non-error spans). Fix this by deploying the loadbalancing exporter as a routing tier, configured with routing_key: traceID, in front of your tail-sampling Collector pool.
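
The idea behind trace-ID routing can be sketched in a few lines. This is a simplified illustration with a hypothetical function name and hypothetical backend addresses. It uses plain modulo hashing, whereas a production routing tier would typically use consistent hashing so that resizing the pool remaps as few traces as possible.

```python
import hashlib
from typing import Sequence


def pick_collector(trace_id: str, backends: Sequence[str]) -> str:
    """Map a trace ID to one backend so all of its spans land together."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical tail-sampling Collector pool.
pool = ["collector-0:4317", "collector-1:4317", "collector-2:4317"]
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# Every span of the trace, from any service, resolves to the same instance.
assert pick_collector(trace_id, pool) == pick_collector(trace_id, pool)
```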

Does tail-based sampling increase storage write amplification?

Yes. Every span transits the Collector before the retention decision, so you pay full ingestion cost. The Collector then writes only retained traces to storage — but at bursty intervals as decision_wait windows close simultaneously. Tune your batch exporter’s send_batch_size and timeout to smooth write bursts, and choose a storage backend with efficient bulk-write support.
