When to use tail-based sampling for microservices
Use tail-based sampling when you need guaranteed retention of every error trace, latency outlier, or business-critical request — and probabilistic head-based rates are silently dropping the failures you need to debug.
Context and when it matters
Head-based sampling commits a keep/drop decision at the first span of a request, before the outcome is known. At a 1–5 % probabilistic rate, 95–99 % of failing traces are discarded before any ERROR status is recorded, so failures that occur in fewer than 0.1 % of requests rarely leave enough traces to debug. That is the threshold for adopting tail-based sampling: the point where a sampling probability no longer yields usable failure data and deterministic, after-the-fact retention becomes necessary.
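The arithmetic behind that claim can be sketched as follows. The traffic volume, error rate, and sampling rate are illustrative assumptions rather than measurements, and the model assumes each trace is kept independently with probability equal to the sampling rate.

```python
def expected_errors_kept(total_requests: int, error_rate: float, sample_rate: float) -> float:
    """Expected number of failing traces that survive head-based sampling."""
    return total_requests * error_rate * sample_rate


def prob_no_error_kept(error_count: int, sample_rate: float) -> float:
    """Probability that none of `error_count` failing traces is sampled."""
    return (1.0 - sample_rate) ** error_count


# Hypothetical day of traffic: 1,000,000 requests, 0.05 % fail, 1 % sampled.
requests, error_rate, sample_rate = 1_000_000, 0.0005, 0.01
errors = round(requests * error_rate)  # failing requests per day
kept = expected_errors_kept(requests, error_rate, sample_rate)
print(f"{errors} failures, about {kept:.0f} retained")

# A short incident that produces only 20 failing requests.
print(f"P(no failing trace retained) = {prob_no_error_kept(20, sample_rate):.2f}")
```

Under these assumed numbers, a whole day of failures leaves only a handful of traces, and a 20-request incident is more likely than not to leave none at all.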
Tail-based sampling moves the decision to a collector-side buffer. Every span for a given trace_id is held in memory during a configurable decision_wait window (typically 30–60 seconds). Once the window closes, explicit policies evaluate the complete trace and commit only the traces that match retention rules. The cost is bounded in-memory overhead and an added decision latency; the benefit is 100 % retention of the traces that matter for root-cause analysis.
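The buffering behaviour described above can be illustrated with a toy model. This is a minimal sketch, not the Collector's implementation: the `TailBuffer` class, its methods, and the `has_error` rule are hypothetical names introduced here, and time is passed in explicitly so the example stays deterministic.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str
    name: str
    status: str = "OK"  # "OK" or "ERROR"


@dataclass
class TailBuffer:
    """Toy stand-in for a collector-side buffer (hypothetical, for illustration)."""
    decision_wait: float  # seconds
    first_seen: dict = field(default_factory=dict)  # trace_id -> arrival time of first span
    spans: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, span: Span, now: float) -> None:
        # The window starts when the first span of a trace arrives.
        self.first_seen.setdefault(span.trace_id, now)
        self.spans[span.trace_id].append(span)

    def flush(self, now: float, keep) -> list:
        """Decide every trace whose window has closed; return the kept traces."""
        kept = []
        due = [t for t, seen in self.first_seen.items() if now - seen >= self.decision_wait]
        for trace_id in due:
            trace = self.spans.pop(trace_id)
            del self.first_seen[trace_id]
            if keep(trace):
                kept.append(trace)
        return kept


def has_error(trace) -> bool:
    return any(s.status == "ERROR" for s in trace)


buf = TailBuffer(decision_wait=30.0)
buf.add(Span("t1", "GET /cart"), now=0.0)
buf.add(Span("t1", "db.query", status="ERROR"), now=2.0)  # error arrives after the root span
buf.add(Span("t2", "GET /health"), now=1.0)
early = buf.flush(now=10.0, keep=has_error)  # windows still open: nothing decided
late = buf.flush(now=31.0, keep=has_error)   # both windows closed: only t1 is kept
```

The point of the sketch is the ordering: the ERROR span for t1 arrives after its root span, yet the decision still sees it because nothing is decided until the window closes.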
Head-based vs tail-based: side-by-side comparison
| Dimension | Head-based | Tail-based |
|---|---|---|
| Decision point | First span, before outcome is known | Collector, after all spans arrive |
| Error retention guarantee | None — low-rate errors are dropped statistically | 100 % of ERROR spans if policy is set |
| P99 latency outlier capture | None — slow traces dropped at same rate as fast ones | Guaranteed — latency policy applied to full trace |
| Memory overhead | Stateless — negligible | In-memory buffer: ~1–2 GB per 10 K traces/sec at 30 s window |
| Decision latency added | Zero | decision_wait window (30–60 s typical) |
| Config complexity | Low — single SDK sampler | Medium — collector pipeline + ordered policy list |
| Async boundary handling | Fragile — independent samplers break trace continuity | Robust — all spans correlated before decision |
| Best for | High-volume, low-severity background traffic | Error-critical, SLO-gated, or compliance-scoped workloads |
Implementation: OpenTelemetry Collector tail sampling configuration
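The tail_sampling processor in the OpenTelemetry Collector (contrib distribution) runs every policy in the list against the complete trace and keeps the trace if any policy votes to sample it, so a trace that matches keep-errors is retained whatever the probabilistic fallback decides. List order therefore does not set priority; placing deterministic rules before the probabilistic fallback is a readability convention that makes the intent obvious.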
processors:
  tail_sampling:
    # Buffer window for trace completion.
    # 30 s is standard; raise to 60 s for high-latency async hops.
    decision_wait: 30s
    # Max concurrent traces held in the decision cache.
    # Must cover expected_new_traces_per_sec × decision_wait
    # (10 000 × 30 = 300 000 at steady state) plus headroom for spikes;
    # exceeding it evicts pending traces, so size carefully.
    num_traces: 400000
    # Expected throughput for cache pre-allocation.
    expected_new_traces_per_sec: 10000
    policies:
      # 1. Deterministic error retention.
      #    Keeps any trace containing at least one ERROR span.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # 2. Latency SLO breach capture.
      #    Retains traces whose end-to-end duration (earliest span start
      #    to latest span end) exceeds 2 000 ms.
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      # 3. Business-attribute matching (tenant tier, payment flows).
      #    Use coarse tags; avoid PII and high-cardinality fields.
      - name: keep-critical-tenant
        type: string_attribute
        string_attribute:
          key: tenant_tier
          values: ["enterprise", "vip"]
      # 4. Probabilistic fallback for baseline traffic coverage.
      #    Adds a 5 % sample of traces that no rule above retained.
      - name: probabilistic-fallback
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
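To make the effect of the four policies concrete, the sketch below classifies a complete trace the way the configuration above intends. It is a simplified model under stated assumptions, not the processor's code: the `Span` shape, the function names, the strict `>` comparison at the latency threshold, and the SHA-256 bucketing used for the probabilistic rule are all hypothetical choices made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Span:
    start_ms: int
    end_ms: int
    status: str = "OK"
    attributes: dict = field(default_factory=dict)


def keep_errors(trace) -> bool:
    return any(s.status == "ERROR" for s in trace)


def keep_slow(trace, threshold_ms: int = 2000) -> bool:
    # Duration is measured from the earliest start to the latest end in the trace.
    duration = max(s.end_ms for s in trace) - min(s.start_ms for s in trace)
    return duration > threshold_ms  # boundary handling is an assumption


def keep_critical_tenant(trace, values=("enterprise", "vip")) -> bool:
    return any(s.attributes.get("tenant_tier") in values for s in trace)


def probabilistic(trace_id: str, percentage: float = 5.0) -> bool:
    # Hash the trace ID into one of 10,000 buckets so the decision is repeatable.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < percentage * 100


def decide(trace_id: str, trace) -> str:
    """Return the name of the rule that retains the trace, or 'drop'."""
    if keep_errors(trace):
        return "keep-errors"
    if keep_slow(trace):
        return "keep-slow"
    if keep_critical_tenant(trace):
        return "keep-critical-tenant"
    if probabilistic(trace_id):
        return "probabilistic-fallback"
    return "drop"


print(decide("a", [Span(0, 100, status="ERROR")]))
print(decide("b", [Span(0, 100), Span(50, 2500)]))
print(decide("c", [Span(0, 10, attributes={"tenant_tier": "vip"})]))
```

A fast, successful trace from a non-critical tenant reaches only the probabilistic rule, so roughly 5 % of that baseline traffic is kept.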
Cache sizing formula:
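Memory (GiB) ≈ (traces_per_sec × decision_wait_sec × avg_trace_bytes) / 1_073_741_824
Here avg_trace_bytes is the average number of spans per trace multiplied by the average span size.
Example: 10 000 traces/sec × 30 s × 5 120 bytes per trace ≈ 1.43 GiB baseline. Add 20–30 % for Go runtime and policy evaluation overhead, which gives roughly 1.7–1.9 GiB. Provision your collector pods accordingly; an OOM kill during a traffic spike loses every pending trace in the buffer.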
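The sizing arithmetic can be checked with a few lines of Python. This is a sketch that treats the 5 120-byte figure as the average size of one buffered trace (an assumption); the function names are hypothetical.

```python
GIB = 1_073_741_824  # bytes per GiB


def buffer_gib(traces_per_sec: int, decision_wait_sec: int, avg_trace_bytes: int) -> float:
    """Baseline buffer size: traces in flight times average bytes per trace."""
    return traces_per_sec * decision_wait_sec * avg_trace_bytes / GIB


def traces_in_flight(traces_per_sec: int, decision_wait_sec: int) -> int:
    """Number of traces held at steady state during one decision window."""
    return traces_per_sec * decision_wait_sec


baseline = buffer_gib(10_000, 30, 5_120)
low, high = baseline * 1.20, baseline * 1.30  # 20-30 % runtime overhead
print(f"baseline {baseline:.2f} GiB, provision {low:.2f}-{high:.2f} GiB")
print(f"traces in flight: {traces_in_flight(10_000, 30)}")
```

The second figure is a useful cross-check: a num_traces value below the number of traces in flight forces eviction before the decision window closes.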
Async boundary edge case: Kafka consumer span loss
Asynchronous message brokers are the most common source of broken trace continuity under head-based sampling. A producer samples at 10 % and publishes a message with a traceparent header. An independently deployed consumer samples at 1 %. The consumer processes a message that triggers a database timeout. Because the consumer’s sampler makes a fresh keep/drop decision instead of honoring the sampled flag in the extracted W3C Trace Context (as a parent-based sampler would), it drops the span, and the error disappears from storage.
Diagnosing this:
- Query storage for partial traces: filter for span_count < expected_service_count grouped by trace_id. Identify gaps where the consumer hop is absent.
- Inspect broker message metadata for traceparent and tracestate headers. A missing or malformed traceparent breaks the parent-child linkage entirely; the consumer span becomes an orphan with a new root trace_id.
- Cross-reference application logs containing the database error with the missing trace_id values. If logs exist but spans do not, the consumer’s sampler is the culprit.
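The first and third diagnostic steps can be sketched offline against exported data. The record shape (dictionaries with trace_id and service keys), the service names, and the function name are assumptions for illustration; a real backend query would look different.

```python
from collections import defaultdict


def find_partial_traces(spans, expected_services):
    """Map trace_id -> set of expected services that have no span in that trace."""
    seen = defaultdict(set)
    for span in spans:
        seen[span["trace_id"]].add(span["service"])
    expected = set(expected_services)
    return {
        trace_id: expected - services
        for trace_id, services in seen.items()
        if expected - services
    }


# Hypothetical export from the trace backend.
spans = [
    {"trace_id": "a1", "service": "order-producer"},
    {"trace_id": "a1", "service": "order-consumer"},
    {"trace_id": "b2", "service": "order-producer"},  # consumer hop missing
]
missing = find_partial_traces(spans, ["order-producer", "order-consumer"])

# Hypothetical trace IDs pulled from application logs that contain the database error.
error_log_trace_ids = {"b2", "c3"}
suspects = error_log_trace_ids & set(missing)  # logged errors whose consumer span is absent
print(missing, suspects)
```

Trace IDs that appear in `suspects` have an error in the logs but no consumer span in storage, which points at the consumer's sampler. An ID such as c3, which has logs but no spans at all, is worth checking for a broken traceparent instead.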
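Tail-based sampling resolves this because the collector receives all spans, producer and consumer, and evaluates the complete trace once the decision_wait window closes. The async gap is transparent to the retention policy, provided the SDKs export every span and all spans for a given trace_id reach the same collector instance (for example, through trace-ID-aware load balancing in front of the sampling tier).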
Decision rules
Use tail-based sampling when:
- Your service error rate is below 1 % and head-based probabilistic rates routinely discard those failures before storage.
- You need guaranteed capture of P99 latency outliers for SLO accountability across service boundaries.
- You operate async consumers (Kafka, SQS, RabbitMQ) at independent deployment cadences, making consistent head-based rates across the trace impossible to enforce.
- Compliance or audit requirements mandate 100 % retention of traces carrying specific tenant or transaction identifiers.
Continue with head-based sampling when:
- Your primary goal is cost control on background health-check or metrics-scrape traffic where errors are irrelevant.
- Your infrastructure cannot accommodate a stateful collector buffer (edge deployments, memory-constrained containers).
- Your error rate is high enough (> 5 %) that probabilistic sampling already captures a statistically useful sample of failures.
Common pitfalls
- Misordering policies. Placing the probabilistic fallback before the deterministic rules hides the intent of the configuration, but it does not change which traces are kept: the processor evaluates every policy and retains a trace when any of them votes to sample. If error traces are being kept only at the probabilistic rate, check that the status_code policy is present and that spans actually carry an ERROR status.
- Undersizing the cache. Setting num_traces too low triggers eviction of in-flight traces during spikes. The evicted traces are silently dropped, which is exactly the data loss tail-based sampling is meant to prevent. Use the sizing formula above and monitor otelcol_processor_tail_sampling_num_traces_on_decision_service.
- Matching on PII or high-cardinality attributes. The string_attribute policy evaluates raw span metadata. Avoid user_id, email, or IP address fields; use hashed tenant identifiers or coarse-grained service tags to prevent memory bloat and comply with data-governance boundaries. See security boundaries in distributed tracing for attribute-level PII controls.
Troubleshooting FAQ
Why are errors still missing after enabling tail-based sampling?
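Check that the keep-errors policy is present in the policies list and that your SDK is emitting spans with StatusCode = ERROR (recording an exception event or an error log alone does not set the span status). Also confirm that the failing span reaches the collector before decision_wait expires; spans that arrive after the decision has been made cannot change it. Use otelcol_processor_tail_sampling_sampling_policy_evaluation_errors to detect evaluation failures.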
How do I confirm the decision window is long enough?
Inject a synthetic request that traverses every service hop and measure the time from the first span start to the last span end using your Jaeger or Tempo UI. Set decision_wait to at least that duration plus 10 s of network jitter margin.
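That measurement reduces to a small calculation once the span timestamps are exported. The sketch below assumes each synthetic trace is available as a list of (start, end) pairs in seconds; the function name and sample values are hypothetical.

```python
import math


def recommended_decision_wait(traces, jitter_margin_sec: float = 10.0) -> int:
    """Longest observed trace duration plus a jitter margin, rounded up to whole seconds."""
    longest = max(
        max(end for _, end in spans) - min(start for start, _ in spans)
        for spans in traces
    )
    return math.ceil(longest + jitter_margin_sec)


# Each synthetic trace is a list of (start_sec, end_sec) pairs, one per span.
synthetic = [
    [(0.0, 0.4), (0.1, 0.3), (12.5, 14.2)],  # async consumer span starts 12.5 s later
    [(0.0, 0.5), (0.2, 8.9)],
]
print(f"decision_wait: {recommended_decision_wait(synthetic)}s")
```

Running several synthetic requests and taking the maximum, as here, guards against sizing the window from a single unusually fast run.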
What happens during a collector restart?
In-memory buffered traces are lost. During rolling restarts, route traffic back to a head-based sampler at the SDK layer until the new collector pod is ready to accept spans. Document this in your SRE runbook alongside the tail-sampling circuit-breaker fallback.
Related
- Choosing Between Head-Based and Tail-Based Sampling — full strategy comparison with SDK configuration examples
- Trace Storage Backend Comparison: Jaeger vs Tempo — aligning your backend’s retention TTL with the tail-sampling decision window
- Propagating Trace Context Through Kafka Consumers — fixing async boundary span loss at the SDK propagation layer
- Security Boundaries in Distributed Tracing — PII controls for span attributes used in sampling policies