Trace Storage Backend Comparison: Jaeger vs Tempo

Problem Framing

Your observability stack ingests spans reliably, but production queries are timing out, storage bills are growing faster than your trace volume, or a compliance audit has flagged retention gaps. The root cause is usually not instrumentation — it is a mismatch between your chosen storage backend and the workload it was not designed for. Jaeger paired with Elasticsearch works well for rich ad-hoc tag queries but collapses under write amplification at scale. Grafana Tempo eliminates the indexing bottleneck entirely but requires a different query mental model. Choosing the wrong backend at this juncture means either a painful re-migration later or silent span loss that undermines every sampling decision you have made.

Prerequisites

Before evaluating backends, ensure the following are in place:

  • OpenTelemetry SDK initialised in your services and exporting via OTLP (see OpenTelemetry SDK Setup for Backend Services)
  • W3C TraceContext propagation configured across all service boundaries so traceID is consistent end-to-end
  • OpenTelemetry Collector deployed as an intermediary (version ≥ 0.90.0) — both backends receive spans over OTLP gRPC
  • Baseline metrics: peak spans-per-second (SPS), average spans-per-trace, and current storage IOPS/cost
  • Network access to an S3-compatible object store (for Tempo) or an Elasticsearch/Cassandra cluster (for Jaeger)

Concept Deep-Dive: Storage Engine Architecture

The architectural divergence between the two backends is fundamental. Understanding it prevents operational surprises.

Jaeger: Index-First, Database-Backed

Jaeger pairs an Elasticsearch or Cassandra primary store with a mandatory indexing layer. Every ingested span triggers a dual write: one to the primary document store (full span payload) and one to the inverted index (tag keys, tag values, duration buckets, service names). This index-first model is what enables Jaeger’s rich filtering — you can query error=true AND http.method=POST AND duration>500ms across billions of spans — but it comes with three structural costs:

  1. Write amplification: 2–3× storage writes per span during peak ingestion.
  2. Non-linear query latency: P99 latency grows with index size; shard rebalancing during high-ingest windows can cause multi-second stalls.
  3. Operational surface: JVM heap tuning, shard rebalancing, ILM policy coordination, and hot/warm tier management require dedicated SRE attention.

Tempo: ID-First, Object-Storage-Native

Tempo stores spans directly in S3, GCS, or Azure Blob Storage, serialised as columnar Parquet blocks. Trace lookup is keyed by traceID, the same 128-bit identifier propagated via traceparent. There is no per-attribute inverted index in the default configuration. This eliminates index write amplification and keeps per-span ingestion cost nearly constant regardless of attribute cardinality. Attribute-based filtering takes one of two routes: TraceQL (Tempo ≥ 2.0 on a Parquet block format such as vParquet3), which scans Parquet columns at query time instead of consulting an index, or an external discovery path, for example using Prometheus exemplars or Loki log correlation to obtain a traceID and then fetching the full trace from Tempo by ID.

The two write paths diverge at the ingester layer:

[Figure: Jaeger vs Tempo write path architecture. Left (Jaeger, index-backed): spans flow from the OpenTelemetry Collector to the Jaeger Collector, then fan out in a dual write (2–3× amplification) to the primary document store (Elasticsearch or Cassandra) and to an inverted index (tags, durations, services); the Jaeger Query UI/API serves tag-filter and traceID lookups. Right (Tempo, object-storage-native): spans flow from the OpenTelemetry Collector to the Tempo Distributor and on to the in-memory Ingester, which flushes Parquet blocks keyed by traceID to S3, GCS, or Blob storage with no separate index.]
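The difference between the two write paths can be sketched as a toy in-memory model. This is only an illustration under simplifying assumptions: the types IndexedStore and IDStore are hypothetical, the index write is counted as a single extra write per span, and neither type reflects the real Jaeger or Tempo code.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a minimal stand-in for a trace span (illustrative only).
type Span struct {
	TraceID string
	Tags    map[string]string
}

// IndexedStore mimics an index-first backend: each span costs one document
// write plus one index write covering its tags.
type IndexedStore struct {
	docs   map[string][]Span
	index  map[string]map[string]bool // "key=value" -> set of trace IDs
	Writes int
}

func NewIndexedStore() *IndexedStore {
	return &IndexedStore{docs: map[string][]Span{}, index: map[string]map[string]bool{}}
}

func (s *IndexedStore) Put(sp Span) {
	s.docs[sp.TraceID] = append(s.docs[sp.TraceID], sp)
	s.Writes++ // primary document write
	for k, v := range sp.Tags {
		key := k + "=" + v
		if s.index[key] == nil {
			s.index[key] = map[string]bool{}
		}
		s.index[key][sp.TraceID] = true
	}
	s.Writes++ // index write, modelled as one extra write per span
}

// FindByTag answers a tag query from the index without scanning documents.
func (s *IndexedStore) FindByTag(key, value string) []string {
	ids := make([]string, 0, len(s.index[key+"="+value]))
	for id := range s.index[key+"="+value] {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// IDStore mimics an ID-first backend: one write per span, lookup by trace ID only.
type IDStore struct {
	traces map[string][]Span
	Writes int
}

func NewIDStore() *IDStore { return &IDStore{traces: map[string][]Span{}} }

func (s *IDStore) Put(sp Span) {
	s.traces[sp.TraceID] = append(s.traces[sp.TraceID], sp)
	s.Writes++
}

func (s *IDStore) Get(traceID string) []Span { return s.traces[traceID] }

func main() {
	spans := []Span{
		{TraceID: "t1", Tags: map[string]string{"error": "true"}},
		{TraceID: "t2", Tags: map[string]string{"error": "false"}},
	}
	idx, ids := NewIndexedStore(), NewIDStore()
	for _, sp := range spans {
		idx.Put(sp)
		ids.Put(sp)
	}
	fmt.Println("indexed writes:", idx.Writes, "id-first writes:", ids.Writes)
	fmt.Println("traces with error=true:", idx.FindByTag("error", "true"))
	fmt.Println("spans in t1:", len(ids.Get("t1")))
}
```

The index-first store pays for every span twice but can answer a tag query directly; the ID-first store writes once and can only answer "give me trace t1" unless something else supplies the ID or scans the stored data.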

Step-by-Step Implementation

Step 1: Audit Infrastructure and Establish a Cost Baseline

Before routing a single span to either backend, measure what you have:

  • Count existing S3/GCS buckets and their lifecycle tiers (Standard, Infrequent Access, Archive).
  • Record baseline Elasticsearch IOPS, shard count, and heap pressure (_cat/nodes?v&h=heap.percent).
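  • Calculate peak ingestion: take the per-second rate of the Collector's otelcol_receiver_accepted_spans counter, for example rate(otelcol_receiver_accepted_spans[1m]) in Prometheus (newer Collector builds may expose the counter with a _total suffix). The batch processor's otelcol_processor_batch_batch_size_trigger_send counts size-triggered flushes, not spans, so it is not a substitute.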
  • Estimate cost: Elasticsearch at 50 GB/day costs roughly 3–5× more than equivalent Tempo Parquet blocks in S3 Standard, before accounting for replica shards.

Step 2: Configure Dual-Backend OTLP Routing for Evaluation

Run both backends in parallel during your evaluation window. The OpenTelemetry Collector handles dual-export transparently:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

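processors:
  batch:
    send_batch_size: 512          # flush once 512 spans are buffered (default is 8192)
    send_batch_max_size: 512      # hard cap per batch; must be >= send_batch_size. Dual-write adds ~15-20% egress
    timeout: 5s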

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: false
  otlp/tempo:
    endpoint: tempo-distributor:4317
    tls:
      insecure: false

service:
  pipelines:
    traces/evaluation:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlp/tempo]

Keep dual-write active for at least one full business-day cycle so you capture peak-load behaviour in both backends simultaneously.

Step 3: Configure the Go SDK for OTLP Export

The SDK is backend-agnostic — it sends to the Collector, which routes onward. Only the Collector config changes between backends. A production-ready Go tracer setup:

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.27.0"
    "google.golang.org/grpc/credentials"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    // Export to the Collector; the Collector fans out to the chosen backend(s).
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("collector:4317"),
        // Enforce TLS — never send spans over plaintext in production.
        otlptracegrpc.WithTLSCredentials(credentials.NewClientTLSFromCert(nil, "")),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        // Batch spans so each export RPC carries many spans instead of one.
        sdktrace.WithBatcher(exp),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("payment-service"),
        )),
    )
    otel.SetTracerProvider(tp)
    // W3C TraceContext ensures consistent traceID across service boundaries.
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return tp, nil
}

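With the SDK defaults, the batcher exports every 5 s or as soon as 512 spans are queued, whichever comes first, so spans can reach the Collector up to 5 s after they end. In exchange, it avoids issuing one export RPC per span, which would otherwise saturate Collector ingress at high SPS. Pass sdktrace.WithBatchTimeout to WithBatcher if you need fresher data.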
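To make the batching trade-off concrete, here is a minimal size-triggered batcher. It is a sketch, not the SDK's implementation: the Batcher type is hypothetical, and the timer-based flush that the real batch span processor also performs is left out so the behaviour stays deterministic.

```go
package main

import "fmt"

// Batcher buffers span names and hands them to flush once maxBatch is reached.
// A production batcher would also flush on a timer; that is omitted here.
type Batcher struct {
	maxBatch int
	buf      []string
	flush    func(batch []string)
}

func NewBatcher(maxBatch int, flush func([]string)) *Batcher {
	return &Batcher{maxBatch: maxBatch, flush: flush}
}

// Add queues one span and triggers an export when the buffer is full.
func (b *Batcher) Add(span string) {
	b.buf = append(b.buf, span)
	if len(b.buf) >= b.maxBatch {
		b.Flush()
	}
}

// Flush exports whatever is buffered; call it on shutdown to avoid losing spans.
func (b *Batcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	batch := b.buf
	b.buf = nil
	b.flush(batch)
}

func main() {
	exports, exported := 0, 0
	b := NewBatcher(512, func(batch []string) {
		exports++
		exported += len(batch)
	})
	for i := 0; i < 1200; i++ {
		b.Add(fmt.Sprintf("span-%d", i))
	}
	b.Flush() // drain the partial final batch
	fmt.Println("export calls:", exports, "spans exported:", exported)
}
```

The final Flush matters in the same way that calling Shutdown on the tracer provider matters: without it, the partial last batch never leaves the process.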

Step 4: Align Sampling Strategy with Backend Constraints

Storage costs scale directly with ingestion volume, making sampling alignment a prerequisite rather than an afterthought. For a full treatment of the trade-offs, see Choosing Between Head-Based and Tail-Based Sampling.

Head-based sampling at the SDK layer caps baseline ingestion for both backends:

  • Java (Spring Boot): otel.traces.sampler=parentbased_traceidratio, otel.traces.sampler.arg=0.1
  • Node.js: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) })

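Tail-based sampling in the Collector retains 100% of the error and slow traces that reach it. It cannot recover traces already dropped by head-based sampling in the SDK, so if every error trace must be kept, run the SDKs at a ratio of 1.0 and let the Collector make the only sampling decision. Tail sampling runs in the Collector, not inside Tempo or Jaeger; both backends simply receive whatever the Collector forwards: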

processors:
  tail_sampling:
    decision_wait: 15s            # hold spans this long before deciding
    policies:
      - name: always-sample-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: always-sample-slow
        type: latency
        latency: { threshold_ms: 2000 }
      - name: default-probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]   # swap for otlp/jaeger during evaluation

Step 5: Configure Retention, Compaction, and Lifecycle Management

Retention mechanics diverge sharply. Jaeger manages lifecycle through Elasticsearch ILM index rotation; Tempo uses block-based compaction against object storage.

Jaeger ILM policy (Elasticsearch _ilm/policy/jaeger-traces), shown as YAML for readability; the API itself accepts the equivalent JSON body:

# elasticsearch ILM policy for Jaeger trace indices
policy:
  phases:
    hot:
      min_age: "0ms"
      actions:
        rollover:
          max_primary_shard_size: "50gb"
          max_age: "2d"
    warm:
      min_age: "2d"
      actions:
        shrink:
          number_of_shards: 1
        forcemerge:
          max_num_segments: 1
    delete:
      min_age: "14d"
      actions:
        delete: {}

For compliance tiers, apply separate ILM policies per index alias (e.g., jaeger-pci-* with 365-day delete phase, jaeger-internal-* with 14-day delete phase).

Tempo block retention and compaction (tempo.yaml):

storage:
  trace:
    backend: s3
    s3:
      bucket: "traces-prod"
      region: "us-east-1"

compactor:
  compaction:
    compaction_window: 1h
    block_retention: 336h       # 14 days; block_retention lives under compactor.compaction
    max_compaction_objects: 6000000

# For compliance-tier multi-tenancy, override per tenant:
# overrides:
#   "pci-tenant":
#     block_retention: 8760h   # 365 days

Object storage lifecycle rules are eventually consistent. Before marking a retention window as enforced, verify compaction and retention progress from the compactor's metrics; in recent Tempo releases these carry a tempodb_ prefix, for example tempodb_compaction_blocks_total and tempodb_retention_deleted_total. For Jaeger, compliance retention requires coordinating index rollover with the warm-phase shrink and force-merge so that the hot tier does not fill up.

Step 6: Execute Zero-Downtime Migration

Transition from any legacy agent (Zipkin, Jaeger Thrift, proprietary) to OTLP using a phased Collector approach:

receivers:
  zipkin:
    endpoint: 0.0.0.0:9411      # legacy Zipkin HTTP receiver
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Tag spans so you can filter migration-sourced data during validation.
          # This processor runs only in the Zipkin pipeline below, so no condition is needed.
          - set(attributes["migration.source"], "zipkin")
  batch:
    send_batch_size: 512
    send_batch_max_size: 512     # must be >= send_batch_size

exporters:
  otlp/legacy:
    endpoint: "jaeger-collector:4317"   # Jaeger's OTLP gRPC port; 14250 serves the older jaeger.proto API, not OTLP
    tls:
      insecure: true             # plaintext hop to the legacy backend; restrict it with NetworkPolicy
  otlp/target:
    endpoint: "tempo-distributor:4317"
    tls:
      insecure: false

service:
  pipelines:
    traces/zipkin:
      receivers: [zipkin]
      processors: [transform, batch]
      exporters: [otlp/legacy, otlp/target]
    traces/otlp:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/legacy, otlp/target]

Migration steps in order:

  1. Deploy the Collector as a DaemonSet or sidecar with dual-export routing active.
  2. Standardise on 128-bit hex traceID generation; disable legacy ID truncation on Zipkin producers.
  3. Gradually shift read traffic to the target backend using feature flags or API gateway routing, validating query parity and latency SLAs.
  4. Decommission legacy agents after validation, monitoring for orphaned spans — spans whose parentSpanId does not resolve to a known span in the target backend.
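The orphaned-span check in step 4 is easy to automate once you have the spans of a trace in hand. A minimal sketch, assuming a hypothetical Span type that holds only the two IDs involved:

```go
package main

import "fmt"

// Span holds the identifiers needed to check parent resolution.
type Span struct {
	SpanID       string
	ParentSpanID string // empty for a root span
}

// Orphans returns the IDs of spans whose parent is not present in the trace.
func Orphans(spans []Span) []string {
	known := make(map[string]bool, len(spans))
	for _, s := range spans {
		known[s.SpanID] = true
	}
	var out []string
	for _, s := range spans {
		if s.ParentSpanID != "" && !known[s.ParentSpanID] {
			out = append(out, s.SpanID)
		}
	}
	return out
}

func main() {
	trace := []Span{
		{SpanID: "a"},                     // root
		{SpanID: "b", ParentSpanID: "a"},  // resolves
		{SpanID: "c", ParentSpanID: "zz"}, // parent never arrived
	}
	fmt.Println("orphaned spans:", Orphans(trace))
}
```

Run the check per trace against the target backend only; a span that is orphaned there but resolves in the legacy backend points at a routing or propagation gap introduced by the migration.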

Verification

Confirm the backend is receiving and storing traces correctly before ending the evaluation period:

# Verify Tempo is ingesting: check distributor received bytes
curl -s http://tempo:3200/metrics | grep tempo_distributor_bytes_received_total

# Look up a known traceID directly via Tempo HTTP API
curl "http://tempo:3200/api/traces/<traceID>"

# Verify Jaeger query returns the same trace
curl "http://jaeger-query:16686/api/traces/<traceID>"

# Check for compaction errors in Tempo (compactor metrics carry the tempodb_ prefix)
curl -s http://tempo:3200/metrics | grep tempodb_compaction_errors_total

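For Jaeger, open the UI at http://jaeger-query:16686 and run a tag-based search (error=true) spanning the evaluation window. Compare the result count against the equivalent Tempo TraceQL query in Grafana → Explore → Tempo. For OTLP-instrumented services, { status = error } matches spans whose status is ERROR, which is what Jaeger surfaces as error=true; use { .error = true } only if your services set an explicit error attribute.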
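Comparing span counts for the same traceID across both backends is the most direct parity check. The sketch below counts spans in the two JSON responses. The response shapes are assumptions to verify against your versions: Jaeger's query API is taken to return {"data":[{"spans":[...]}]}, and Tempo is taken to return OTLP-style JSON with a top-level "batches" or "resourceSpans" array containing "scopeSpans". The function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourceSpans matches one OTLP-style batch in a Tempo response (assumed shape).
type resourceSpans struct {
	ScopeSpans []struct {
		Spans []json.RawMessage `json:"spans"`
	} `json:"scopeSpans"`
}

// CountTempoSpans counts spans in a Tempo trace-by-ID JSON response.
func CountTempoSpans(body []byte) (int, error) {
	var t struct {
		Batches       []resourceSpans `json:"batches"`
		ResourceSpans []resourceSpans `json:"resourceSpans"`
	}
	if err := json.Unmarshal(body, &t); err != nil {
		return 0, err
	}
	n := 0
	for _, rs := range append(t.Batches, t.ResourceSpans...) {
		for _, ss := range rs.ScopeSpans {
			n += len(ss.Spans)
		}
	}
	return n, nil
}

// CountJaegerSpans counts spans in a Jaeger query API JSON response.
func CountJaegerSpans(body []byte) (int, error) {
	var j struct {
		Data []struct {
			Spans []json.RawMessage `json:"spans"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &j); err != nil {
		return 0, err
	}
	n := 0
	for _, tr := range j.Data {
		n += len(tr.Spans)
	}
	return n, nil
}

func main() {
	// In practice these bodies come from the two curl commands above.
	tempo := []byte(`{"batches":[{"scopeSpans":[{"spans":[{"spanId":"a"},{"spanId":"b"}]}]}]}`)
	jaeger := []byte(`{"data":[{"traceID":"t","spans":[{"spanID":"a"},{"spanID":"b"}]}]}`)

	tn, _ := CountTempoSpans(tempo)
	jn, _ := CountJaegerSpans(jaeger)
	fmt.Println("tempo spans:", tn, "jaeger spans:", jn, "match:", tn == jn)
}
```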

Decision Matrix

Workload Profile | Recommended Backend | Rationale
High-throughput microservices (>50k SPS) | Tempo | Object storage scales linearly; ID-based lookup avoids ES shard bottlenecks. Lower TCO at massive ingestion volumes.
Long-retention compliance (1–3+ years) | Tempo | Native integration with S3/GCS lifecycle policies. Block compaction reduces footprint without active DB maintenance.
Rich ad-hoc tag filtering and debugging | Jaeger | Elasticsearch indexing enables complex boolean queries across arbitrary span attributes without pre-aggregation.
Cloud-native / Kubernetes-first | Tempo | Distributors and queriers are stateless, and ingesters need only local disk for their write-ahead log, so every component runs as a pod. No external DB cluster dependency.
Hybrid / on-premises legacy environments | Jaeger | Mature Cassandra/ES deployments already exist in many enterprises. Easier to integrate with existing SIEM and log pipelines.
Low operational maturity / small teams | Tempo | Managed object storage plus stateless collectors reduce operational overhead. Fewer components to monitor and tune.

Edge Cases and Gotchas

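  1. Long-running async traces outlive the time windows that Tempo and the Collector assume. Tempo ingesters cut a trace to disk once it has been idle for trace_idle_period (seconds by default), so spans from an overnight batch job end up spread across many blocks, and the Collector's tail_sampling processor decides on a trace once decision_wait has elapsed, so spans that arrive much later may be evaluated separately from the rest of their trace. Do not rely on a distributor-level max_duration setting to fix this: in current Tempo releases max_duration sits under query_frontend.search and bounds the time range of a search, so widen it (or search narrower windows) if long traces appear to be missing from results. Avoid raising decision_wait to hours, because the processor holds every pending trace in memory for that long; exempt long-running workflows from tail sampling or split them into shorter linked traces instead. See handling async boundaries for context on how async execution models interact with trace duration assumptions.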

  2. Jaeger high-cardinality attribute explosion. Indexing user IDs, request bodies, or full URL paths causes unbounded ES shard growth. In Jaeger v1, keep --es.tags-as-fields.all=false on the collector and list only the attributes you routinely filter on with --es.tags-as-fields.include, so that only those keys become mapped fields. This is a span lifecycle concern: attribute cardinality decisions made at instrumentation time have direct storage consequences.

  3. Async boundary traceID fragmentation. Message queues and async workers break synchronous request boundaries. Inject traceparent and tracestate into Kafka message headers or RabbitMQ properties at the producer. Both backends handle out-of-order span arrival, but Tempo batches spans by traceID before flushing to object storage — highly fragmented async traces may span multiple Parquet blocks, increasing query latency.

  4. Dual-write evaluation overhead is non-trivial. Routing to both backends simultaneously adds ~15–20% network egress from the Collector. Set send_batch_size: 512 and send_batch_max_size: 512 on the batch processor (the cap must not be lower than send_batch_size, which defaults to 8192) and monitor otelcol_exporter_queue_size on both exporters to detect backpressure before it causes drops.

  5. Legacy endpoint TLS gaps. Legacy Jaeger ingestion endpoints (gRPC on port 14250, Thrift over HTTP on port 14268) often lack mTLS. Terminate TLS at the Collector ingress and enforce strict Kubernetes NetworkPolicy to prevent plaintext span leakage to the legacy backend during the migration window.

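  6. Tempo attribute search depends on block format and query-time scans. Tempo does not build an inverted index over span attributes; TraceQL search (Tempo ≥ 2.0) scans the columns of Parquet blocks instead. Changing the block format version, for example to vParquet3, affects only newly written blocks, so blocks written in an older format keep that format until retention removes them and may not support the same queries. Check which format your existing blocks use and validate search coverage across the whole retention window before promoting to production.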
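For the async-boundary case in item 3, the producer writes a traceparent value into the message headers and the consumer reads it back. In a real service the OpenTelemetry propagator does this for you; the sketch below only shows the wire format (version 00 of W3C Trace Context) using a plain map as a stand-in for Kafka or RabbitMQ headers. Inject and Extract here are hypothetical helpers, not SDK functions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const headerName = "traceparent"

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// Inject writes a version-00 traceparent value into the message headers.
func Inject(headers map[string][]byte, traceID, spanID string, sampled bool) error {
	if !isLowerHex(traceID, 32) || !isLowerHex(spanID, 16) {
		return fmt.Errorf("trace ID needs 32 and span ID 16 lowercase hex characters")
	}
	if strings.Trim(traceID, "0") == "" || strings.Trim(spanID, "0") == "" {
		return fmt.Errorf("all-zero IDs are invalid")
	}
	flags := "00"
	if sampled {
		flags = "01"
	}
	headers[headerName] = []byte("00-" + traceID + "-" + spanID + "-" + flags)
	return nil
}

// Extract parses the header on the consumer side.
func Extract(headers map[string][]byte) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(string(headers[headerName]), "-")
	if len(parts) != 4 || parts[0] != "00" ||
		!isLowerHex(parts[1], 32) || !isLowerHex(parts[2], 16) || !isLowerHex(parts[3], 2) {
		return "", "", false, fmt.Errorf("malformed traceparent")
	}
	flags, _ := strconv.ParseUint(parts[3], 16, 8) // already validated as hex
	return parts[1], parts[2], flags&1 == 1, nil
}

func main() {
	headers := map[string][]byte{}
	if err := Inject(headers, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true); err != nil {
		panic(err)
	}
	fmt.Println(string(headers[headerName]))

	traceID, parentID, sampled, err := Extract(headers)
	fmt.Println(traceID, parentID, sampled, err)
}
```

The consumer starts its span with the extracted span ID as parent, which keeps the producer and consumer in one trace even though they never share a synchronous call.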

Performance and Scale Notes

Jaeger with Elasticsearch:

  • Write amplification peaks at 3× for high-attribute-density spans. Reduce by restricting the indexed tag allowlist and using bulk indexing with refresh_interval: 30s.
  • ES JVM heap should be capped at 50% of instance RAM and kept below the ~32 GB compressed-pointer threshold; sustained heap pressure above 75% degrades indexing throughput by up to 40%.
  • P99 query latency degrades super-linearly beyond ~500 M indexed spans per index; enforce ILM rollover aggressively.

Tempo with object storage:

  • Ingester memory usage scales with the number of open trace blocks. Each active trace consumes ~2 KB of ingester heap. At 100k concurrent open traces, budget ~200 MB per ingester pod.
  • Parquet columnar encoding typically achieves 8–12× compression versus raw JSON span payloads, substantially lowering S3 storage costs.
  • TraceQL queries (Tempo ≥ 2.0) perform columnar scans over Parquet blocks; query latency scales with block count, not attribute cardinality. Compact blocks aggressively (compaction_window: 1h) to keep block counts low.
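  • Object storage GET request costs can accumulate on high-frequency traceID lookups. Tempo can cache bloom filters and Parquet footers in memcached or Redis, which cuts repeated GETs for read-heavy Grafana usage. If you also tune result freshness on the query frontend (for example max_cache_freshness: 10m), confirm the option against the configuration reference for your Tempo version.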
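The sizing figures above reduce to simple arithmetic that is worth keeping in a script so it can be rerun as traffic grows. A minimal sketch, using the document's rough numbers (~2 KB per open trace, 8–12× compression) as inputs; the helper names are hypothetical and the results are estimates, not measurements.

```go
package main

import "fmt"

// IngesterHeapMB estimates heap for open traces, in decimal megabytes.
func IngesterHeapMB(openTraces, bytesPerTrace int) float64 {
	return float64(openTraces) * float64(bytesPerTrace) / 1e6
}

// StoredGB estimates the stored size of rawGB of span payload after compression.
func StoredGB(rawGB, compressionRatio float64) (float64, error) {
	if compressionRatio <= 0 {
		return 0, fmt.Errorf("compression ratio must be positive")
	}
	return rawGB / compressionRatio, nil
}

func main() {
	fmt.Printf("ingester heap: %.0f MB\n", IngesterHeapMB(100_000, 2_000))

	for _, ratio := range []float64{8, 12} {
		gb, err := StoredGB(50, ratio)
		if err != nil {
			panic(err)
		}
		fmt.Printf("50 GB/day raw at %.0fx compression: %.2f GB/day stored\n", ratio, gb)
	}
}
```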

Troubleshooting FAQ

Why do I get zero results searching by tag in Tempo?

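Tempo does not index span attributes, so attribute search works only through TraceQL (Tempo ≥ 2.0 on a Parquet block format such as vParquet3) or through exemplar-based trace discovery from Grafana dashboards. If a TraceQL query returns nothing, check the attribute scope first: resource attributes such as service.name are queried as resource.service.name, and span attributes as span.<key>. For exemplar-based discovery, the Tempo metrics-generator remote-writes RED metrics to Prometheus, dashboards surface exemplars, and clicking an exemplar resolves a traceID that Tempo fetches by ID.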

Jaeger shows “index too large” errors — what causes this?

Elasticsearch shard bloat occurs when high-cardinality span attributes (user IDs, request bodies, full URL paths) are indexed without a field allowlist. In Jaeger v1, keep --es.tags-as-fields.all=false and list only the attributes you query with --es.tags-as-fields.include. Also enforce ILM rollover at 50 GB per primary shard to prevent hot-tier exhaustion.

Tempo is dropping spans from long-running async jobs — how do I fix it?

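First confirm that spans are really being dropped: check tempo_discarded_spans_total on the distributors, which is labelled with the reason for each discard. Spans that arrive long after the rest of their trace are normally written to a later block, not rejected, so a long-running trace that looks incomplete is often a search-window problem. In current Tempo releases max_duration sits under query_frontend.search and bounds the time range of a search; widen it, or look the trace up by ID. On the Collector side, avoid setting the tail_sampling processor's decision_wait to the maximum workflow duration, because every pending trace is held in memory for that long; exempt long-running workflows from tail sampling or split them into shorter linked traces instead.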

How do I verify trace completeness after migrating backends?

Cross-reference sampled trace counts against application error logs and synthetic monitor baselines. Query both backends for the same traceID during the dual-write window and diff the span counts. Orphaned spans — spans whose parentSpanId does not resolve to a known span — indicate dropped root spans or incomplete context propagation.

Can Jaeger and Tempo receive the same trace simultaneously?

Yes. Configure the Collector with two named OTLP exporters (otlp/jaeger and otlp/tempo) in the same pipeline. Both receive identical spans. The overhead is ~15–20% extra network egress from the Collector; use a batch processor with send_batch_size: 512 and send_batch_max_size: 512 to limit backpressure.

