Encrypting Trace Payloads at Rest and in Transit

Disable OTEL_EXPORTER_OTLP_INSECURE, enforce mTLS at the collector ingress, and mandate KMS-backed server-side encryption at the storage layer. Together these three controls close the three plaintext leakage paths this section covers: SDK export, collector disk spill, and backend object writes.

Context and When It Matters

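OpenTelemetry SDKs batch spans in memory and flush them over HTTP or gRPC to the collector. When TLS negotiation fails (expired certificates, mismatched CAs, a load balancer stripping ALPN), a pipeline that tolerates insecure transport can degrade to plaintext to preserve telemetry continuity. Whether that happens depends on the SDK, its version, and how the exporter is configured, so verify the behaviour of your own exporters rather than assuming a failed handshake stops the export.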

The compliance impact is severe. Span attributes routinely carry sensitive context: user identifiers, database queries, JWT fragments, and internal service tokens. When these payloads traverse unencrypted channels or persist in plaintext collector spillover directories, they can breach the data-protection requirements of frameworks such as HIPAA, PCI-DSS, and SOC 2, and they expose the pipeline to credential harvesting. Within the broader framework of security boundaries in distributed tracing, observability pipelines are frequently treated as secondary infrastructure during rapid scaling, which leaves encryption controls deprioritised until audit findings surface.

The Async Export Vulnerability — Illustrated

The diagram below shows the three points where plaintext leakage commonly occurs: the SDK export fallback, the collector batch spillover directory, and unencrypted object writes to the storage backend.

[Figure: OpenTelemetry pipeline encryption gaps. An instrumented service exports spans over OTLP HTTP/gRPC to the OTel Collector (receiver → batch → exporter), which writes objects to a storage backend (S3 / Elasticsearch). Three leakage points are labelled on the data path: ① SDK fallback to plaintext, ② unencrypted disk spill to the spillover directory /var/otel/buffer, ③ no SSE-KMS on object write.]

Core Mechanism: Closing Each Leakage Point

Leakage Point 1 — SDK Plaintext Fallback (In-Transit)

Set these environment variables before your service starts. Each controls a distinct part of the TLS handshake:

# Force the HTTPS/gRPC+TLS endpoint — never http://
export OTEL_EXPORTER_OTLP_ENDPOINT=https://collector.internal:4318

# CA that signed the collector's server certificate
export OTEL_EXPORTER_OTLP_CERTIFICATE=/etc/ssl/certs/ca-bundle.crt

# mTLS: present a client certificate so the collector can authenticate the SDK
export OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE=/etc/otel/mtls/client.crt
export OTEL_EXPORTER_OTLP_CLIENT_KEY=/etc/otel/mtls/client.key

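# CRITICAL: never permit an insecure (non-TLS) connection.
# Per the OTLP exporter specification this flag only affects gRPC endpoints
# given without a scheme; for OTLP/HTTP the https:// scheme above forces TLS.
export OTEL_EXPORTER_OTLP_INSECURE=false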

Pair TLS enforcement with bounded retry semantics so a certificate rotation does not cause unbounded memory growth. In Go:

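package main

import (
    "context"
    "log"
    "time"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

func main() {
    ctx := context.Background()
    exporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithRetry(otlptracehttp.RetryConfig{
            Enabled:         true,
            InitialInterval: 1 * time.Second,
            MaxInterval:     10 * time.Second,
            MaxElapsedTime:  30 * time.Second, // give up, do not buffer indefinitely
        }),
    )
    if err != nil {
        log.Fatalf("create OTLP exporter: %v", err)
    }
    defer exporter.Shutdown(ctx) // flush and release the connection on exit
}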

In Node.js (@opentelemetry/exporter-trace-otlp-http), set timeoutMillis: 5000 so every export attempt has a deadline, and keep the BatchSpanProcessor queue bounded with maxQueueSize. Without a deadline and a bounded queue, spans can accumulate through certificate-rotation windows and eventually exhaust the process heap.

Overhead note. TLS handshakes add roughly 1–3 ms per connection. OTLP exporters batch spans (SDK default: up to 512 spans per export, flushed every 5 s) and reuse connections, so even a full handshake on every batch would cost about 3 ms / 512 ≈ 0.006 ms per span, under 0.01 ms. CPU impact typically stays below 2 % on modern x86/ARM hardware.

Leakage Point 2 — Collector Batch Spillover (Disk)

When batches back up behind a slow or failing exporter and a disk-backed queue is configured (for example, an exporter sending_queue backed by the file_storage extension), the collector spills spans to a local directory. If that directory lacks filesystem encryption (dm-crypt or LUKS on Linux, or a cloud provider encrypted volume), spans persist in plaintext until the next flush.

Mitigations:

  • Mount the spillover path on an encrypted volume (/dev/mapper/otel-data).
  • Set send_batch_max_size low enough that back-pressure propagates to callers before the queue overflows — a sane default is 10,000 spans.
  • Configure the collector’s memory_limiter processor ahead of batch to shed load gracefully under memory pressure rather than spilling.

Leakage Point 3 — Backend Storage Without SSE (At-Rest)

Tempo (S3 backend)

storage:
  trace:
    backend: s3
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      bucket: otel-traces-prod
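      # Verify these SSE key names against the storage configuration
      # reference for the Tempo release you run.
      sse_kms: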
        key_id: arn:aws:kms:us-east-1:123456789012:key/abcd-1234-efgh-5678
        encryption: true

Jaeger with Elasticsearch

Elasticsearch does not have a native xpack.security.encryption block for AES selection — at-rest encryption is handled by the underlying volume. Enable node-level TLS and authentication first:

# elasticsearch.yml
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/certs/elastic-certificates.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: /etc/elasticsearch/certs/http.p12

Then encrypt at the volume level: AWS EBS with a KMS CMK, GCP CMEK, or Azure Disk Encryption. This separates the encryption boundary from Elasticsearch itself, which means key rotation does not require index rebuilds.

IAM least-privilege for the collector’s service account:

  • Grant kms:GenerateDataKey and kms:GenerateDataKeyWithoutPlaintext.
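  • Explicitly deny kms:Decrypt. The collector is a write-only participant; it should never be able to read encrypted trace objects even if compromised. One caveat: S3 multipart uploads to an SSE-KMS bucket require kms:Decrypt, so confirm that your writer uses single-part uploads before applying the deny.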

Overhead note. SSE-KMS adds 15–50 ms per object write. Tempo and Jaeger batch 10–100 MB blocks before uploading, amortising KMS round-trips to under 0.5 ms per span. Watch storage_write_latency_seconds in your collector metrics dashboard to detect KMS throttling early.

Collector Receiver TLS Configuration

The SDK-side environment variables are meaningless if the collector accepts plaintext connections. Lock down both protocols explicitly:

# otelcol-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        tls:
          cert_file: /etc/otelcol/server.crt
          key_file:  /etc/otelcol/server.key
          ca_file:   /etc/otelcol/ca.crt
          client_ca_file: /etc/otelcol/client-ca.crt  # mTLS: require client cert
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol/server.crt
          key_file:  /etc/otelcol/server.key
          ca_file:   /etc/otelcol/ca.crt
          client_ca_file: /etc/otelcol/client-ca.crt

Omitting either tls block causes the receiver to accept plaintext connections on that protocol regardless of the SDK configuration. This is a common misconfiguration during collector upgrades, when config files are regenerated from templates.

Verification Checklist

Use these steps in order. Each one targets a specific layer of the pipeline:

Common Pitfalls

  • Self-signed CA mounted in the app but not in the collector. The handshake fails, the SDK logs a warning, and, if OTEL_EXPORTER_OTLP_INSECURE is not explicitly false, export may silently continue over plaintext. Always roll out CA changes to both sides together.
  • Cross-VPC DNS resolving to a public ALB. TLS terminates at the ALB, leaving intra-VPC collector traffic unencrypted. Use AWS PrivateLink or GCP Private Service Connect, and pin OTEL_EXPORTER_OTLP_ENDPOINT to the private DNS name.
  • KMS key policy missing GenerateDataKeyWithoutPlaintext. Tempo uses this operation for envelope encryption. Without the data-key permissions, SSE-KMS writes to S3 are rejected; make sure those errors are surfaced and alerted on, and that no retry or fallback path writes to a location without SSE.

Troubleshooting FAQ

Why does my collector start successfully but spans never arrive? The most common cause is a client_ca_file on the receiver that does not include the CA that signed the SDK’s client certificate. The handshake fails at the mutual-auth step. Inspect collector logs for tls: certificate required or tls: bad certificate.

Why does SSE-KMS appear in CloudTrail but Tempo still writes unencrypted objects? Tempo writes a tempo.index sidecar alongside block data. If the sse_kms block is set but encryption: true is omitted (it defaults to false in some Tempo versions), the index is written in plaintext. Confirm encryption: true is present and rebuild the index.

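How do I confirm mTLS is enforced and not just TLS? Run openssl s_client -connect collector.internal:4318 without supplying -cert and -key. A properly configured mTLS endpoint rejects the session with a certificate alert (for example "certificate required" or "bad certificate"); the collector logs the matching tls: certificate required error. With TLS 1.3 the alert can arrive just after the client side of the handshake appears to finish, so read the output to the end. If the session stays open with no client certificate, mTLS is not enforced.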

