Security Boundaries in Distributed Tracing
Uncontrolled trace propagation is a data-leak vector that most teams discover at the worst possible moment: during a compliance audit or a post-incident review. When a microservice forwards traceparent headers and exports its spans straight through to a third-party analytics sink, every span attribute that touched the request goes along: database queries, JWT fragments, user email addresses, and internal service topology. The impact spreads silently because observability tooling is rarely subject to the same access controls as application data stores.
This page maps the concrete mechanisms that enforce security boundaries inside a tracing pipeline: trust zone classification, SDK-level attribute redaction, sanitized async propagation, mTLS-hardened transport, and zero-trust query governance. Each section targets a specific failure mode and provides production-ready implementation steps.
Problem Framing
The typical symptom that drives teams to this topic is a compliance finding: a PCI-DSS or HIPAA auditor identifies that db.statement, http.request.body, or user.id span attributes are accessible to engineers who have no business need to see them. Sometimes the issue is narrower — a Kafka consumer downstream of a zone boundary is generating child spans linked to trace IDs that originated in a network segment it should be isolated from, effectively re-assembling traces that should never span that boundary.
Both problems share the same root cause: telemetry pipelines are designed for observability throughput, not data classification. Span attributes flow wherever the exporter points them, and context headers propagate wherever the HTTP client forwards them, unless something in the pipeline explicitly stops them.
Prerequisites
Before implementing the controls below, ensure:
- OpenTelemetry SDK is initialized in each service with a configurable SpanProcessor chain (not just the default SimpleSpanProcessor).
- Services emit spans via OTLP (gRPC or HTTP) to a centralized OpenTelemetry Collector, not directly to a backend.
- You have identified which span attributes contain PII, PCI-scoped, or compliance-restricted data.
- Network zones are labelled (e.g., public, internal, restricted, partner) and services carry a zone tag as a resource attribute.
- Collector version 0.90+ (for the attributes processor hash action and routing connector support).
Trust Zone Architecture
Before writing code, model where trace context crosses security perimeters. The diagram below shows a typical four-zone layout and the controls that activate at each boundary.
Mapping Trust Zones to Telemetry Data Flows
Every trace-bearing request crosses at least one security perimeter. The framework below makes those crossings explicit before any code is written.
- Inventory communication paths. List all service-to-service connections and classify each endpoint by trust level: public, internal, restricted, or partner. Services in the restricted zone include payment processors, PII stores, and healthcare data services.
- Map headers to policies. Align W3C TraceContext headers (traceparent, tracestate) and custom baggage keys with your data classification matrix. Define which keys are permitted to cross each zone boundary.
- Define boundary rules. For each zone transition, establish an explicit action for every header and attribute class: allow, sanitize, drop, or mask. For example: strip user.email, payment.card_last4, and db.statement attributes when flushing spans from the internal mesh to a third-party analytics sink.
Document these rules in version-controlled configuration — not in comments inside SpanProcessor code — so they are auditable and reviewable independently of the instrumentation code.
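The boundary rules described above can be sketched as a small lookup table plus one function that applies it. This is a minimal, standard-library-only illustration, not part of any OpenTelemetry API: the BOUNDARY_RULES table, the apply_boundary_rules function, and the choice to treat "sanitize" as a truncated SHA-256 digest are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Mapping, Tuple

# Hypothetical rule table: (source zone, destination zone) -> attribute key -> action.
# In practice this would be loaded from version-controlled configuration.
BOUNDARY_RULES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("internal", "partner"): {
        "user.email": "drop",
        "payment.card_last4": "drop",
        "db.statement": "mask",
        "user.id": "sanitize",
    },
}
DEFAULT_ACTION = "allow"  # assumption: keys without a rule pass through unchanged


def apply_boundary_rules(src: str, dst: str, attrs: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of attrs with the rules for the src -> dst transition applied."""
    rules = BOUNDARY_RULES.get((src, dst), {})
    out: Dict[str, str] = {}
    for key, value in attrs.items():
        action = rules.get(key, DEFAULT_ACTION)
        if action == "drop":
            continue  # attribute never crosses the boundary
        if action == "mask":
            out[key] = "[REDACTED]"  # key stays visible, value is hidden
        elif action == "sanitize":
            # One-way digest: values stay comparable without being readable
            out[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:16]
        else:
            out[key] = value
    return out
```

A stricter deployment might set the default action to drop for any transition that leaves the internal mesh, so that new attributes are blocked until someone classifies them.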
Step-by-Step Implementation
Step 1 — Deploy a Custom SpanProcessor for Attribute Redaction
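OpenTelemetry SDKs provide a SpanProcessor interface with hooks that fire during the span lifecycle. In Go, SDK-side redaction belongs in OnStart, the only hook that receives a mutable ReadWriteSpan; OnEnd receives a ReadOnlySpan. OnStart sees only the attributes supplied when the span is created, so attributes set later in the span’s life still need a downstream control such as the Collector’s attributes processor. Relying on the Collector alone is not sufficient either: the SDK-to-Collector channel carries the unredacted payload.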
Go: Attribute redaction in OnStart
package tracing

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// BoundaryRedactor strips or masks attributes that must not cross zone boundaries.
// It implements sdktrace.SpanProcessor and should be registered before the BatchSpanProcessor.
type BoundaryRedactor struct {
	// blockedAttrs contains attribute keys whose values must be replaced.
	blockedAttrs map[attribute.Key]bool
}

func NewBoundaryRedactor(blocked []string) *BoundaryRedactor {
	m := make(map[attribute.Key]bool, len(blocked))
	for _, k := range blocked {
		m[attribute.Key(k)] = true
	}
	return &BoundaryRedactor{blockedAttrs: m}
}

// OnStart fires on a mutable ReadWriteSpan. It sees only the attributes
// supplied at span creation; attributes set later are not visible here.
func (r *BoundaryRedactor) OnStart(_ context.Context, s sdktrace.ReadWriteSpan) {
	original := s.Attributes()
	redacted := make([]attribute.KeyValue, 0, len(original))
	for _, kv := range original {
		if r.blockedAttrs[kv.Key] {
			redacted = append(redacted, attribute.String(string(kv.Key), "[REDACTED]"))
		} else {
			redacted = append(redacted, kv)
		}
	}
	// SetAttributes overwrites existing values for keys that are already set.
	s.SetAttributes(redacted...)
}

// OnEnd is a no-op: the span is read-only at this point.
func (r *BoundaryRedactor) OnEnd(_ sdktrace.ReadOnlySpan) {}

func (r *BoundaryRedactor) Shutdown(_ context.Context) error { return nil }

func (r *BoundaryRedactor) ForceFlush(_ context.Context) error { return nil }
Registration (SDK initialization)
tp := sdktrace.NewTracerProvider(
	sdktrace.WithSpanProcessor(NewBoundaryRedactor([]string{
		"user.email",
		"http.request.body",
		"db.statement",
		"payment.card_last4",
	})),
	sdktrace.WithBatcher(otlpExporter),
)
Python: SpanExporter wrapper for attribute filtering
In Python, the cleanest insertion point is a wrapping SpanExporter that filters before forwarding, because the Python SDK’s on_start hook does not expose span mutation in all versions:
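from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult

BLOCKED_ATTRS = frozenset(["user.email", "db.statement", "http.request.body"])


class RedactingExporter(SpanExporter):
    """Wraps another exporter and masks blocked attributes before forwarding."""

    def __init__(self, delegate: SpanExporter):
        self._delegate = delegate

    def export(self, spans) -> SpanExportResult:
        cleaned = []
        for span in spans:
            # Exported spans are read-only, so build a filtered attribute copy
            filtered = {
                k: "[REDACTED]" if k in BLOCKED_ATTRS else v
                for k, v in (span.attributes or {}).items()
            }
            # Rebuild the span with the same identity and timing but filtered attributes
            cleaned.append(ReadableSpan(
                name=span.name, context=span.context, parent=span.parent,
                resource=span.resource, attributes=filtered, events=span.events,
                links=span.links, kind=span.kind, status=span.status,
                start_time=span.start_time, end_time=span.end_time,
                instrumentation_scope=span.instrumentation_scope,
            ))
        return self._delegate.export(cleaned)

    def shutdown(self):
        self._delegate.shutdown()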
Step 2 — Configure a Boundary-Aware Propagator
A malformed or adversarially crafted traceparent can link spans across traces the attacker controls, inflate cardinality, or inject invalid state. Validate incoming headers before creating child spans.
import re

from opentelemetry.context import Context
from opentelemetry.propagators.textmap import (
    CarrierT,
    Getter,
    TextMapPropagator,
    default_getter,
    default_setter,
)
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# W3C traceparent spec: 00-<32hex>-<16hex>-<2hex>
_TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")


class ValidatingPropagator(TextMapPropagator):
    """Wraps the standard TraceContext propagator with strict format validation."""

    def __init__(self):
        self._inner = TraceContextTextMapPropagator()

    def extract(self, carrier: CarrierT, context=None, getter: Getter = default_getter):
        if context is None:
            context = Context()
        raw = getter.get(carrier, "traceparent")
        if raw and not _TRACEPARENT_RE.match(raw[0]):
            # Non-compliant header: return the context unchanged so the caller
            # starts a fresh root span and does not link to attacker context
            return context
        return self._inner.extract(carrier, context, getter)

    def inject(self, carrier, context=None, setter=default_setter):
        self._inner.inject(carrier, context, setter)

    @property
    def fields(self):
        return self._inner.fields
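The format check can also be exercised without the SDK. The sketch below is a standard-library-only illustration of the same validation, with one addition: it rejects all-zero trace and parent IDs, which the W3C Trace Context specification defines as invalid. The function name parse_traceparent is an assumption for this example, not an OpenTelemetry API.

```python
import re
from typing import Optional, Tuple

# Version 00 only: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value: str) -> Optional[Tuple[str, str, int]]:
    """Return (trace_id, parent_id, flags) for a valid header, else None."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None  # wrong version, wrong length, uppercase hex, or garbage
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid per the W3C specification
    return trace_id, parent_id, int(flags, 16)


# Usage: a None result means "ignore the header and start a new root span".
print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
print(parse_traceparent("00-" + "0" * 32 + "-00f067aa0ba902b7-01"))
```

Using fullmatch avoids the edge case where a trailing newline slips past a pattern anchored with a dollar sign.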
Step 3 — Sanitize Context at Async and External Boundaries
Asynchronous messaging systems like Kafka and SQS break synchronous trace continuity, but trace context often persists in message headers long after the originating request has completed. This creates two risks: stale context linking unrelated spans, and high-sensitivity baggage reaching consumers that should never see it.
Java: Kafka producer with boundary-safe context injection
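import java.nio.charset.StandardCharsets;

import io.opentelemetry.api.trace.SpanContext;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BoundaryAwareKafkaPublisher {
    // SLF4J is assumed here because the log call uses {} placeholders
    private static final Logger log = LoggerFactory.getLogger(BoundaryAwareKafkaPublisher.class);

    private final Producer<String, String> producer;

    public BoundaryAwareKafkaPublisher(Producer<String, String> producer) {
        this.producer = producer;
    }

    /**
     * Publishes a Kafka record carrying only a sanitized correlation ID.
     * The raw traceparent is deliberately not propagated to prevent cross-zone
     * trace linking and baggage leakage.
     */
    public void publishWithCorrelation(
            ProducerRecord<String, String> record,
            SpanContext parentCtx) {
        // Extract only the trace-id (32 hex chars); omit flags and tracestate.
        // Declared final and outside the if block so the send callback can log it.
        final String correlationId = parentCtx.isValid() ? parentCtx.getTraceId() : null;
        if (correlationId != null) {
            record.headers().add(
                "x-correlation-id",
                correlationId.getBytes(StandardCharsets.UTF_8));
        }
        // Explicitly do NOT inject traceparent: the consumer will start a new root span
        // and link it to the correlation ID via a linked context.
        producer.send(record, (meta, ex) -> {
            if (ex != null) {
                log.error("Publish failed for correlation {}", correlationId, ex);
            }
        });
    }
}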
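The same boundary rule can be applied to messages that already carry context headers, for example when a relay republishes records into a partner queue. The following is a minimal standard-library sketch: headers are modelled as (name, bytes) pairs, and the function name sanitize_message_headers is an assumption for illustration. It drops traceparent, tracestate, and baggage, and keeps only the trace-id as x-correlation-id.

```python
from typing import Iterable, List, Optional, Tuple

Header = Tuple[str, bytes]
_CONTEXT_HEADERS = frozenset({"traceparent", "tracestate", "baggage"})
_HEX = frozenset("0123456789abcdef")


def _trace_id_from(traceparent: bytes) -> Optional[str]:
    """Pull the 32-hex trace-id out of a traceparent value, or None if malformed."""
    parts = traceparent.decode("ascii", errors="replace").split("-")
    if len(parts) != 4:
        return None
    trace_id = parts[1]
    if len(trace_id) == 32 and set(trace_id) <= _HEX and trace_id != "0" * 32:
        return trace_id
    return None


def sanitize_message_headers(headers: Iterable[Header]) -> List[Header]:
    """Strip W3C context headers and emit a single derived x-correlation-id."""
    kept: List[Header] = []
    correlation_id: Optional[str] = None
    for name, value in headers:
        lowered = name.lower()
        if lowered == "traceparent" and correlation_id is None:
            correlation_id = _trace_id_from(value)
        # Drop context headers, plus any inbound x-correlation-id so it cannot be spoofed
        if lowered in _CONTEXT_HEADERS or lowered == "x-correlation-id":
            continue
        kept.append((name, value))
    if correlation_id is not None:
        kept.append(("x-correlation-id", correlation_id.encode("ascii")))
    return kept
```

Application headers pass through in their original order; only the propagation headers are removed.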
Sampling decisions at async boundaries compound the security calculus: over-sampling high-sensitivity queues increases both data exposure and storage costs simultaneously. Apply targeted sampling rates at async boundaries to limit trace volume from restricted zones.
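One way to picture zone-targeted sampling is a deterministic decision keyed on the trace ID, so every service makes the same choice for the same trace. The sketch below is a standard-library illustration in the spirit of trace-ID ratio sampling; the zone names come from this page, while the ratios, ZONE_SAMPLE_RATIOS, and should_sample are hypothetical and not an SDK API.

```python
from typing import Dict

# Hypothetical per-zone sampling ratios; restricted zones keep the fewest traces.
ZONE_SAMPLE_RATIOS: Dict[str, float] = {
    "restricted": 0.01,
    "internal": 0.10,
    "public": 0.25,
}
DEFAULT_RATIO = 0.05  # assumption for zones without an explicit entry


def should_sample(trace_id_hex: str, zone: str) -> bool:
    """Deterministically decide whether to keep a trace originating in the given zone."""
    ratio = ZONE_SAMPLE_RATIOS.get(zone, DEFAULT_RATIO)
    # Treat the low 64 bits of the trace ID as a uniform random value
    low64 = int(trace_id_hex[-16:], 16)
    # Keep the trace when that value falls below ratio * 2^64
    return low64 < int(ratio * (1 << 64))
```

Because the decision depends only on the trace ID and the zone, a consumer can re-evaluate it without any shared state.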
Step 4 — Harden OTLP Transport and Collector Pipelines
Telemetry data in motion and at rest requires cryptographic controls equivalent to those applied to application data. OTLP endpoints without mTLS accept spans from any client that can reach the collector port, making them trivial ingest targets.
OpenTelemetry Collector: mTLS and attribute redaction
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/certs/server.crt
          key_file: /etc/otel/certs/server.key
          client_ca_file: /etc/otel/certs/ca.crt # enforces mTLS
      http:
        endpoint: 0.0.0.0:4318
        tls:
          cert_file: /etc/otel/certs/server.crt
          key_file: /etc/otel/certs/server.key
          client_ca_file: /etc/otel/certs/ca.crt # without this, the HTTP receiver is server-TLS only

processors:
  attributes:
    actions:
      # Hash db.statement for debugging utility without exposing raw SQL
      - key: "db.statement"
        action: hash
      # Hard-delete fields that must never reach the backend
      - key: "http.request.body"
        action: delete
      - key: "user.password"
        action: delete
  batch:
    send_batch_size: 1024
    timeout: 5s

exporters:
  otlp/tempo:
    endpoint: "tempo.internal:4317"
    tls:
      insecure: false
      cert_file: /etc/otel/certs/client.crt
      key_file: /etc/otel/certs/client.key

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlp/tempo]
For storage-layer encryption details including KMS key configuration for Jaeger and Tempo backends, see Encrypting trace payloads at rest and in transit.
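To see why hashing db.statement keeps some debugging value, consider what a one-way digest preserves: identical statements still produce identical values, so they can be grouped and counted without the SQL being readable. The sketch below uses Python's hashlib and assumes SHA-256, the algorithm named in the performance notes on this page; hash_attribute is a hypothetical helper, not the Collector's implementation.

```python
import hashlib


def hash_attribute(value: str) -> str:
    """Return the hex SHA-256 digest of an attribute value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Identical statements map to the same digest, so they remain groupable
a = hash_attribute("SELECT * FROM orders WHERE id = ?")
b = hash_attribute("SELECT * FROM orders WHERE id = ?")
c = hash_attribute("SELECT * FROM users WHERE id = ?")
print(a == b)  # True
print(a == c)  # False
```

A digest is not anonymization for guessable inputs: a short or low-entropy value can be recovered by hashing candidates and comparing, which is why the configuration deletes fields such as user.password outright.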
Step 5 — Apply Zero-Trust Access Governance to Trace Queries
Trace storage backends and their query frontends (Jaeger, Tempo, Grafana) must enforce least-privilege query access. Without RBAC or a policy layer, any engineer with network access to the query API can retrieve full trace data, including spans from the restricted data zone.
OPA/Rego policy for trace query access control
package trace_access

import future.keywords.if
import future.keywords.in

default allow := false

# SREs can query all internal-zone traces
allow if {
    input.user.role == "sre"
    input.resource.zone in {"internal", "dev"}
}

# Developers can query dev-zone traces only if no PCI attributes are present
allow if {
    input.user.role == "developer"
    input.resource.zone == "dev"
    not any_pci_attr
}

# Compliance auditors see restricted-zone traces but only via the audit query path
allow if {
    input.user.role == "compliance_auditor"
    input.request.path == "/api/traces/audit"
}

any_pci_attr if {
    some attr in input.trace.attributes
    startswith(attr, "payment.")
}
Route all trace query requests through this policy layer. Log every access attempt — including denials — to a centralized SIEM. Immutable audit logs are the evidence that a regulator will ask for.
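An audit entry for each decision can be as simple as one structured JSON line. The sketch below is a minimal standard-library illustration; the field names and the audit_record function are assumptions for this example, and shipping the line to a SIEM is left to whatever log forwarder is already in place.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(
    user_id: str,
    role: str,
    zone: str,
    path: str,
    allowed: bool,
    at: Optional[datetime] = None,
) -> str:
    """Serialize one access decision, allow or deny, as a single JSON line."""
    at = at or datetime.now(timezone.utc)
    record = {
        "timestamp": at.isoformat(),
        "user": user_id,
        "role": role,
        "zone": zone,
        "path": path,
        "decision": "allow" if allowed else "deny",
    }
    # sort_keys gives a stable field order, which simplifies downstream parsing
    return json.dumps(record, sort_keys=True)


# Usage: denials are recorded exactly like approvals
print(audit_record("u-123", "developer", "restricted", "/api/traces", allowed=False))
```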
Verification
After deploying these controls, verify each layer independently before relying on them together.
Header sanitization at zone boundaries
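# Send a test request whose sensitive fields contain a known canary value (here the
# hypothetical string pii-canary-7f3a), then capture raw OTLP traffic on a plaintext,
# uncompressed test hop. With TLS or gzip enabled, grep cannot see payload strings.
# Redacted keys such as user.email stay on the wire with "[REDACTED]" values,
# so search for the canary value, not the attribute names.
sudo tcpdump -i eth0 -A -s 0 'port 4317 or port 4318' | grep -F 'pii-canary-7f3a'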
SpanProcessor unit test (Go)
// startTestSpan and attributeMap are test helpers assumed to exist in this package:
// the first returns a ReadWriteSpan carrying the given attributes, the second
// flattens []attribute.KeyValue into a map[string]string.
func TestBoundaryRedactor_RedactsBlockedAttrs(t *testing.T) {
	redactor := NewBoundaryRedactor([]string{"user.email"})
	span := startTestSpan(t,
		attribute.String("user.email", "alice@example.com"),
		attribute.String("http.method", "POST"))
	redactor.OnStart(context.Background(), span)
	attrs := attributeMap(span.Attributes())
	assert.Equal(t, "[REDACTED]", attrs["user.email"])
	assert.Equal(t, "POST", attrs["http.method"]) // non-blocked attribute preserved
}
mTLS validation
openssl s_client -connect collector.internal:4317 \
-cert /etc/otel/certs/client.crt \
-key /etc/otel/certs/client.key \
-CAfile /etc/otel/certs/ca.crt \
-tls1_2 </dev/null 2>&1 | grep 'Verify return code'
# Expected: Verify return code: 0 (ok)
OPA policy smoke test
# The result value for data.trace_access.allow should be false: a developer cannot access the restricted zone
echo '{"user":{"role":"developer"},"resource":{"zone":"restricted"},"trace":{"attributes":[]}}' \
| opa eval -d policy.rego -I 'data.trace_access.allow'
Edge Cases and Gotchas
- Go SDK OnEnd mutation is impossible. The ReadOnlySpan passed to OnEnd does not permit attribute changes, so redaction logic placed in OnEnd has no effect on the exported span. Place SDK-side redaction in OnStart via ReadWriteSpan, and use the Collector’s attributes processor as a secondary control.
- tracestate vendor extensions carry implicit data. The tracestate header can carry vendor-specific key-value pairs (e.g., dd=s:2;t.dm:-4, b3=...). These are not sanitized by standard propagators and may encode sampling decisions or user-tier hints. Strip tracestate entirely at external boundaries, or parse and allowlist known keys.
- Queue spillover can write plaintext to disk. SDK BatchSpanProcessors generally drop spans once their in-memory queue is full, but a Collector exporter queue backed by persistent storage (for example the file_storage extension) writes queued spans to local disk. If the host filesystem lacks encryption (dm-crypt, LUKS, or cloud provider disk encryption), those spans persist unencrypted. Monitor otelcol_exporter_queue_size and alert before the queue saturates.
- Baggage propagates indefinitely unless explicitly dropped. Baggage keys set by an upstream service are forwarded by every SDK that processes the request unless a processor explicitly removes them. A tenant-ID or internal routing hint injected at the edge will arrive at your third-party logging sink if you do not strip it at the outbound boundary.
- OPA policy evaluation latency compounds on hot query paths. Embedding an OPA sidecar in the trace query path adds 1–5ms per evaluation at low cardinality. Cache immutable identity claims (JWT role, team membership) with a short TTL to avoid re-evaluating unchanged facts on every span query.
- mTLS certificate rotation requires coordinated reload. Replacing collector certificates without a coordinated reload causes a window where SDK clients fail TLS handshakes and drop or retry exports. Confirm that OTEL_EXPORTER_OTLP_INSECURE is unset or false in every service so that no client is left sending plaintext. Use certificate management tooling (cert-manager, Vault PKI) with automatic reload signals (SIGHUP on the Collector process).
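The tracestate and baggage gotchas above share one remedy at the outbound boundary: filter the headers before the request leaves. The sketch below is a standard-library illustration in which headers are a plain dict; the allowlisted key, filter_tracestate, and sanitize_outbound_headers are hypothetical names for this example.

```python
from typing import Dict, FrozenSet

# Hypothetical allowlist: only these tracestate keys may leave the zone
ALLOWED_TRACESTATE_KEYS: FrozenSet[str] = frozenset({"rojo"})


def filter_tracestate(header: str, allowed: FrozenSet[str]) -> str:
    """Keep only allowlisted key=value members of a tracestate header."""
    kept = []
    for member in header.split(","):
        member = member.strip()
        key, sep, _value = member.partition("=")
        if sep and key in allowed:
            kept.append(member)
    return ",".join(kept)


def sanitize_outbound_headers(
    headers: Dict[str, str],
    allowed: FrozenSet[str] = ALLOWED_TRACESTATE_KEYS,
) -> Dict[str, str]:
    """Drop baggage entirely and reduce tracestate to its allowlisted members."""
    out: Dict[str, str] = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered == "baggage":
            continue  # baggage never crosses the external boundary
        if lowered == "tracestate":
            filtered = filter_tracestate(value, allowed)
            if filtered:
                out[name] = filtered  # omit the header if nothing survives
            continue
        out[name] = value
    return out
```

An empty allowlist reproduces the "strip tracestate entirely" option.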
Performance and Scale Notes
- SpanProcessor chain overhead: Each registered processor adds a synchronous function call per span. A single BoundaryRedactor with 10 blocked attributes adds under 1μs per span on modern hardware. Chain processors economically: one redactor covering all zones is preferable to four zone-specific processors.
- Collector attributes processor hashing: The hash action on db.statement uses SHA-256. At 50,000 spans/second, hashing adds approximately 15ms CPU load on a 2-core collector instance. Pre-allocate the collector with sufficient CPU headroom for the peak span rate.
- OPA policy caching: The OPA Go SDK caches compiled policy bundles in memory. For trace query backends serving 500+ concurrent engineers, use the OPA REST API with a shared sidecar per region rather than per-process embedding to avoid redundant bundle compilation overhead.
- mTLS handshake amortization: OTLP exporters hold persistent gRPC connections. At 50,000 spans/second and a default batch size of 512, each connection handles ~100 batches/second. The TLS handshake cost (1–3ms) is amortized across thousands of spans per connection, contributing under 0.01ms per span in steady state.
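The handshake amortization figures can be checked with a few lines of arithmetic. The sketch below uses the numbers quoted in these notes (50,000 spans/second, batch size 512, a 3ms handshake at the upper end of the range); the one-second connection lifetime is an assumption chosen to be pessimistic, since persistent gRPC connections normally live far longer.

```python
SPANS_PER_SECOND = 50_000
BATCH_SIZE = 512
HANDSHAKE_MS = 3.0  # upper end of the 1-3ms range

# Batches sent per second over a single connection
batches_per_second = SPANS_PER_SECOND / BATCH_SIZE
print(batches_per_second)  # 97.65625, i.e. roughly 100


def amortized_handshake_ms(handshake_ms: float, spans_per_second: int, lifetime_s: float) -> float:
    """Handshake cost spread over every span sent during one connection lifetime."""
    return handshake_ms / (spans_per_second * lifetime_s)


# Even a connection that lives only one second spreads the cost over 50,000 spans
per_span = amortized_handshake_ms(HANDSHAKE_MS, SPANS_PER_SECOND, lifetime_s=1.0)
print(per_span < 0.01)  # True
```

At this span rate the per-span share drops below 0.01ms once a connection has carried more than 300 spans, which takes about 6ms of traffic.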
Troubleshooting FAQ
Why does redacting PII in the OpenTelemetry Collector still leak data?
The Collector processes spans after they leave the SDK, meaning PII-bearing attributes traverse the network before any processor can act on them. If the SDK-to-Collector channel is unencrypted or the Collector is compromised, data leaks before redaction. Enforce attribute redaction inside the SDK’s SpanProcessor so sensitive data never leaves the process in plaintext.
Can traceparent headers be exploited for injection attacks?
Yes. A malformed or adversarially crafted traceparent can poison trace context, link unrelated spans into attacker-controlled traces, or inflate cardinality. Always validate incoming traceparent values against the W3C spec (version, trace-id format, flags) before creating child spans. Reject or generate a fresh root span for non-compliant values.
How should Kafka message headers carry trace context across trust boundaries?
Strip the raw traceparent header before publishing to external or partner queues. Extract only a sanitized correlation ID (the trace-id portion, without flags or tracestate), write it to a dedicated application-level header (e.g., x-correlation-id), and let the consumer create a new root span that references this ID as a linked context rather than a parent.
What is the performance impact of mTLS on high-throughput OTLP pipelines?
TLS handshakes add 1–3ms per TCP connection. Because OTLP exporters maintain persistent connections and batch spans, the per-span overhead is under 0.01ms. CPU cost stays below 2% on modern hardware. Use connection pooling and keep-alive to amortize handshake cost across thousands of spans per connection.
Why does the Go SDK OnEnd not allow attribute mutation for redaction?
The Go OTel SDK provides a ReadOnlySpan in OnEnd, which cannot be mutated after the span has ended. Redaction must happen in OnStart via ReadWriteSpan, before attributes are finalized. Alternatively, use the OpenTelemetry Collector’s attributes processor for server-side redaction, accepting that the data travels the SDK-to-Collector hop in its original form.
Related
- Encrypting Trace Payloads at Rest and in Transit — mTLS configuration, KMS at-rest encryption, and SDK plaintext fallback prevention
- Understanding W3C TraceContext Propagation — header lifecycle, inject/extract mechanics, and safe propagation semantics
- Choosing Between Head-Based and Tail-Based Sampling — balancing data exposure risk with telemetry budget at sampling decision points
- Trace Storage Backend Comparison: Jaeger vs Tempo — access control and data governance trade-offs across storage backends