Tenant Context Propagation in Multi-Tenant SaaS

Problem Framing

When a tenant identifier drops mid-flight in a multi-service SaaS system, the operational damage is immediate and hard to diagnose. Billing pipelines aggregate usage under the wrong account. Tenant-aware rate limiters see requests with no identity and either pass them all through or reject them all. Most critically, your observability pipeline loses the ability to correlate logs, metrics, and traces to a specific customer — turning every cross-tenant incident into a manual log trawl. The challenge is not injecting the tenant ID once at the edge; it is keeping that ID attached to every span, every async message, and every background job as the request fans out across dozens of services.

Prerequisites

Before implementing the patterns on this page, verify the following are in place:

  • OpenTelemetry SDK v1.10 or later in all services (stable Baggage API)
  • The W3C tracecontext and baggage propagators registered in every process — not just the ingress service
  • A consistent tenant identifier format (UUIDv4 recommended) enforced at account creation
  • Reverse proxies and API gateways configured to forward rather than strip the baggage header
  • Familiarity with how OpenTelemetry Baggage differs from Span Attributes — both are used here, for different purposes

How Tenant Context Flows Through a Distributed Request

The diagram below shows the lifecycle of a tenant.id from the API gateway through synchronous service calls and then across an async Kafka boundary.

[Figure: Tenant context propagation from API gateway through services and Kafka. The API Gateway extracts and validates tenant.id and injects it into W3C Baggage (baggage: tenant.id=…). Service A reads Baggage and adds a Span Attribute. Service B reads Baggage and publishes to Kafka with record headers traceparent + baggage. Kafka preserves headers across partitions. The Consumer reconstructs the context. Gateway to Service B: synchronous HTTP (W3C Baggage header forwarded). Service B to Consumer: async boundary (explicit header serialisation required).]

The key insight: synchronous HTTP calls inherit baggage automatically once propagators are configured. Async boundaries — Kafka, SQS, RabbitMQ — break that automatic flow. You must serialise context into message headers on the producer side and reconstruct it on the consumer side.
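To make the header concrete, here is a minimal sketch of how a tenant.id entry could be encoded into, and decoded from, a W3C baggage header value. It uses only the Python standard library and deliberately ignores member properties and size limits. The function names and sample values are illustrative assumptions; in a real service the OpenTelemetry propagator does this work for you.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict[str, str]) -> str:
    """Build a baggage header value: comma-separated key=value pairs, values percent-encoded."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> dict[str, str]:
    """Parse a baggage header value, dropping any ';property' suffix on each member."""
    entries: dict[str, str] = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0]  # ignore optional member properties
        if "=" not in key_value:
            continue  # skip malformed members instead of failing the request
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


# Sample values are made up for illustration
header = encode_baggage({"tenant.id": "acme-corp-01", "request.region": "eu west"})
print(header)
print(decode_baggage(header))
```

The same string is what you should expect to see in the baggage HTTP header between services and, after Step 4, in the Kafka record header.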

Step-by-Step Implementation

Step 1 — Extract and Validate at the Ingress Layer

The API gateway or ingress controller is the only point where you can trust the tenant identifier. Extract it from the JWT tenant or sub claim, from subdomain routing, or from an X-Tenant-ID header (in that priority order). Validate the format against a strict allowlist regex and cross-reference against a tenant registry before passing anything downstream.

// Node.js ingress middleware
const { propagation, context } = require('@opentelemetry/api');
const TENANT_REGEX = /^[a-zA-Z0-9_-]{8,64}$/;

function injectTenantContext(req, res, next) {
  // Prefer JWT claim over raw header — headers can be spoofed by callers
  const tenantId = extractFromJWT(req) ?? req.headers['x-tenant-id'];

  if (!tenantId) {
    return next(Object.assign(new Error('Missing tenant context'), { status: 401 }));
  }
  if (!TENANT_REGEX.test(tenantId)) {
    return next(Object.assign(new Error('Invalid tenant format'), { status: 400 }));
  }

  const bag = propagation.createBaggage().set('tenant.id', { value: tenantId });
  const ctx = propagation.setBaggage(context.active(), bag);
  context.with(ctx, () => next());
}
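The middleware above covers the JWT claim and the header but not subdomain routing or the registry lookup. The following Python sketch shows one way the full priority order could look. The base domain, the in-memory KNOWN_TENANTS set, the TenantError class, and the 403 status for unknown tenants are all assumptions made for illustration; a real system would query a tenant service or cache, and would read claims from a verified JWT.

```python
import re

TENANT_REGEX = re.compile(r"^[a-zA-Z0-9_-]{8,64}$")  # same pattern as the middleware above
BASE_DOMAIN = "api.example.com"                      # assumed layout: <tenant>.api.example.com
KNOWN_TENANTS = {"acme-corp-01", "globex-prod"}      # hypothetical stand-in for a tenant registry


class TenantError(Exception):
    def __init__(self, message: str, status: int) -> None:
        super().__init__(message)
        self.status = status


def tenant_from_host(host: str):
    """Return the tenant label from the Host value, or None if it is not a tenant subdomain."""
    hostname = host.split(":", 1)[0].lower()  # strip the port; hostnames are case-insensitive
    suffix = "." + BASE_DOMAIN
    if hostname.endswith(suffix):
        return hostname[: -len(suffix)] or None
    return None


def resolve_tenant(jwt_claims: dict, host: str, headers: dict) -> str:
    """Resolve the tenant ID with JWT claim > subdomain > X-Tenant-ID header priority."""
    candidate = (
        jwt_claims.get("tenant")
        or tenant_from_host(host)
        or headers.get("x-tenant-id")
    )
    if not candidate:
        raise TenantError("Missing tenant context", 401)
    if not TENANT_REGEX.fullmatch(candidate):
        raise TenantError("Invalid tenant format", 400)
    if candidate not in KNOWN_TENANTS:
        raise TenantError("Unknown tenant", 403)
    return candidate


print(resolve_tenant({}, "globex-prod.api.example.com:443", {"x-tenant-id": "acme-corp-01"}))
```

Because the subdomain outranks the header here, a caller cannot override the tenant implied by the hostname simply by sending X-Tenant-ID.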

Step 2 — Attach tenant.id to OpenTelemetry Baggage

Once validated, write the tenant ID into OpenTelemetry Baggage so every downstream SDK call automatically forwards it. Also record it as a Span Attribute on the root span so your tracing backend can filter and aggregate by tenant without having to decode the baggage header.

// Go: attach to Baggage and record as a Span Attribute
import (
  "context"

  "go.opentelemetry.io/otel/attribute"
  "go.opentelemetry.io/otel/baggage"
  "go.opentelemetry.io/otel/trace"
)

func attachTenantContext(ctx context.Context, tenantID string) (context.Context, error) {
  m, err := baggage.NewMember("tenant.id", tenantID)
  if err != nil {
    return ctx, err // tenantID is not a valid baggage value
  }
  // SetMember keeps any members already present in the context's baggage
  b, err := baggage.FromContext(ctx).SetMember(m)
  if err != nil {
    return ctx, err
  }
  ctx = baggage.ContextWithBaggage(ctx, b)

  // Also stamp the active span so Jaeger/Tempo can index it
  span := trace.SpanFromContext(ctx)
  span.SetAttributes(attribute.String("tenant.id", tenantID))

  return ctx, nil
}

// Java: makeCurrent() propagates via ThreadLocal; always close the Scope
import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Scope;

// ~50 ns overhead per request from Scope allocation; negligible vs. network latency
try (Scope scope = Baggage.current().toBuilder()
        .put("tenant.id", tenantId)
        .build()
        .makeCurrent()) {

  Span.current().setAttribute("tenant.id", tenantId);
  processRequest();  // downstream calls inherit the context
}

Step 3 — Configure Propagators and Reverse Proxies

A tenant ID in Baggage is only useful if the baggage header survives every network hop. Two places commonly strip it silently:

Reverse proxies (Nginx)

# nginx.conf — forward Baggage through to upstream services
location /api/ {
  proxy_pass         http://backend;
  proxy_set_header   baggage      $http_baggage;
  proxy_set_header   traceparent  $http_traceparent;
  proxy_set_header   tracestate   $http_tracestate;
}

SDK propagator configuration

# Java agent — ensure both propagators are active
otel.propagators=tracecontext,baggage

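# Restrict which keys pass through to prevent arbitrary metadata injection.
# Caution: otel.baggage.keys is not one of the standard SDK autoconfiguration
# properties; confirm your agent supports it before relying on it to restrict keys
otel.baggage.keys=tenant.id,request.region

// Node.js SDK initialisation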
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { W3CBaggagePropagator, W3CTraceContextPropagator, CompositePropagator } = require('@opentelemetry/core');

const sdk = new NodeSDK({
  textMapPropagator: new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new W3CBaggagePropagator()],
  }),
});
sdk.start();

Service mesh sidecars (Envoy, Linkerd) need the same treatment: configure the baggage header in the proxy’s allowed-headers list or it will be dropped at the sidecar layer.

Step 4 — Preserve Context Across Async Boundaries

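Context propagation across Kafka consumers requires explicit serialisation. Unless a Kafka instrumentation library does it for you, nothing injects context into a Kafka record automatically: call the propagator yourself in the producer (or in a producer interceptor) with the record headers as the carrier.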

# Python Kafka producer: serialise context into record headers
from opentelemetry import propagate
from confluent_kafka import Producer

def publish_with_context(producer: Producer, topic: str, payload: bytes) -> None:
    headers: dict[str, str] = {}
    # Inject traceparent, tracestate, and baggage (including tenant.id) into headers
    propagate.inject(headers)

    producer.produce(
        topic,
        value=payload,
        headers=[(k, v.encode()) for k, v in headers.items()],
    )
    # Serve delivery callbacks without blocking; call producer.flush() once at
    # shutdown rather than after every message
    producer.poll(0)

# Python Kafka consumer: reconstruct context before processing
from opentelemetry import propagate, context

def consume_message(msg) -> None:
    # Decode headers from bytes (Kafka allows null header values) and restore the full OTel context
    carrier = {k: v.decode() for k, v in (msg.headers() or []) if v is not None}
    ctx = propagate.extract(carrier)

    token = context.attach(ctx)
    try:
        process_message(msg)  # tenant.id is now in Baggage and propagates further
    finally:
        context.detach(token)

For dead-letter queues and retry workers, include tenant.id in both the baggage header and a dedicated application-level field in the message envelope. That way, routing logic can read the tenant ID without parsing trace headers, and the trace link is preserved independently.
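As a rough illustration of that dual placement, the sketch below builds a dead-letter record whose JSON envelope carries tenant_id while the trace headers travel separately as record headers. The envelope field names and the retry.<tenant> topic naming are hypothetical, chosen only for this example.

```python
import json


def build_dlq_message(payload: dict, tenant_id: str, trace_headers: dict[str, str], error: str):
    """Return (value_bytes, kafka_headers) for a dead-letter record."""
    envelope = {
        "tenant_id": tenant_id,  # application-level field read by routing logic
        "error": error,
        "payload": payload,
    }
    # traceparent and baggage stay in record headers so the trace link survives
    headers = [(key, value.encode("utf-8")) for key, value in trace_headers.items()]
    return json.dumps(envelope).encode("utf-8"), headers


def route_dlq_message(value: bytes) -> str:
    """Pick a per-tenant retry topic by reading only the envelope, never the trace headers."""
    envelope = json.loads(value)
    return f"retry.{envelope['tenant_id']}"


value, headers = build_dlq_message(
    {"order": 1},
    "acme-corp-01",
    {
        "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
        "baggage": "tenant.id=acme-corp-01",
    },
    "timeout",
)
print(route_dlq_message(value))
```

The router stays decoupled from the tracing format, so a change of propagator does not affect retry routing.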

Step 5 — Assert Propagation at Each Service Hop

Add a middleware assertion in every service that reads the baggage and logs a warning (or rejects the request, in strict-isolation mode) when tenant.id is absent. This turns propagation failures into visible signals rather than silent data-quality problems.

// Go — per-hop tenant assertion middleware
func AssertTenantMiddleware(next http.Handler) http.Handler {
  return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    bag := baggage.FromContext(r.Context())
    tenantID := bag.Member("tenant.id").Value()

    if tenantID == "" {
      // In strict mode: reject. In permissive mode: log and continue.
      http.Error(w, "propagation failure: missing tenant.id", http.StatusInternalServerError)
      return
    }

    // Stamp the span so Tempo/Jaeger can index this hop by tenant
    trace.SpanFromContext(r.Context()).SetAttributes(
      attribute.String("tenant.id", tenantID),
    )
    next.ServeHTTP(w, r)
  })
}
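The Go middleware above implements only the strict branch. The sketch below shows both modes as a plain WSGI middleware in Python. It is an assumption-laden simplification: it reads the raw baggage header directly rather than the OpenTelemetry context, does not percent-decode values, and the class, logger, and demo_app names are made up for this example.

```python
import logging

logger = logging.getLogger("tenant_assert")


def tenant_from_baggage(header: str) -> str:
    """Return the tenant.id value from a raw baggage header, or '' if it is absent."""
    for member in header.split(","):
        key, _, value = member.split(";", 1)[0].partition("=")
        if key.strip() == "tenant.id":
            return value.strip()
    return ""


class AssertTenantMiddleware:
    """WSGI middleware: reject (strict) or warn and continue (permissive) without tenant.id."""

    def __init__(self, app, strict: bool = True) -> None:
        self.app = app
        self.strict = strict

    def __call__(self, environ, start_response):
        tenant_id = tenant_from_baggage(environ.get("HTTP_BAGGAGE", ""))
        if not tenant_id:
            if self.strict:
                start_response("500 Internal Server Error", [("Content-Type", "text/plain")])
                return [b"propagation failure: missing tenant.id"]
            # Permissive mode: surface the gap in logs but keep serving the request
            logger.warning("missing tenant.id on %s", environ.get("PATH_INFO", ""))
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Trivial downstream application used to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]
```

Running permissive mode first in production and switching to strict once the warnings stop is one way to roll the assertion out without causing an outage.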

Step 6 — Apply Security and Compliance Filters at the Collector

The OpenTelemetry Collector is the right place to apply tenant-level security controls before trace data reaches storage. Use the attributes processor to hash or delete values that should not be persisted, and the filter processor to drop spans you do not want stored at all.

# otel-collector-config.yaml: hash or delete regulated values, drop internal spans
processors:
  attributes/tenant_filter:
    actions:
      # tenant.id needs no action: attributes pass through unless an action changes them
      # Hash any email that leaked into attributes
      - key: user.email
        action: hash
      # Delete raw baggage dumps; delete also accepts a regex via `pattern` instead of `key`
      - key: baggage.raw
        action: delete
  filter/drop_internal:
    error_mode: ignore
    traces:
      span:
        # Spans matching this OTTL condition are dropped (internal health-check traffic)
        - 'IsMatch(attributes["tenant.id"], "^internal-.*")'

service:
  pipelines:
    traces:
      # receivers and exporters omitted for brevity
      processors: [attributes/tenant_filter, filter/drop_internal]

Verification

Query Jaeger or Tempo for spans where tenant.id is missing to identify propagation gaps:

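-- Tempo TraceQL: find spans where tenant context dropped
-- (matching an unset attribute with "= nil" depends on your Tempo version;
--  set the result limit in the query options rather than in the query)
{ span.tenant.id = "" || span.tenant.id = nil }
| select(resource.service.name, span.http.route)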

In Jaeger UI, filter by the tag tenant.id=<your-id> and inspect the waterfall for gaps. A gap — a span with no tenant.id attribute — points to the upstream service or proxy that stripped the header. Cross-reference that service’s ingress timestamp with the first missing span to isolate the break point.
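If you export a trace as JSON, the same gap hunt can be automated. The sketch below assumes a simplified span shape (span_id, parent_id, service, attributes) that is not the native export format of Jaeger or Tempo, so you would need to map real exports into it first. It reports each parent/child pair where the parent still carried tenant.id and the child did not.

```python
def find_break_points(spans: list[dict]) -> list[tuple[str, str]]:
    """Return (parent_service, child_service) pairs where tenant.id first goes missing."""
    by_id = {span["span_id"]: span for span in spans}
    breaks = []
    for span in spans:
        if span["attributes"].get("tenant.id"):
            continue  # this hop still has tenant context
        parent = by_id.get(span.get("parent_id"))
        # A break point is a span without tenant.id whose parent still had it
        if parent is not None and parent["attributes"].get("tenant.id"):
            breaks.append((parent["service"], span["service"]))
    return breaks


# Made-up trace: context is lost between service-a and service-b
trace = [
    {"span_id": "1", "parent_id": None, "service": "gateway", "attributes": {"tenant.id": "acme-corp-01"}},
    {"span_id": "2", "parent_id": "1", "service": "service-a", "attributes": {"tenant.id": "acme-corp-01"}},
    {"span_id": "3", "parent_id": "2", "service": "service-b", "attributes": {}},
    {"span_id": "4", "parent_id": "3", "service": "worker", "attributes": {}},
]
print(find_break_points(trace))
```

Only the first hop that lost the attribute is reported; spans further down inherit the gap and would otherwise add noise.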

You can also write a CI/CD assertion that replays a test request through your staging environment and asserts that every span in the resulting trace carries tenant.id:

# pytest integration test — assert tenant.id on all spans
def test_tenant_id_propagates(trace_exporter):
    make_request(headers={"X-Tenant-ID": "test-tenant-abc"})
    spans = trace_exporter.get_finished_spans()

    assert len(spans) > 0, "No spans recorded"
    for span in spans:
        assert span.attributes.get("tenant.id") == "test-tenant-abc", (
            f"Missing tenant.id on span: {span.name}"
        )

Edge Cases and Gotchas

  1. Context bleeding across reused threads (Java): When a pooled thread is reused across requests, a Scope that was never closed carries the previous request’s tenant.id into the next one. Always close the Scope with try-with-resources. Go has no ambient scope to close because context.Context is passed explicitly; the equivalent risk there is storing a context in a struct field that outlives the request, so never do that.

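  2. Sampling drops tenant-tagged spans before they are recorded: Baggage keeps propagating when a trace is sampled out, but no spans are exported, so the tenant.id Span Attribute never reaches your backend. Head-based sampling, including parent-based samplers, makes the keep/drop decision at the root span, before downstream services have run. If you need tenant-aware sampling decisions, use tail-based sampling in the Collector. If you rely on tenant.id for billing or SLA reporting, derive those figures from unsampled metrics or logs rather than from sampled traces.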

  3. gRPC metadata key naming: gRPC metadata keys are lowercase on the wire. Depending on the library, a key such as Tenant-ID is either normalised to tenant-id or rejected, so a case-sensitive lookup for the original spelling finds nothing. The same happens if you map to a custom metadata key without registering it in the interceptor chain. Keep all metadata and baggage keys lowercase.

  4. Baggage surviving a redirect: Some HTTP clients drop custom headers, including baggage, when they follow an HTTP 301/302 redirect, particularly across origins. If your API gateway redirects requests (e.g., HTTP→HTTPS or path normalisation), ensure context is re-injected after the redirect rather than relying on the client to forward it.

  5. Out-of-order Kafka consumption: A consumer reading from multiple partitions may process messages for different tenants concurrently in the same thread. Never rely on thread-local context in async consumer loops — always extract and pass context explicitly into each message handler invocation.

  6. Baggage header size limits: Keep the total baggage header under 4 KB in practice (8 KB is a common proxy hard limit). Use a compact tenant ID format (UUID, 36 chars) rather than long human-readable slugs, and avoid adding unbounded metadata to baggage.
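For the size limit in item 6, a guard like the following could run wherever baggage is written. It is a sketch under stated assumptions: the 4096-byte budget mirrors the practical target above and is not a constant from any specification, and the function names are illustrative.

```python
from urllib.parse import quote

MAX_BAGGAGE_BYTES = 4096  # practical budget from the guidance above, not a spec constant


def baggage_header_size(entries: dict[str, str]) -> int:
    """Size in bytes of the baggage header value these entries would produce."""
    header = ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())
    return len(header.encode("utf-8"))


def assert_within_budget(entries: dict[str, str], limit: int = MAX_BAGGAGE_BYTES) -> None:
    """Raise ValueError before an oversized baggage set is propagated."""
    size = baggage_header_size(entries)
    if size > limit:
        raise ValueError(f"baggage is {size} bytes, budget is {limit}")


# A UUID tenant ID costs 46 bytes: 10 for "tenant.id=" plus 36 for the value
print(baggage_header_size({"tenant.id": "123e4567-e89b-12d3-a456-426614174000"}))
```

Failing fast at write time is usually easier to debug than a proxy rejecting or truncating the header several hops later.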

Performance and Scale Notes

  • Baggage extraction overhead: Parsing the baggage header is O(n) in the number of baggage members. Keep the number of entries small (ideally one or two) to avoid degrading hot paths. For a single tenant.id entry the per-hop cost should be well under 1 µs, but measure it in your own services before relying on that figure.
  • Span Attribute cardinality: Recording tenant.id as a Span Attribute is safe if your tenant count is bounded (thousands, not millions). For very high tenant counts, consider recording the attribute only on root and exit spans, not on every internal child span, to avoid cardinality explosion in your metrics pipeline.
  • Collector throughput: The attributes processor in the Collector runs synchronously in the pipeline. Keep allowlist rules simple; complex regex transforms at high throughput (>100k spans/s) can become a bottleneck. Pre-filter at the SDK level where possible.
  • Context propagation in async workers: Using AsyncLocalStorage (Node.js) or contextvars (Python) for async boundary handling is safe for I/O-bound work but requires careful scoping in CPU-bound thread pools where tasks outlive the originating async context.
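The last point is easy to trip over in Python: a value set through contextvars is not visible inside a ThreadPoolExecutor worker unless you carry the context across yourself. The sketch below shows one way to do that with contextvars.copy_context(). The tenant_var variable and the submit_with_context helper are illustrative names, not part of any OpenTelemetry API.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

tenant_var = contextvars.ContextVar("tenant_id", default="")


def current_tenant() -> str:
    """Read the tenant ID from whatever context the calling thread is running in."""
    return tenant_var.get()


def submit_with_context(executor: ThreadPoolExecutor, fn, *args):
    """Submit fn so that it runs inside a snapshot of the caller's context."""
    ctx = contextvars.copy_context()  # snapshot taken at submit time, one per task
    return executor.submit(ctx.run, fn, *args)


with ThreadPoolExecutor(max_workers=1) as pool:
    tenant_var.set("acme-corp-01")
    # On standard CPython builds the worker thread does not inherit the caller's context
    plain = pool.submit(current_tenant).result()
    # With an explicit snapshot, the task sees the tenant that was active at submit time
    wrapped = submit_with_context(pool, current_tenant).result()
    print(repr(plain), repr(wrapped))
```

Because each task gets its own snapshot, a long-running task keeps the tenant it was submitted with even after the originating request has finished.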

Troubleshooting FAQ

Why does tenant.id disappear mid-trace even though my ingress injects it?

The most common cause is a reverse proxy stripping unrecognised headers. Add proxy_set_header baggage $http_baggage; in Nginx (or the equivalent directive in your gateway). The second most common cause is an SDK propagator list that omits the baggage propagator — verify otel.propagators includes baggage alongside tracecontext.

How do I carry tenant context through Kafka without losing the trace link?

Serialise both traceparent and baggage into Kafka record headers before producing (see Step 4 above). On the consumer, extract those headers back into an OpenTelemetry context before starting the consumer span. This preserves W3C TraceContext propagation across the async boundary and keeps the consumer span linked to the producer’s trace.

Should I store tenant.id in Baggage or as a Span Attribute?

Use both: Baggage propagates the value to every downstream service automatically; a Span Attribute makes it queryable in Jaeger or Tempo without decoding headers. See Baggage vs Span Attributes for the trade-offs in detail.

What is the safe maximum size for tenant-related baggage?

The W3C Baggage specification only requires implementations to propagate up to 8192 bytes in total (and 64 list members), and many proxies and gRPC implementations reject or truncate headers beyond 8 KB. Target under 4 KB for the entire baggage header in practice. A UUID (36 chars) is an ideal tenant ID format.

How do I prevent external callers from injecting arbitrary baggage keys?

Validate the tenant ID format at the ingress layer before writing it into Baggage (regex or allowlist). Apply an allowlist-based attributes processor in the OpenTelemetry Collector to drop any baggage-derived attributes that are not explicitly permitted. This prevents log poisoning and stops cardinality attacks on your metrics backend.
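At the ingress itself, one complementary measure is to rewrite the inbound baggage header before it enters your context. The sketch below keeps only allowlisted pass-through keys, discards any caller-supplied tenant.id, and appends the value you validated in Step 1. The allowlist contents and the function name are assumptions for this example.

```python
PASSTHROUGH_KEYS = {"request.region"}  # assumed allowlist of caller-supplied keys


def sanitize_inbound_baggage(header: str, validated_tenant_id: str) -> str:
    """Drop untrusted baggage members and pin tenant.id to the validated value."""
    kept = []
    for member in header.split(","):
        if "=" not in member:
            continue  # malformed or empty member
        key = member.split("=", 1)[0].strip()
        # tenant.id from the caller is never trusted, even if it looks valid
        if key in PASSTHROUGH_KEYS:
            kept.append(member.strip())
    kept.append(f"tenant.id={validated_tenant_id}")
    return ",".join(kept)


print(sanitize_inbound_baggage("tenant.id=evil, debug=true,request.region=eu;ttl=1", "acme-corp-01"))
```

Filtering at the edge stops unwanted keys from ever reaching internal services, while the Collector-side processor remains a second line of defence.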

