
OpenTelemetry Tutorial 2026: Complete Setup Guide for SRE & DevOps

Hands-on OpenTelemetry tutorial covering instrumentation, collector configuration, and distributed tracing setup for SRE and DevOps engineers in 2026.

June 26, 2026 · 20 min read
#opentelemetry#observability#distributed-tracing#prometheus#jaeger#devops

Introduction

If you operate microservices in production, you already know the pain. A user reports a slow checkout. You open three different tools: Grafana for metrics, Jaeger for traces, and grep for logs. By the time you correlate the request ID across all three, the incident has been open for 45 minutes.

OpenTelemetry (OTel) solves this by unifying all three signals under one standard. It is now the CNCF's second-most active project after Kubernetes, and every major observability vendor — Datadog, Honeycomb, Grafana Labs, New Relic — has adopted its protocol. In 2026, if you are not instrumenting with OpenTelemetry, you are building technical debt every time you ship code.

This tutorial walks you through a complete OpenTelemetry setup: instrumentation with the OTel SDK, collector configuration, and exporting traces and metrics to Jaeger and Prometheus. Everything is hands-on with real YAML and code snippets you can run today.

By the end, you will have:

  • A Python service auto-instrumented with traces and metrics
  • An OpenTelemetry Collector processing and exporting telemetry
  • Traces visible in Jaeger and metrics scraped by Prometheus
  • A working mental model of OTel's pipeline architecture

What Is OpenTelemetry, Actually?

OpenTelemetry is not a backend. It is not a database, a dashboard, or an alerting engine. It is a telemetry pipeline standard — a specification, a set of SDKs, and a collector binary that together generate, process, and export traces, metrics, and logs.

The project emerged from the 2019 merger of OpenTracing and OpenCensus. Both were CNCF observability projects with overlapping goals. Rather than compete, they merged into a single standard. Today, OTel is at version 1.34+ and is considered stable for traces and metrics.

Three things make OpenTelemetry different from what came before:

  1. Vendor-neutral instrumentation. You instrument once with the OTel SDK. Changing backends — from Jaeger to Honeycomb, or from Prometheus to Datadog — means changing an exporter config, not rewriting code.

  2. The Collector. A standalone binary that receives, processes, and exports telemetry. You can run it as a sidecar, a DaemonSet, or a central gateway. It handles batching, filtering, sampling, and routing, all config-driven.

  3. Context propagation. The traceparent header (W3C Trace Context standard) passes trace context across HTTP, gRPC, and message queues. Every hop in your distributed system links back to a single root span without custom headers.
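The header itself is a short, fixed-format string: version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and trace flags, separated by dashes (W3C Trace Context, version 00). Below is a minimal stdlib sketch that parses an incoming value and derives the header for the next hop. parse_traceparent and child_traceparent are hypothetical helpers for illustration; in practice the OTel propagator does this for you. The sample value is the example from the W3C specification.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version-00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed or uses all-zero IDs."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Keep the trace ID, mint a fresh span ID for the next hop."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
print(parsed)
print(child_traceparent(parsed))
```

The trace ID is copied unchanged while the span ID is replaced at every hop; that is what lets the backend stitch all hops into one trace.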

The telemetry pipeline looks like this:

Application Code --> OTel SDK --> OTel Collector --> Backend (Jaeger/Prometheus/...)
     (API calls)       (auto/manual)    (process/route)         (store/query)

The SDK generates spans and metrics inside your application process. The Collector — a separate binary — receives them via OTLP (OpenTelemetry Protocol) over gRPC or HTTP, then applies processors and exports to one or more backends.

This separation matters. Your application never talks directly to Jaeger or Prometheus. It only talks to the Collector. The Collector absorbs backend changes without touching application code.

OpenTelemetry Architecture: The Pipeline Model

Every observability signal in OTel follows the same pipeline: Instrumentation → Processing → Export.

The Three Components

1. Instrumentation Libraries (SDK)

The SDK lives inside your application process. It creates spans, records metrics, and captures log events. OTel provides SDKs for Python, Go, Java, JavaScript, .NET, Rust, and more. You can use auto-instrumentation (zero code changes: the agent injects hooks at runtime) or manual instrumentation (explicit tracer.start_span() and span.end() calls, or the tracer.start_as_current_span() context manager, in your code).

Auto-instrumentation covers most common libraries by default: HTTP frameworks (Flask, Express, Spring), database drivers (psycopg2, pgx, JDBC), and gRPC clients. For custom business logic, you add manual spans.

2. The OpenTelemetry Collector

The Collector is the backbone of any production OTel deployment. It is a single Go binary (otelcol-contrib) that runs three types of components in a pipeline:

  • Receivers: Accept telemetry data (OTLP gRPC, OTLP HTTP, Jaeger, Zipkin, Prometheus scrape)
  • Processors: Transform data in-flight (batch, filter, tail sampling, attributes mutation, redaction)
  • Exporters: Send data to backends (Jaeger, Prometheus, Datadog, Honeycomb, Kafka, stdout)

The Collector decouples your application from backends. If you switch from Jaeger to Tempo, or add a second exporter for Honeycomb, you change one YAML file — not every microservice.
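As a mental model only (this is not how the Collector is implemented), a pipeline is a chain of processors whose output fans out to every configured exporter. Here is a stdlib toy with hypothetical component names:

```python
from typing import Callable, Dict, List

Span = Dict[str, object]
Processor = Callable[[List[Span]], List[Span]]
Exporter = Callable[[List[Span]], None]


def run_pipeline(spans: List[Span], processors: List[Processor], exporters: List[Exporter]) -> None:
    """Apply processors in order, then hand the same result to every exporter."""
    for process in processors:
        spans = process(spans)
    for export in exporters:  # fan-out: each exporter receives the full batch
        export(spans)


def add_environment(spans: List[Span]) -> List[Span]:
    # Behaves like an attributes processor with an upsert action
    return [{**s, "environment": "production"} for s in spans]


jaeger_sink: List[Span] = []
honeycomb_sink: List[Span] = []

run_pipeline(
    [{"name": "POST /checkout"}],
    processors=[add_environment],
    exporters=[jaeger_sink.extend, honeycomb_sink.extend],
)
print(jaeger_sink, honeycomb_sink)
```

Adding a backend is one more entry in the exporters list, which mirrors the one-file YAML change described above.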

3. Exporters and Backends

Exporters are protocol-specific components that push data to observability backends. Common exporters include:

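| Exporter | Protocol | Typical backend |
| --- | --- | --- |
| otlp | OTLP over gRPC or HTTP | Any OTLP-compatible backend (Jaeger 1.35+, Tempo, Grafana Agent) |
| prometheus | HTTP endpoint scraped by Prometheus | Prometheus server |
| jaeger | Thrift/gRPC | Jaeger backend; removed from the Collector in v0.85.0, so use otlp instead |
| logging (debug in newer releases) | stdout | Debugging during development |
| kafka | Kafka | Long-term buffering, multi-datacenter pipelines |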

The OTLP Protocol

All communication between the SDK and the Collector uses OTLP (OpenTelemetry Protocol). OTLP is a Protobuf-based protocol that runs over gRPC (port 4317) or HTTP/1.1 (port 4318). In 2026, OTLP over HTTP has matured enough that many teams prefer it over gRPC for simpler firewall traversal and load balancer compatibility.

A typical OTLP trace payload is a binary-encoded Protobuf message containing resource attributes (service name, host, namespace), span data (trace ID, span ID, parent span ID, start/end timestamps, attributes, events), and instrumentation scope.
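OTLP/HTTP also accepts a JSON encoding of the same messages (Content-Type: application/json, POSTed to /v1/traces), which makes the payload structure easy to see. The following sketch builds a one-span payload with the standard library, assuming a Collector listening on localhost:4318. The service name, span name, and scope name are illustrative. In OTLP/JSON, trace and span IDs are hex strings rather than base64, and the span kind is the integer enum value (2 = SPAN_KIND_SERVER).

```python
import json
import secrets
import time
import urllib.request


def build_otlp_json(service_name: str, span_name: str, duration_ms: int) -> dict:
    """Build a minimal ExportTraceServiceRequest in OTLP/JSON form."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-otlp-json"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex chars
                    "spanId": secrets.token_hex(8),    # 16 hex chars
                    "name": span_name,
                    "kind": 2,                         # SPAN_KIND_SERVER
                    # 64-bit integers are encoded as strings in proto3 JSON
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


def send(payload: dict, url: str = "http://localhost:4318/v1/traces") -> int:
    """POST the payload to an OTLP/HTTP endpoint and return the HTTP status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


payload = build_otlp_json("checkout-service", "POST /checkout", duration_ms=150)
print(json.dumps(payload, indent=2))
# send(payload)  # uncomment with a Collector running on localhost:4318
```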

Setting Up the OpenTelemetry Collector

Let's start with the Collector — it is the first piece you deploy because your applications need somewhere to send telemetry.

Step 1: Install the Collector

The recommended distribution is otelcol-contrib, which includes receivers and exporters for every major observability tool:

# Linux (AMD64)
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.110.0/otelcol-contrib_0.110.0_linux_amd64.tar.gz
tar -xzf otelcol-contrib_0.110.0_linux_amd64.tar.gz
sudo mv otelcol-contrib /usr/local/bin/

# Verify
otelcol-contrib --version

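For Docker-based development (the contrib image reads its config from /etc/otelcol-contrib/config.yaml; port 8888 serves the Collector's own metrics and 8889 the Prometheus exporter configured below):

docker run -d --name otel-collector \
  -p 4317:4317 -p 4318:4318 -p 8888:8888 -p 8889:8889 \
  -v $(pwd)/otel-config.yaml:/etc/otelcol-contrib/config.yaml \
  otel/opentelemetry-collector-contrib:0.110.0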

Step 2: Write the Collector Configuration

Create otel-config.yaml. This is the heart of your observability pipeline:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp/jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus, logging]

This configuration does several things:

  • Receivers listen on ports 4317 (gRPC) and 4318 (HTTP) for OTLP data from applications
  • Processors limit memory usage to 512 MiB, add an environment=production attribute to every span, and batch spans for efficiency
  • Exporters forward traces to Jaeger over OTLP, expose metrics on port 8889 for Prometheus scraping, and print detailed telemetry to stdout. The dedicated jaeger exporter was removed from the Collector (Jaeger 1.35+ ingests OTLP natively), and newer Collector releases replace logging with the debug exporter.
  • Pipelines wire everything together: traces and metrics take different paths through the same Collector

The batch processor is critical for production. Without it, the Collector exports each incoming request's spans as soon as they arrive, so many small SDK payloads turn into many small export calls to Jaeger. With it, the Collector sends a batch once about 512 spans have accumulated, or after 5 seconds, amortizing the per-request network and encoding overhead.
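The size-or-timeout rule is easy to model. Here is a simplified stdlib sketch (not the processor's actual code) that counts export calls for illustrative traffic, using the send_batch_size: 512 and timeout: 5s values from the config above:

```python
from typing import List, Optional, Tuple


def simulate_batching(
    arrivals: List[Tuple[float, int]],
    send_batch_size: int = 512,
    timeout_s: float = 5.0,
) -> List[int]:
    """Return the span count of each export call (simplified model).

    arrivals holds (timestamp_seconds, span_count) for each incoming OTLP
    request, sorted by time. A batch is sent when it reaches send_batch_size
    spans, or when timeout_s has passed since its oldest span arrived.
    """
    exports: List[int] = []
    buffered = 0
    oldest: Optional[float] = None
    for ts, count in arrivals:
        if oldest is not None and ts - oldest >= timeout_s:
            exports.append(buffered)  # timeout-triggered send
            buffered, oldest = 0, None
        buffered += count
        if oldest is None:
            oldest = ts
        if buffered >= send_batch_size:
            exports.append(buffered)  # size-triggered send (size is a trigger, not a cap)
            buffered, oldest = 0, None
    if buffered:
        exports.append(buffered)  # flush whatever is left at shutdown
    return exports


# 100 OTLP requests per second for 10 seconds, each carrying 5 spans
traffic = [(i * 0.01, 5) for i in range(1000)]
exports = simulate_batching(traffic)
print(len(traffic), "incoming requests ->", len(exports), "export calls")
# 1000 incoming requests -> 10 export calls
```

With steady traffic the size trigger dominates; with sparse traffic, the timeout flushes small batches instead.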

Step 3: Run the Collector

otelcol-contrib --config=otel-config.yaml

You should see log output confirming that all receivers, processors, and exporters are active. The Collector is now ready to receive telemetry from your applications.

Instrumenting Your First Application

Now that the Collector is running, let's instrument a Python web service. We will use Flask for the HTTP layer and the OpenTelemetry Python SDK for auto-instrumentation, then add manual spans for custom business logic.

Step 1: Install Dependencies

pip install flask opentelemetry-api opentelemetry-sdk \
  opentelemetry-instrumentation-flask \
  opentelemetry-instrumentation-requests \
  opentelemetry-exporter-otlp-proto-grpc

The key packages:

  • opentelemetry-api and opentelemetry-sdk — the core OTel SDK
  • opentelemetry-instrumentation-flask — auto-instrumentation for Flask (creates spans for each HTTP request automatically)
  • opentelemetry-instrumentation-requests — auto-instrumentation for the requests library (spans for outbound HTTP calls)
  • opentelemetry-exporter-otlp-proto-grpc — the OTLP exporter that sends data to our Collector

Step 2: Write the Application

# app.py
from flask import Flask, request, jsonify
import requests
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# --- OTel Setup ---

resource = Resource(attributes={
    SERVICE_NAME: "checkout-service",
    "deployment.environment": "staging"
})

provider = TracerProvider(resource=resource)

otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",
    insecure=True
)

provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

# --- Application ---

app = Flask(__name__)

# Auto-instrument Flask and outgoing HTTP requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

# Get a tracer for manual instrumentation
tracer = trace.get_tracer(__name__)


@app.route("/checkout", methods=["POST"])
def checkout():
    """Process a checkout — spans created automatically by FlaskInstrumentor."""

    data = request.get_json(silent=True) or {}

    # Manual span for the payment processing step
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", data.get("amount", 0))
        span.set_attribute("payment.method", data.get("method", "unknown"))

        # Simulate payment work
        time.sleep(0.15)

        payment_result = process_payment(data.get("amount", 0))

        span.set_attribute("payment.status", payment_result["status"])
        span.set_status(trace.Status(trace.StatusCode.OK))

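    # Manual span for inventory update
    with tracer.start_as_current_span("update_inventory") as span:
        span.set_attribute("inventory.items", len(data.get("items", [])))

        time.sleep(0.08)

        # Outbound HTTP call, automatically traced by RequestsInstrumentor
        try:
            resp = requests.post(
                "http://inventory-service:5001/update",
                json={"items": data.get("items", [])},
                timeout=2,
            )
            span.set_attribute("inventory.response_code", resp.status_code)
        except requests.RequestException as exc:
            # Keep the demo working when inventory-service is not running
            span.record_exception(exc)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(exc)))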

    return jsonify({"status": "ok", "order_id": "ord-2026-abc"})


def process_payment(amount: float) -> dict:
    """Simulated payment gateway call."""
    return {"status": "authorized", "transaction_id": "txn-42", "amount": amount}


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

What This Code Does

Every /checkout request now generates a trace with multiple spans:

  1. Root span — created automatically by FlaskInstrumentor for the HTTP request
  2. process_payment — manual span wrapping payment logic, with custom attributes (amount, method, status)
  3. update_inventory — manual span wrapping inventory logic
  4. HTTP POST to inventory-service — nested span created by RequestsInstrumentor, linked to the parent update_inventory span

Context propagation is automatic. When the /checkout handler calls requests.post(...), the OTel SDK injects the traceparent header into the outbound HTTP request. If the inventory service is also instrumented with OTel, it extracts that header and continues the same trace — creating a single distributed trace across both services.

Step 3: Run and Verify

# Terminal 1: Start the Collector (if not already running)
otelcol-contrib --config=otel-config.yaml

# Terminal 2: Start the Flask app
python app.py

# Terminal 3: Generate a trace
curl -X POST http://localhost:5000/checkout \
  -H "Content-Type: application/json" \
  -d '{"amount": 49.99, "method": "card", "items": [{"id": 1}, {"id": 2}]}'

Check the Collector's debug log output. You should see spans being received, processed, and exported. The logging exporter will print span summaries to stdout — useful for debugging before you wire up Jaeger.

Look for lines like:

Span #0
    Trace ID       : 6e8f4c7a1b2d3e4f5a6b7c8d9e0f1a2b
    Parent ID      :
    ID             : 3a4b5c6d7e8f9a0b
    Name           : POST /checkout
    Kind           : Server
    ...

Span #1
    Trace ID       : 6e8f4c7a1b2d3e4f5a6b7c8d9e0f1a2b
    Parent ID      : 3a4b5c6d7e8f9a0b
    ID             : 1b2c3d4e5f6a7b8c
    Name           : process_payment
    ...

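The shared Trace ID confirms that context is flowing inside the service: both spans belong to the same trace, and process_payment's Parent ID equals the root span's ID. Cross-service propagation is confirmed once spans from inventory-service appear under the same Trace ID.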

Exporting Traces to Jaeger

The logging exporter is useful for debugging, but you need a real trace backend. Let's set up Jaeger and configure the Collector to forward traces.

Step 1: Run Jaeger All-in-One

For development, Jaeger's all-in-one image bundles the collector, query UI, and in-memory storage:

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 14317:4317 \
  jaegertracing/all-in-one:1.62

  • Port 16686: Jaeger UI (open http://localhost:16686)
  • Port 14317: Jaeger's OTLP gRPC receiver (container port 4317; Jaeger can accept OTLP directly as of 1.35+). It is published on a different host port because the Collector already listens on 4317.

However, routing through our Collector is the production pattern. Update the Collector config to point at Jaeger. This assumes the Collector runs as a binary on the host; if it runs in Docker, put both containers on one Docker network and use jaeger:4317 as the endpoint instead.

# otel-config.yaml (exporter section update)
exporters:
  otlp/jaeger:
    endpoint: localhost:14317
    tls:
      insecure: true
  # ... keep the other exporters

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp/jaeger, logging]

Step 2: Generate Traces and Inspect

Send a few checkout requests:

for i in $(seq 1 5); do
  curl -s -X POST http://localhost:5000/checkout \
    -H "Content-Type: application/json" \
    -d '{"amount": 49.99, "method": "card", "items": [{"id": 1}]}' > /dev/null
done

Open Jaeger UI at http://localhost:16686:

  1. Select checkout-service from the Service dropdown
  2. Click Find Traces
  3. You should see 5 traces, each containing multiple spans

Click any trace to view the waterfall diagram. You will see the parent POST /checkout span and its children — process_payment, update_inventory, and potentially the outbound HTTP call to inventory-service. Expand a span to see attributes like payment.amount, payment.method, and payment.status.

Debugging Tip: Missing Spans

If spans are missing (no trace at all, or a trace with only some of its spans), check:

# Verify the Collector is receiving spans
curl http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans

# Check the running Collector's logs for export errors (Docker shown; for the binary, search its stdout)
docker logs otel-collector 2>&1 | grep -i error

Common causes:

  • Batch delays: The SDK's BatchSpanProcessor waits up to 5 seconds by default before exporting, and the Collector's batch processor waits up to another 5 seconds (timeout: 5s). Wait about 10 seconds after sending a request.
  • OTLP endpoint mismatch: The SDK sends to localhost:4317 but the Collector listens on a different host. Use 0.0.0.0:4317 in the Collector config for local dev.
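  • TLS mismatch: If the Collector's receiver expects TLS but the SDK sends plaintext (or vice versa), exports fail, and the only symptom may be a transient-error warning in the application's log. When the receiver has no tls block, configure the SDK exporter with insecure=True, as app.py does.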

Exporting Metrics to Prometheus

Traces tell you what happened. Metrics tell you how often and how fast. OTel's metrics pipeline works the same way, but the Prometheus exporter is an HTTP server that Prometheus scrapes — it does not push.

Step 1: Configure Prometheus Scrape

Add a scrape target to your prometheus.yml:

scrape_configs:
  - job_name: "otel-collector"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8889"]

The Collector's Prometheus exporter already listens on port 8889 (from our earlier config). If the Collector runs in Docker, make sure port 8889 is published as well.

Step 2: Auto-Instrument Metrics

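The Flask instrumentation also records HTTP server metrics automatically, but they are not exported until a MeterProvider with an exporter is configured:

# Add to app.py right after the trace setup, before FlaskInstrumentor().instrument_app(app)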
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True),
    export_interval_millis=15000
)

meter_provider = MeterProvider(
    resource=resource,
    metric_readers=[metric_reader]
)
metrics.set_meter_provider(meter_provider)

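This exports the HTTP server metrics FlaskInstrumentor generates automatically. With the instrumentation's default (older) HTTP semantic conventions, these are an http.server.duration latency histogram and an http.server.active_requests count of in-flight requests. The histogram's count gives request rate, and its http.status_code attribute lets you compute error rate.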

Step 3: Verify Metrics in Prometheus

# Check that Prometheus is scraping the Collector
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="otel-collector")'

# Query a metric (the otel_ prefix comes from namespace: otel in the Collector's Prometheus exporter)
curl "http://localhost:9090/api/v1/query?query=otel_http_server_duration_milliseconds_bucket"

The metrics pipeline is now live: your application generates metrics, the SDK ships them to the Collector, and Prometheus scrapes the Collector's Prometheus exporter endpoint. Grafana can query Prometheus to build dashboards.
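Prometheus series names are derived from the OTel metric name, unit, and type, and the namespace: otel setting in the Collector config prefixes every one of them. The following is a simplified sketch of those translation rules. It handles only a few units; the real exporter covers many more cases. It can help when a query returns nothing because of a naming mismatch.

```python
import re

# Small, assumed subset of the unit-to-suffix mapping
UNIT_SUFFIXES = {"ms": "milliseconds", "s": "seconds", "By": "bytes", "1": ""}


def prometheus_series(name: str, unit: str, kind: str, namespace: str = "") -> list:
    """Approximate the series names the Collector's Prometheus exporter exposes.

    kind is "counter", "gauge" or "histogram". Simplified: for example, gauges
    with unit "1" actually get a _ratio suffix, which is ignored here.
    """
    base = re.sub(r"[^a-zA-Z0-9_]", "_", name)  # dots and other chars -> "_"
    suffix = UNIT_SUFFIXES.get(unit, "")
    if suffix and not base.endswith(suffix):
        base = f"{base}_{suffix}"
    if namespace:
        base = f"{namespace}_{base}"
    if kind == "counter":
        return [f"{base}_total"]
    if kind == "histogram":
        return [f"{base}_bucket", f"{base}_sum", f"{base}_count"]
    return [base]


print(prometheus_series("http.server.duration", "ms", "histogram", namespace="otel"))
# ['otel_http_server_duration_milliseconds_bucket', 'otel_http_server_duration_milliseconds_sum', 'otel_http_server_duration_milliseconds_count']
print(prometheus_series("checkout.orders", "1", "counter", namespace="otel"))
# ['otel_checkout_orders_total']
```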

Adding a Custom Metric

Beyond auto-instrumentation, add a business-level counter:

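from flask import jsonify, request
from opentelemetry import metrics

# Create the meter once, after metrics.set_meter_provider(...) has run
meter = metrics.get_meter(__name__)
order_counter = meter.create_counter(
    "checkout.orders",
    description="Number of completed checkouts",
    unit="1"
)

# Update the existing checkout() handler; do not register the route a second time
@app.route("/checkout", methods=["POST"])
def checkout():
    data = request.get_json(silent=True) or {}
    # ... existing payment and inventory code ...
    order_counter.add(1, {"method": data.get("method", "unknown")})
    return jsonify({"status": "ok", "order_id": "ord-2026-abc"})

Now you have an otel_checkout_orders_total series in Prometheus, labeled by payment method. The otel_ prefix comes from the exporter's namespace setting. Query it to track business throughput, not just infrastructure health.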

Deploying OpenTelemetry on Kubernetes

Running the Collector as a standalone binary works for development. In production, you deploy it to Kubernetes using one of three patterns: a per-pod sidecar, a per-node DaemonSet, or a central gateway. Each has tradeoffs in scalability, latency, and operational complexity.

Pattern 1: Sidecar (Per-Pod Collector)

A Collector container runs alongside your application container in the same pod. The application sends telemetry to localhost:4317.

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: app
          image: checkout-service:latest
          ports:
            - containerPort: 5000
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://localhost:4317"

        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.110.0
          args: ["--config=/etc/otelcol/config.yaml"]
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otelcol
      volumes:
        - name: otel-config
          configMap:
            name: otel-collector-sidecar-config

Pros: Simple, no network hops, pod-level isolation. Cons: One Collector per pod wastes resources. 100 pods = 100 Collectors. Not suitable for large clusters unless you run low-resource Collector replicas.

Pattern 2: DaemonSet (Per-Node Collector)

One Collector runs on every node as a DaemonSet. Pods on that node send telemetry to the node-local Collector at the node's IP address, where the Collector's hostPorts (4317 and 4318) are exposed.

# otel-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      hostNetwork: true
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.110.0
          args: ["--config=/etc/otelcol/config.yaml"]
          ports:
            - containerPort: 4317
              hostPort: 4317
            - containerPort: 4318
              hostPort: 4318
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otelcol
          resources:
            limits:
              memory: 512Mi
              cpu: 500m
      volumes:
        - name: otel-config
          configMap:
            name: otel-collector-daemonset-config


Advanced Sampling Strategies

Tail sampling is one of OpenTelemetry's most powerful features — and one of the easiest to misconfigure. Understanding the decision flow will save you from exploding your telemetry bill or dropping critical traces.

Head-Based Sampling (Probabilistic)

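Head sampling decides whether to keep a trace up front, without waiting for the rest of it: no buffering, no delay. The decision can be made in the SDK, by its sampler, or statelessly in the Collector, by the probabilistic_sampler processor. If you configure nothing, nothing is sampled away: the SDK default is ParentBased(AlwaysOn), which keeps every span. The Collector variant looks like this (add it to the traces pipeline's processors list):

# Collector config for probabilistic sampling
processors:
  probabilistic_sampler:
    sampling_percentage: 10

This keeps roughly 10% of traces and drops the other 90%. The processor hashes the trace ID, so all spans of one trace get the same decision. The dropped spans were still created, serialized, and sent to the Collector, though. Only sampling in the SDK keeps those bytes from leaving the application process. Use head sampling when:

  • You are cost-sensitive: Every exported span costs storage and network. At 1,000 requests per second, keeping 100% of spans can saturate your observability budget.
  • You want a cheap, representative sample rather than specific traces: Probabilistic sampling gives you an unbiased subset of all traffic. The tradeoff is that any individual slow or failing request is kept only 10% of the time, so the exact trace you need for debugging may be gone.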

Tail-Based Sampling

Tail sampling makes the decision only after the trace has had time to complete. The processor buffers spans by trace ID for decision_wait (30 seconds here) after the first span of a trace arrives, then evaluates its policies against everything it has collected. Top-level policies are OR-ed, so a trace is kept if any of them samples it:

processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-over-2s
        type: latency
        latency: {threshold_ms: 2000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 25}

This configuration keeps 100% of traces that contain a span with an error status, 100% of traces whose end-to-end duration exceeds 2 seconds, and 25% of everything else; the rest are dropped. If you really want only traces that are both failing and slow, nest the first two policies under a single policy of type: and (listed under and.and_sub_policy). Tail sampling lets you capture the full picture of every slow or failing request without storing every fast one.
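Here is a simplified stdlib model of how the three policies combine (a trace is kept if any policy samples it), evaluated over one fully buffered trace. It is not the processor's code. It uses SHA-256 of the trace ID for the probabilistic step, where the real policy uses its own hash. FinishedSpan is a hypothetical stand-in for real span data.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class FinishedSpan:
    trace_id: str
    start_ms: int
    end_ms: int
    is_error: bool = False


def keep_trace(spans: List[FinishedSpan], latency_threshold_ms: int = 2000,
               sampling_percentage: float = 25.0) -> bool:
    """Simplified model of OR-ed tail-sampling policies over one complete trace."""
    # Policy "errors": any span with an error status keeps the whole trace
    if any(s.is_error for s in spans):
        return True
    # Policy "slow": trace duration = latest end minus earliest start
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > latency_threshold_ms:
        return True
    # Policy "probabilistic": hash the trace ID so the decision is deterministic
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sampling_percentage * 100


fast = [FinishedSpan("trace-a", 0, 120), FinishedSpan("trace-a", 10, 90)]
slow = [FinishedSpan("trace-b", 0, 2500)]
failed = [FinishedSpan("trace-c", 0, 50, is_error=True)]
print(keep_trace(fast), keep_trace(slow), keep_trace(failed))
```

The model only works because the whole trace is available when keep_trace runs, which is exactly what decision_wait buys you, at the cost of buffering every trace in memory for that long.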

When to Use Each

| Strategy | When to use |
| --- | --- |
| Head (probabilistic) | You have a strict sampling budget: you cannot store more than X spans per second. Use for high-throughput, cost-sensitive, always-on observability. |
| Tail (policy-based) | You need every trace from a specific slow request. Use when debugging errors, analyzing latency, or auditing compliance. |

Common Pitfalls and Troubleshooting

1. The Collector Is Dropping Spans Silently

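This is the most common OTel production issue. The Collector receives spans from the SDK and runs them through the processors, but the exporter never delivers them. Typical root causes: the backend is slow or unreachable and the exporter's sending queue fills up, or a single batch grows larger than the receiving gRPC server's message-size limit (4 MiB by default in gRPC) and the backend rejects the whole request. Nothing surfaces in your application. The failures only show up in the Collector's logs and in its own metrics, such as otelcol_exporter_send_failed_spans and otelcol_exporter_enqueue_failed_spans.

Fix: Give the batch processor an upper bound as well as a target size:

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048

Why this works: send_batch_size is a trigger, not a cap. It tells the processor when to send, but during a burst a batch can grow well beyond it, and the Collector's defaults (send_batch_size: 8192, timeout: 200ms) already favor large batches. send_batch_max_size splits anything bigger into chunks of at most 2,048 spans, which keeps each export request under the receiver's size limit for typical span sizes; lower it further if your spans carry large attributes. For backend outages, check the exporter's sending_queue and retry_on_failure settings (both enabled by default on the otlp exporter) so short outages are absorbed instead of dropped.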

2. The traceparent Header Is Missing

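Your service A calls service B over HTTP. Service B is also instrumented with OTel. But the trace breaks: service B never receives a traceparent header, so its server span starts a new trace instead of joining service A's, and Jaeger shows two disconnected traces.

Diagnosis: First confirm that service B extracts context correctly by sending it a request that already carries a traceparent header:

curl -H "traceparent: 00-..." http://service-b:5000/endpoint

If service B's server span in Jaeger shows the trace ID from that header, extraction on B works, and the problem is on A's side: A's outbound HTTP client is not injecting the header. Verify that the instrumentation library for A's HTTP client is installed and loaded (for example, RequestsInstrumentor().instrument() for requests), and that nothing has replaced the global propagator. The default propagators are already W3C Trace Context plus Baggage unless OTEL_PROPAGATORS overrides them. To set them explicitly:

# Explicitly configure the global propagator (W3C Trace Context + Baggage)
from opentelemetry import trace
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# Set the global propagator once at startup
set_global_textmap(
    CompositePropagator([
        TraceContextTextMapPropagator(),
        W3CBaggagePropagator(),
    ])
)

# Then create the TracerProvider
provider = TracerProvider()
trace.set_tracer_provider(provider)

TraceContextTextMapPropagator is what writes the W3C traceparent header when instrumentation injects context into an outbound request. If the global propagator does not include it (for example, OTEL_PROPAGATORS set to b3 only), context propagation between W3C-speaking services fails silently.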

3. High Cardinality Attributes Crash the Backend

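Span attributes like user.id, request.id, and session.id have unbounded values. Backends that index span attributes pay for every unique value in index size and storage, and the same attributes used as metric labels create one time series per value. At scale, this can slow down or overload the backend.

Remediation: Drop high-cardinality attributes at the SDK level. Span processors only see read-only spans in on_end(), so the filtering has to happen on the way out, in a wrapper around the real exporter:

# Filter attributes in an exporter wrapper: on_end() only sees read-only spans
from typing import Sequence
from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult

ALLOWED_KEYS = {"http.method", "http.url", "http.status_code"}

class AttributeFilteringExporter(SpanExporter):
    def __init__(self, delegate: SpanExporter):
        self._delegate = delegate

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        # Copy each span, keeping only allow-listed attributes
        return self._delegate.export([
            ReadableSpan(
                name=s.name, context=s.context, parent=s.parent, resource=s.resource,
                attributes={k: v for k, v in (s.attributes or {}).items() if k in ALLOWED_KEYS},
                events=s.events, links=s.links, kind=s.kind, status=s.status,
                start_time=s.start_time, end_time=s.end_time,
                instrumentation_scope=s.instrumentation_scope,
            )
            for s in spans
        ])

    def shutdown(self) -> None:
        self._delegate.shutdown()

Wire it in with provider.add_span_processor(BatchSpanProcessor(AttributeFilteringExporter(otlp_exporter))). It keeps only http.method, http.url, and http.status_code, dropping user.id, session tokens, and every other high-cardinality field, so the backend stays stable. Watch http.url itself: query strings and IDs embedded in paths can make it high-cardinality too.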

4. Memory Usage Grows Unbounded

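The Collector buffers data in its batch processor and exporter queues, so when exporters fall behind, memory grows with the backlog. Under sustained load, 512 MiB becomes 1 GiB, then 2 GiB, until the OOM killer strikes.

Fix: Configure the memory_limiter processor explicitly, and put it first in every pipeline:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100

limit_mib is the hard limit (400 MiB), and spike_limit_mib is headroom for growth between checks. The spike limit must be smaller than limit_mib, and the soft limit is their difference (300 MiB). Above the soft limit, the processor refuses new data, returning errors so receivers can push back on senders; above the hard limit, it also forces garbage collection. Size limit_mib below the container's memory limit (512Mi in the DaemonSet above) so the limiter acts before the kernel does.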
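The relationship between the two settings is simple arithmetic. The stdlib sketch below models the check the processor performs every check_interval, simplified to the two thresholds. The 20% default for an unset spike_limit_mib follows the processor's documentation, and MemoryLimiter is a hypothetical helper, not a Collector API.

```python
from dataclasses import dataclass


@dataclass
class MemoryLimiter:
    limit_mib: int            # hard limit
    spike_limit_mib: int = 0  # headroom for growth between checks

    def __post_init__(self) -> None:
        if self.spike_limit_mib == 0:
            self.spike_limit_mib = self.limit_mib // 5  # assumed default: 20% of limit_mib
        # The Collector rejects configs where the spike limit is not below the limit
        if not self.spike_limit_mib < self.limit_mib:
            raise ValueError("spike_limit_mib must be smaller than limit_mib")

    @property
    def soft_limit_mib(self) -> int:
        return self.limit_mib - self.spike_limit_mib

    def decision(self, heap_mib: int) -> str:
        """What the processor does at a given heap size (simplified)."""
        if heap_mib >= self.limit_mib:
            return "refuse data + force GC"
        if heap_mib >= self.soft_limit_mib:
            return "refuse data"
        return "accept"


limiter = MemoryLimiter(limit_mib=400, spike_limit_mib=100)
for heap in (250, 320, 410):
    print(heap, "MiB ->", limiter.decision(heap))
# 250 MiB -> accept
# 320 MiB -> refuse data
# 410 MiB -> refuse data + force GC
```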

Security: Redacting Sensitive Data

OpenTelemetry traces can leak secrets. A span attribute like credit_card_number or user.email travels from your SDK through the Collector to Jaeger, and possibly into your observability vendor's cloud, where it is retained for as long as the backend keeps the trace.

Prevention: Filter sensitive attributes at the Collector level before they leave your network:

processors:
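  attributes/redact:
    actions:
      - key: user.email
        action: delete
      - key: user.phone
        action: delete
      - pattern: ^credit_card\..*
        action: delete
      - key: password
        action: delete

This configuration strips user.email, user.phone, and password (exact key matches), plus every attribute whose key matches the regular expression ^credit_card\..*, from each span before it reaches the exporter. A key entry never expands wildcards, so anything pattern-like must go under pattern. Naming the processor attributes/redact keeps it separate from the attributes processor defined earlier; add it to each pipeline's processors list. The sensitive data never leaves your boundary. To mask sensitive values rather than whole keys, for example card numbers inside free-text attributes, add the contrib redaction processor, which can block values that match regular expressions.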
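The difference between exact-key deletes and pattern deletes is easy to get wrong. Below is a stdlib sketch of the same rules applied to one span's attributes, plus value-based masking of strings that look like card numbers. The rule set and the card-number regex are illustrative assumptions, not the processor's implementation. In production, keep this logic in the Collector so that no individual service can forget it.

```python
import re
from typing import Dict, Iterable, Pattern

# Hypothetical rule set mirroring the processor config above
DELETE_KEYS = {"user.email", "user.phone", "password"}
DELETE_KEY_PATTERNS = [re.compile(r"^credit_card\..*")]
# Value masking: runs of 13-19 digits, optionally separated by spaces or dashes
CARD_NUMBER = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def redact(attributes: Dict[str, object],
           delete_keys: Iterable[str] = DELETE_KEYS,
           key_patterns: Iterable[Pattern[str]] = DELETE_KEY_PATTERNS) -> Dict[str, object]:
    """Drop sensitive keys (exact or regex match) and mask card-like values."""
    delete_keys = set(delete_keys)
    key_patterns = list(key_patterns)
    clean: Dict[str, object] = {}
    for key, value in attributes.items():
        if key in delete_keys or any(p.match(key) for p in key_patterns):
            continue  # exact-key or pattern match: delete the attribute
        if isinstance(value, str):
            value = CARD_NUMBER.sub("****", value)
        clean[key] = value
    return clean


span_attrs = {
    "http.method": "POST",
    "user.email": "jane@example.com",
    "credit_card.last4": "4242",
    "note": "paid with 4111 1111 1111 1111",
}
print(redact(span_attrs))
# {'http.method': 'POST', 'note': 'paid with ****'}
```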

For full defense in depth, review the OTel Security documentation.

Further Reading

If you have made it this far, you now have a working OpenTelemetry pipeline — instrumentation, a Collector, and at least one observability backend. Here is where to go next:

  • Kubernetes Security Best Practices 2026 — Hardening your cluster before instrumenting your workloads. Security is not optional when observability is production.
  • Error Budgets: Stop Wasting Your SRE Team's Time — Budget for reliability, not just velocity. Your error budget is a policy decision, not a suggestion.
  • OpenTelemetry Tracing: Instrument Your First Application (forthcoming) — Distributed tracing with manual context propagation. A complete guide to instrumenting every service.

Conclusion

OpenTelemetry is not a tool — it is a standard. Adopting it means instrumenting once with the SDK, processing through the Collector, and exporting to any backend without rewriting code. You have now walked through a complete setup: instrumentation with Python and Flask, Collector configuration in YAML, Jaeger for trace visualization, Prometheus for metrics, Kubernetes for production deployment, and operational patterns from DaemonSet to Gateway.

The most important things to remember:

  1. Instrument once, export anywhere. The OTel SDK decouples your application from every backend. Changing exporters in the Collector config is not a code change.
  2. The Collector is your control plane. Receivers, processors, and exporters form a pipeline. Data flows one way — from your code through the SDK to the Collector, then to Jaeger and Prometheus. You control the flow.
  3. Sampling saves budget. Not every span is worth storing. Decide what to keep at the head (probabilistic, in the SDK or the Collector) or at the tail (policy-based, in the Collector).

The observability landscape in 2026 is converging on OpenTelemetry. Every major vendor now speaks OTLP. The standard is the protocol — adopt it before it becomes a migration project.

DevToCash · Author

Senior DevOps/SRE Engineer · 10+ years · Professional Trader (IDX, Crypto, US Equities)

I write about real infrastructure patterns and trading strategies I use in production and in live markets. No courses, no affiliate hype — just documentation of what actually works.
