OpenTelemetry Distributed Tracing: From Zero to Production

Introduction

A user clicks "Checkout" in your e-commerce app. The request hits your API gateway, calls the cart service, which calls inventory, which calls payment. If payment is slow, where's the bottleneck? Without distributed tracing, you're guessing.

OpenTelemetry (OTel) is a CNCF project that standardizes how traces, metrics, and logs are collected. It is vendor-neutral: instrument once, then export to any compatible backend.

This guide gets you from zero to production tracing with OpenTelemetry: collector setup, Node.js and Python instrumentation, and exporting to Jaeger and Grafana Tempo.

OpenTelemetry Architecture

A tracing pipeline built on OpenTelemetry has three parts:

Instrumentation libraries — Auto-generate spans in your application code
OTel Collector — Receives, processes, and exports telemetry data
Backend — Jaeger, Grafana Tempo, Honeycomb, or any OTLP-compatible system

The flow:

App (OTel SDK) → OTLP → Collector → Jaeger/Tempo
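What ties the spans from each service into one trace is the trace context that travels with every request. By default the OTel SDKs carry it in the W3C `traceparent` HTTP header. The sketch below uses only the Python standard library to show the shape of that header; the helper names (`new_traceparent`, `parse_traceparent`, `child_traceparent`) are illustrative and not part of any OTel API.

```python
import re
import secrets

# W3C Trace Context: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with a random trace ID and span ID (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split an incoming header into its fields, rejecting malformed values."""
    match = _TRACEPARENT.match(header.strip().lower())
    if not match:
        raise ValueError(f"invalid traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Header a service sends downstream: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming)
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

Because every hop reuses the trace ID and only replaces the span ID, the backend can reassemble gateway, cart, inventory, and payment spans into a single tree.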

Deploy the OTel Collector

# docker-compose.yaml
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Metrics

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "14250:14250"  # gRPC

Collector configuration:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

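exporters:
  # Jaeger accepts OTLP natively, so the generic otlp exporter is used here.
  # The dedicated jaeger exporter is no longer shipped in recent collector releases.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # The debug exporter supersedes the older logging exporter.
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, debug]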
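The `batch` processor in this config flushes whenever either limit is hit: the buffer reaches `send_batch_size` spans, or `timeout` has passed since the first buffered span. The following is a toy model of that size-or-timeout rule in plain Python, not the collector's implementation; `SpanBatcher` and its methods are hypothetical names, and time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Callable, List, Optional


class SpanBatcher:
    """Toy model of size-or-timeout batching (not the collector's actual code)."""

    def __init__(self, export: Callable[[List[dict]], None],
                 send_batch_size: int = 1024, timeout_s: float = 1.0):
        self.export = export
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self._buffer: List[dict] = []
        self._first_item_at: Optional[float] = None  # arrival time of oldest span

    def add(self, span: dict, now: float) -> None:
        if not self._buffer:
            self._first_item_at = now
        self._buffer.append(span)
        if len(self._buffer) >= self.send_batch_size:
            self._flush()  # size trigger

    def tick(self, now: float) -> None:
        """Call periodically; flushes a partial batch once the timeout has passed."""
        if self._buffer and now - self._first_item_at >= self.timeout_s:
            self._flush()  # timeout trigger

    def _flush(self) -> None:
        batch, self._buffer = self._buffer, []
        self._first_item_at = None
        self.export(batch)
```

A larger `send_batch_size` means fewer, bigger export calls; a shorter `timeout` means spans show up in the backend sooner when traffic is low.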

Instrumenting Applications

Node.js

Install the OTel packages:

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc

Create a tracing initialization file:

// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4317'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();

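// Graceful shutdown: flush pending spans, then exit even if the flush fails
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((err) => console.error('Error terminating tracing', err))
    .finally(() => process.exit(0));
});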

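Load it before your application so the instrumentation can patch modules before they are first required. Setting OTEL_SERVICE_NAME keeps spans from being reported under a generic unknown_service name:

OTEL_SERVICE_NAME=checkout-service node --require ./tracing.js server.js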

Auto-instrumentation covers Express, HTTP, gRPC, Redis, PostgreSQL, MongoDB, and more, with no changes to application code.

Python

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

Configure via environment variables:

export OTEL_SERVICE_NAME=cart-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_TRACES_EXPORTER=otlp
opentelemetry-instrument python app.py
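A common stumbling block with these variables is mixing up ports and paths: OTLP over gRPC conventionally uses port 4317 with no URL path, while OTLP over HTTP uses 4318 and appends `/v1/traces` to the generic endpoint. The sketch below models the resolution rules from the OTLP exporter specification in plain Python. `resolve_traces_endpoint` is a hypothetical helper, and the default protocol is assumed to be `grpc` to match the ports used in this guide; actual defaults differ between SDKs.

```python
from typing import Mapping

# Conventional default endpoints per protocol.
DEFAULTS = {
    "grpc": "http://localhost:4317",
    "http/protobuf": "http://localhost:4318",
}


def resolve_traces_endpoint(env: Mapping[str, str]) -> str:
    """Return the URL a trace exporter would target for the given environment."""
    protocol = env.get(
        "OTEL_EXPORTER_OTLP_TRACES_PROTOCOL",
        env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "grpc"),  # assumed default
    )
    if protocol not in DEFAULTS:
        raise ValueError(f"unsupported protocol in this sketch: {protocol!r}")

    # A signal-specific endpoint wins and is used exactly as written.
    specific = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if specific:
        return specific

    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULTS[protocol])
    if protocol == "grpc":
        return base
    # OTLP/HTTP: the generic endpoint is a base URL; the signal path is appended.
    return base.rstrip("/") + "/v1/traces"
```

With the variables exported above, this resolves to `http://localhost:4317`, the collector's gRPC receiver.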

Manual Instrumentation

For custom business logic, create spans manually:

const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('checkout-service');

async function processCheckout(cartId) {
  return tracer.startActiveSpan('process-checkout', async (span) => {
    span.setAttribute('cart.id', cartId);
    try {
      const result = await chargeCustomer(cartId);
      span.setAttribute('checkout.success', true);
      return result;
    } catch (err) {
      span.setAttribute('checkout.error', err.message);
      span.recordException(err);
      // Mark the span as failed so error-based sampling and backend filters see it
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

Exporting to Backends

Jaeger (Local/Dev)

Already configured in our collector above. Access at http://localhost:16686.

Grafana Tempo (Production)

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

In Grafana, add Tempo as a data source pointing to http://tempo:3200. Once exemplars are configured (see Prometheus Exemplars below), you can drill from a metric in Prometheus into the matching trace in Tempo with one click.

Sampling in Production

Tracing every request in production is expensive. The quickest fix is probabilistic sampling in the collector:

processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Trace 10% of requests

Or use tail-based sampling to keep only traces that contain errors or exceed a latency threshold:

processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: latency
        type: latency
        latency: {threshold_ms: 1000}

Sampling Strategies for Production

Tracing 100% of requests in production is expensive and unnecessary. Smart sampling reduces costs while preserving signal.

Probabilistic (Head-Based) Sampling

Decides whether to keep a trace up front, from the trace ID alone, without looking at what happens during the request. A fixed percentage is kept, which makes it easy to implement:

# In OTel Collector config
processors:
  probabilistic_sampler:
    sampling_percentage: 5  # Sample 5% of all requests

Best for high-volume services where errors are uniformly distributed. Set higher percentages for critical services (payment, auth) and lower for non-critical (logging, analytics).
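The key property of this kind of sampling is that the decision is a pure function of the trace ID, so every collector reaches the same verdict for every span of a trace without coordination. Here is a minimal sketch of the idea in plain Python. The real processor hashes the trace ID before comparing; this sketch reads the low bits directly, and `keep_trace` is a hypothetical name.

```python
import random


def keep_trace(trace_id_hex: str, sampling_percentage: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    if not 0.0 <= sampling_percentage <= 100.0:
        raise ValueError("sampling_percentage must be between 0 and 100")
    # Treat the low 56 bits of the trace ID as a uniformly distributed number.
    randomness = int(trace_id_hex, 16) & ((1 << 56) - 1)
    threshold = int((sampling_percentage / 100.0) * (1 << 56))
    return randomness < threshold


# Usage: sample 5% of 10,000 randomly generated trace IDs.
rng = random.Random(42)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(keep_trace(t, 5) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

The kept count lands near 5% rather than exactly on it, which is expected: the percentage is a probability per trace, not a quota.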

Tail-Based Sampling

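Buffers the spans of every trace in collector memory for a decision window (decision_wait), then decides which traces to retain based on their properties. All spans of a trace must reach the same collector instance, otherwise the decision is made on an incomplete trace: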

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: high-latency
        type: latency
        latency: {threshold_ms: 2000}
      # A single attribute match needs no `and` wrapper
      # (`and` expects its sub-policies under and_sub_policy)
      - name: critical-endpoints
        type: string_attribute
        string_attribute: {key: http.route, values: ["/api/checkout", "/api/payment"]}

This keeps every trace that contains an error, every trace slower than 2 seconds, and every trace through the checkout and payment routes. Everything else is dropped.
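The decision logic behind those three policies can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not the processor's code: the `Span` dataclass, `should_keep`, and the constants are hypothetical, and a trace is kept as soon as any one policy matches.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str
    start_ms: int
    end_ms: int
    status: str = "UNSET"              # "UNSET", "OK", or "ERROR"
    http_route: Optional[str] = None


CRITICAL_ROUTES = {"/api/checkout", "/api/payment"}
LATENCY_THRESHOLD_MS = 2000


def should_keep(spans: List[Span]) -> bool:
    """Evaluate one buffered trace; any matching policy keeps the whole trace."""
    if not spans:
        return False
    # errors policy: at least one span finished with ERROR status
    has_error = any(s.status == "ERROR" for s in spans)
    # latency policy: earliest start to latest end across all spans
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    is_slow = duration_ms > LATENCY_THRESHOLD_MS
    # critical-endpoints policy: any span carries a listed route
    is_critical = any(s.http_route in CRITICAL_ROUTES for s in spans)
    return has_error or is_slow or is_critical
```

The sketch also shows why the buffering matters: none of these checks can be answered until the last span of the trace has arrived.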

Sampling Comparison

Strategy	Cost	Error Capture	Slow Trace Capture	Complexity
100% sampling	Highest	Perfect	Perfect	None
Probabilistic (5%)	Low	Misses 95% of errors	Misses 95% of slow	Low
Tail-based (errors+slow)	Medium	All errors	All slow traces	Medium
Rate limiting (10 traces/sec)	Low	Depends	Depends	Low

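Recommendation: Use probabilistic sampling at 5-10% as the baseline, and switch to tail-based sampling for services where you need error and latency guarantees. Pick one per service rather than stacking them: a tail sampler can only keep traces that an earlier probabilistic stage has not already dropped.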
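The "misses 95% of errors" figure in the table applies to each individual error trace. A recurring error is still likely to be captured at least once, and the arithmetic is simple: if each trace is sampled independently with probability p, the chance that at least one of n failing requests is sampled is 1 - (1 - p)^n. A small sketch (`chance_of_capture` is a hypothetical helper, and independence between requests is an assumption):

```python
def chance_of_capture(sampling_rate: float, occurrences: int) -> float:
    """P(at least one of `occurrences` independent failing traces is sampled)."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must not be negative")
    return 1.0 - (1.0 - sampling_rate) ** occurrences


for n in (1, 10, 50, 100):
    pct = chance_of_capture(0.05, n)
    print(f"{n:>3} failing requests at 5% sampling: {pct:.1%} chance of a sampled example")
```

Probabilistic sampling is therefore a reasonable fit for frequent errors and a poor one for rare, one-off failures, which is where tail-based sampling earns its extra cost.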

Manual Instrumentation: Beyond Auto-Instrumentation

Auto-instrumentation covers HTTP, databases, and queues. For business logic, add manual spans.

Node.js: Creating Custom Spans

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('payment-service');

async function processRefund(orderId, amount) {
  // startActiveSpan parents the new span to whatever span is currently active,
  // so the parent does not need to be looked up by hand.
  // The callback must be async because it awaits the refund call.
  return tracer.startActiveSpan('process-refund', async (span) => {
    span.setAttribute('order.id', orderId);
    span.setAttribute('refund.amount', amount);

    try {
      const result = await refundProvider(amount);
      span.setAttribute('refund.success', true);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.setAttribute('refund.error', err.message);
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

Python: Adding Attributes and Events

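from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout-service")


def apply_coupon(coupon_code, cart_total, discount_amount, has_expired):
    # The span ends automatically when the with block exits, including on return
    with tracer.start_as_current_span("apply-coupon") as span:
        span.set_attribute("coupon.code", coupon_code)
        span.set_attribute("cart.total", cart_total)

        if has_expired:
            span.set_status(Status(StatusCode.ERROR, "coupon expired"))
            return {"error": "Coupon expired"}

        # Events are timestamped points within the span
        span.add_event("coupon-validated", {"discount": discount_amount})
        return {"discount": discount_amount}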

Connecting Traces to Logs and Metrics

The real power of OpenTelemetry comes from trace-log-metric correlation. When a Prometheus alert fires, you should jump directly to the relevant traces.

Log Correlation

Inject trace context into your logs:

// Node.js: Add trace ID to all log entries
const { trace, context } = require('@opentelemetry/api');

function createLogger(service) {
  return {
    info: (msg, attrs = {}) => {
      const span = trace.getSpan(context.active());
      console.log(JSON.stringify({
        service,
        level: 'info',
        message: msg,
        trace_id: span?.spanContext().traceId,
        span_id: span?.spanContext().spanId,
        ...attrs
      }));
    }
  };
}
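The same pattern works in Python services with the standard `logging` module: a filter stamps every record with the active trace and span IDs, and a formatter emits them as JSON. The sketch below stays inside the standard library, so a `contextvars` variable named `current_trace` stands in for the active span context. That variable, `TraceContextFilter`, `JsonFormatter`, and `build_logger` are all hypothetical names; in a real service the IDs would come from the OTel API's current span instead.

```python
import contextvars
import io
import json
import logging

# Stand-in for the active span context (an assumption of this sketch).
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = current_trace.get()
        record.trace_id = ctx["trace_id"] if ctx else None
        record.span_id = ctx["span_id"] if ctx else None
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


def build_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


# Usage: one log line inside a "span", written to an in-memory stream.
stream = io.StringIO()
log = build_logger("cart-service", stream)
token = current_trace.set({"trace_id": "0af7651916cd43dd8448eb211c80319c",
                           "span_id": "b7ad6b7169203331"})
log.info("cart loaded")
current_trace.reset(token)
print(stream.getvalue(), end="")
```

Putting the IDs in a filter rather than in each log call means existing `log.info(...)` calls pick up correlation without being edited.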

Prometheus Exemplars

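Exemplars link metrics to traces. Prometheus only stores them when it is started with --enable-feature=exemplar-storage; the size of the exemplar buffer is then set in its config: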

# prometheus.yml
storage:
  exemplars:
    max_exemplars: 100000

Then have the OTel Collector expose metrics in OpenMetrics format, which is the format that carries exemplars (enable_open_metrics: true):

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    add_metric_suffixes: false
    enable_open_metrics: true
    resource_to_telemetry_conversion:
      enabled: true

In Grafana, click a data point on a latency graph to open the corresponding trace in Tempo. This is the observability workflow every SRE team should have.
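It helps to see what an exemplar actually is on the wire. In the OpenMetrics text format it is a suffix on a sample line: a `#`, a label set (here the trace ID), the observed value, and an optional timestamp. The sketch below renders one histogram bucket line that way; `histogram_bucket_line` is a hypothetical helper, the metric name is made up for illustration, and label values are not escaped.

```python
from typing import Mapping, Optional


def _render_labels(labels: Mapping[str, str]) -> str:
    """Render {key="value",...}; escaping is omitted in this sketch."""
    return "{" + ",".join(f'{key}="{value}"' for key, value in labels.items()) + "}"


def histogram_bucket_line(name: str, labels: Mapping[str, str], count: int,
                          exemplar_value: float, trace_id: str,
                          timestamp: Optional[float] = None) -> str:
    """One OpenMetrics bucket sample with a trace exemplar attached."""
    exemplar_labels = _render_labels({"trace_id": trace_id})
    line = f"{name}_bucket{_render_labels(labels)} {count}"
    line += f" # {exemplar_labels} {exemplar_value}"
    if timestamp is not None:
        line += f" {timestamp}"
    return line


print(histogram_bucket_line(
    "http_request_duration_seconds", {"le": "0.5"}, 12,
    exemplar_value=0.43, trace_id="0af7651916cd43dd8448eb211c80319c",
))
```

The trace ID in that suffix is what Grafana uses to build the link from a point on the graph to the trace in Tempo.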

Deploying the OTel Collector in Production

A single collector instance is fine for development, but production requires a robust deployment topology.

Collector as a Sidecar

Run a collector alongside each application pod in Kubernetes:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          image: myapp:latest
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel-config.yaml"]
          ports:
            - containerPort: 4317
            - containerPort: 4318
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi

Each sidecar collector handles one application's telemetry, providing isolation and independent scaling.

Collector as a Gateway

For centralized processing, deploy a standalone collector deployment:

# collector-gateway.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel-gateway.yaml"]
          ports:
            - containerPort: 4317
            - containerPort: 4318
          resources:
            requests:
              cpu: 1
              memory: 2Gi

Sidecar collectors forward to the gateway, which handles batching, filtering, and export. This topology scales to thousands of pods.
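One detail matters if the gateway runs tail-based sampling across several replicas: every span of a trace has to land on the same replica, or each replica decides on a fragment. The usual approach is to route by trace ID (the contrib distribution ships a load-balancing exporter for this purpose). The sketch below shows the core idea in plain Python under simplifying assumptions: `pick_gateway` is a hypothetical helper, the replica names are made up, and a plain modulo is used where a production router would use consistent hashing to limit reshuffling when replicas change.

```python
import hashlib
from typing import List


def pick_gateway(trace_id_hex: str, gateways: List[str]) -> str:
    """Map a trace ID to one gateway replica, the same one every time."""
    if not gateways:
        raise ValueError("at least one gateway is required")
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return gateways[index]


replicas = ["otel-gateway-0", "otel-gateway-1", "otel-gateway-2"]
print(pick_gateway("0af7651916cd43dd8448eb211c80319c", replicas))
```

Because the choice depends only on the trace ID, sidecars on different nodes agree on the destination without talking to each other.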

Common Production Pitfalls

Pitfall	Symptom	Fix
No memory limits	Collector OOM during traffic spikes	Add memory_limiter processor
Missing attributes	Cannot filter traces by service	Enforce attribute requirements in collector
Too many spans	$10k+ monthly vendor bill	Implement tail-based sampling
Queues not configured	Trace data loss during network issues	Enable the exporter's sending_queue and retry_on_failure with `max_elapsed_time: 60s`
No health checks	Silent collector failures	Enable the health_check extension and probe it; use pprof for debugging

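# Production collector config with safeguards
processors:
  memory_limiter:          # must run first in every pipeline
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 1s
    send_batch_size: 8192

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
    # Queueing and retry are exporter settings; the old queued_retry
    # processor no longer exists in current collector releases
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 5000
    retry_on_failure:
      enabled: true
      max_elapsed_time: 60s

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777

service:
  extensions: [health_check, pprof]   # extensions are inactive unless listed here
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]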
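The `max_elapsed_time: 60s` setting above bounds how long a failed export is retried before the data is dropped. Retries are spaced with exponential backoff, so the budget covers fewer attempts than one might expect. The sketch below computes such a schedule in plain Python. `retry_schedule` is a hypothetical helper, the interval and multiplier values are illustrative rather than the collector's defaults, and the random jitter a real implementation adds is left out.

```python
from typing import List


def retry_schedule(initial_interval_s: float, max_interval_s: float,
                   max_elapsed_time_s: float, multiplier: float = 2.0) -> List[float]:
    """Waits between retries until the time budget is used up (no jitter)."""
    if initial_interval_s <= 0 or max_interval_s <= 0:
        raise ValueError("intervals must be positive")
    if multiplier < 1.0:
        raise ValueError("multiplier must be at least 1")
    waits: List[float] = []
    elapsed = 0.0
    interval = initial_interval_s
    # Stop as soon as the next wait would exceed the overall budget.
    while elapsed + interval <= max_elapsed_time_s:
        waits.append(interval)
        elapsed += interval
        interval = min(interval * multiplier, max_interval_s)
    return waits


waits = retry_schedule(initial_interval_s=5, max_interval_s=30, max_elapsed_time_s=60)
print(f"{len(waits)} retries, waits in seconds: {waits}")
```

If the backend can be unreachable for longer than the retry budget, the queue size and the budget need to grow together, since spans waiting for a retry occupy queue slots in the meantime.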

Choosing an Observability Backend

Backend	Strengths	Best For
Jaeger	Simple setup, UI focused on traces	Small teams, dev/staging
Grafana Tempo	Scales to very large trace volumes on object storage, integrates with Grafana	Production, multi-service
Honeycomb	High cardinality, SLO-based alerting	Teams doing production debugging
Datadog	Full APM with traces, metrics, logs	Enterprise, compliance-heavy
SigNoz	Open-source full stack, built on OpenTelemetry	Cost-conscious teams, self-hosted

All accept OTLP, so switching backends requires zero application code changes. This is the OpenTelemetry advantage.

Conclusion

OpenTelemetry eliminates vendor lock-in for observability. Instrument once, export anywhere—Jaeger today, Honeycomb tomorrow, no code changes.

Start with auto-instrumentation and the OTel Collector as a sidecar. It gives you 80% of the value (HTTP, database, queue tracing) with nearly zero code. Add manual instrumentation for critical business flows.

The real power kicks in when you connect traces with logs and metrics in Grafana. One click from a latency spike to the exact slow database query causing it.