Introduction
Distributed tracing answers the hardest question in microservices: "Why is this request slow?" A single user request can touch 12 services. Without tracing, you debug by grep-ing logs across 12 different dashboards and guessing.
OpenTelemetry is a CNCF project and the de facto open standard for distributed tracing. It gives you end-to-end request visibility across services, languages, and infrastructure, with or without code changes.
This guide covers operator-based auto-instrumentation for Java, Python, and Node.js, SDK setup for Go, and OTLP export to Grafana Tempo and Jaeger.
Auto-Instrumentation: Traces Without Code Changes
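The OpenTelemetry Operator injects instrumentation into your pods without modifying a single line of application code. Install it with kubectl (the operator's admission webhooks require cert-manager to be present in the cluster first):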
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
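Auto-instrumentation also requires an Instrumentation resource in the namespace, which configures the injected agent (for example, the export endpoint). With that in place, annotate the namespace or the pod template of your deployment, using the annotation that matches the workload's runtime: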
apiVersion: v1
kind: Pod
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"
    instrumentation.opentelemetry.io/inject-python: "true"
    instrumentation.opentelemetry.io/inject-nodejs: "true"
The operator injects an init container with the OpenTelemetry agent. When the application starts, the agent attaches to the runtime and instruments HTTP, gRPC, database calls, and message queues — automatically.
For Go applications (which compile instrumentation into the binary), use the SDK directly:
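import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracer creates an OTLP/gRPC exporter and registers a global tracer provider.
func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("tempo:4317"),
		otlptracegrpc.WithInsecure(), // plaintext gRPC; use TLS outside a trusted network
	)
	if err != nil {
		return nil, err
	}
	// Batch spans before export to reduce network overhead.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	otel.SetTracerProvider(tp)
	return tp, nil // caller should defer tp.Shutdown(ctx) to flush pending spans
}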
Export to Tempo and Jaeger
OpenTelemetry uses OTLP as the wire protocol. Both Tempo and Jaeger accept OTLP:
# OpenTelemetry Collector pipeline
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
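Deploy the Collector as a DaemonSet, one per node, so that spans are handed off within the node and batching happens outside the application process. Applications send traces to the Collector on their own node, typically via the node IP on port 4317 (a pod's localhost:4317 only reaches the Collector when it runs as a sidecar or shares the host network), and the Collector batches and exports to Tempo.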
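As a rough illustration of how a Go service might find its node-local Collector, the sketch below builds the OTLP endpoint from an environment variable. The variable name NODE_IP is an assumption, not something the operator or SDK defines: it stands for a value you would populate yourself from the pod's status.hostIP field via the Kubernetes downward API. The function name and the localhost fallback are likewise illustrative.

```go
package main

import "net"

// collectorEndpoint returns the host:port of the node-local Collector.
// getenv is injected (pass os.Getenv in a real service) so the lookup
// can be exercised without touching the process environment.
func collectorEndpoint(getenv func(string) string) string {
	// NODE_IP is a hypothetical variable name, assumed to be filled in
	// from status.hostIP through the downward API.
	if ip := getenv("NODE_IP"); ip != "" {
		// JoinHostPort adds brackets around IPv6 addresses.
		return net.JoinHostPort(ip, "4317")
	}
	// Fallback for local development or a sidecar Collector.
	return "localhost:4317"
}
```

The result can be passed to otlptracegrpc.WithEndpoint in place of a hard-coded address, for example collectorEndpoint(os.Getenv).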
Sampling: Don't Store Every Trace
At scale, tracing 100% of requests is cost-prohibitive. Use intelligent sampling:
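- Head sampling (probabilistic): Decide at trace start. 10% sampling means 10% of requests are traced. Simple, but the decision is made before the outcome is known, so most rare slow or failed requests are dropped.
- Tail sampling (Collector): Decide after the trace completes. Keep 100% of traces that contain an error or exceed a latency threshold (1 second in the config below), plus a small share of everything else. This is the SRE-relevant approach: you capture every incident trace. The tail_sampling processor ships in the Collector contrib distribution, and all spans of a trace must reach the same Collector instance for the decision to be correct, so it usually runs in a gateway Collector tier rather than on each node's agent.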
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: latency
        type: latency
        latency: {threshold_ms: 1000}
      - name: default
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
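To make the difference between the two strategies concrete, here is a minimal Go sketch of both decisions. It is not the Collector's or the SDK's implementation: the type and function names are invented for illustration, and the head sampler uses a trace-ID-ratio comparison as one common way to get a decision that is consistent across services. The tail sampler applies the same three rules as the config above, combined with OR.

```go
package main

import "encoding/binary"

// TraceID is a 16-byte trace identifier.
type TraceID [16]byte

// headSample decides at trace start, using only the trace ID.
// Every service that sees the same ID reaches the same decision.
func headSample(id TraceID, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Treat the low 8 bytes as a 63-bit number and keep the trace
	// when it falls below ratio * 2^63.
	bound := uint64(ratio * float64(uint64(1)<<63))
	x := binary.BigEndian.Uint64(id[8:]) >> 1
	return x < bound
}

// TraceSummary is what a tail sampler knows once a trace has completed.
// (Hypothetical type for this sketch.)
type TraceSummary struct {
	ID        TraceID
	HasError  bool
	LatencyMs int64
}

// tailSample keeps every error, every slow trace, and a baseline
// share of the remaining healthy traffic.
func tailSample(t TraceSummary, thresholdMs int64, baselineRatio float64) bool {
	if t.HasError {
		return true
	}
	if t.LatencyMs > thresholdMs {
		return true
	}
	return headSample(t.ID, baselineRatio)
}
```

With a threshold of 1000 ms and a baseline of 0.1, a failed 5 ms request is always kept by tailSample, while headSample alone would keep it only about one time in ten.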
Tracing in Production: The SRE Checklist
- Export to Tempo (cost-effective object storage backend) or Jaeger (Elasticsearch/Cassandra)
- Enable tail sampling — capture every error and slow trace, discard healthy fast traffic
- Correlate traces with logs via trace ID injection in structured logging
- Use span attributes for business context: user.id, order.id, tenant.id
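The log-correlation item in the checklist above can be sketched with Go's standard log/slog package. This example assumes the trace ID arrives in a W3C traceparent header with the version 00 layout; in a service that already uses the OpenTelemetry SDK you would more likely read it from the active span's context. The function names are hypothetical, and the parser is deliberately simplified.

```go
package main

import (
	"encoding/hex"
	"log/slog"
	"strings"
)

// parseTraceparent extracts the trace and span IDs from a header of the
// form "00-<32 hex>-<16 hex>-<2 hex>". Simplified: it checks field
// lengths and hex encoding only.
func parseTraceparent(h string) (traceID, spanID string, ok bool) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[0]) != 2 || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return "", "", false
	}
	for _, p := range parts {
		if _, err := hex.DecodeString(p); err != nil {
			return "", "", false
		}
	}
	return parts[1], parts[2], true
}

// logLine renders one JSON log record for msg and, when the header is
// valid, tags it with trace_id and span_id so the log line can be
// joined to its trace in the backend.
func logLine(traceparent, msg string) string {
	var buf strings.Builder
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	if traceID, spanID, ok := parseTraceparent(traceparent); ok {
		logger = logger.With("trace_id", traceID, "span_id", spanID)
	}
	logger.Info(msg)
	return buf.String()
}
```

In a real handler you would write to os.Stdout rather than a string buffer; returning the line here just keeps the sketch easy to inspect.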
For the broader observability picture — combining traces with metrics and logs — our OpenTelemetry complete setup guide covers the full three-pillar implementation, including metrics export to Prometheus and logs via the OTel filelog receiver.
For teams adopting eBPF-based observability alongside OpenTelemetry, our eBPF observability for SRE guide shows how kernel-level telemetry complements application-level tracing.
OpenTelemetry tracing turns every request into a story — from ingress to database and back. When the next incident hits, you will not be grep-ing logs. You will be following a trace.