Introduction
A user clicks "Checkout" in your e-commerce app. The request hits your API gateway, calls the cart service, which calls inventory, which calls payment. If payment is slow, where's the bottleneck? Without distributed tracing, you're guessing.
OpenTelemetry (OTel) is a CNCF project that has become the de facto standard for collecting traces, metrics, and logs. It is vendor-neutral: instrument once, export anywhere.
This guide gets you from zero to production tracing with OpenTelemetry: collector setup, Node.js and Python instrumentation, and exporting to Jaeger and Grafana Tempo.
OpenTelemetry Architecture
A tracing pipeline built on OpenTelemetry has three parts:
- Instrumentation libraries — Auto-generate spans in your application code
- OTel Collector — Receives, processes, and exports telemetry data
- Backend — Jaeger, Grafana Tempo, Honeycomb, or any OTLP-compatible system
The flow:
App (OTel SDK) → OTLP → Collector → Jaeger/Tempo
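For one trace to follow the checkout request through the gateway, cart, inventory, and payment services, each service has to hand its trace context to the next hop. OTel SDKs do this by default with the W3C Trace Context `traceparent` header. The following is a minimal sketch of that header's layout using only the Python standard library; `TraceContext`, `parse_traceparent`, and `child_traceparent` are illustrative names, not part of any OTel package, and the SDK's own propagators handle this for you in practice.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Layout: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the context carried by a traceparent header, or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C specification.
    if int(trace_id, 16) == 0 or int(span_id, 16) == 0:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

Because the trace ID is copied unchanged onto every outgoing request, the backend can later stitch spans from all four services into a single trace.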
Deploy the OTel Collector
# docker-compose.yaml
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
      - "8888:8888" # Metrics
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "14250:14250" # gRPC
Collector configuration:
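# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
exporters:
  # The dedicated jaeger exporter has been removed from recent collector
  # releases. Jaeger accepts OTLP natively, so send OTLP to its gRPC port.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # The debug exporter replaces the deprecated logging exporter.
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, debug]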
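The `batch` processor in this config buffers spans and hands them to the exporters when either `send_batch_size` spans have accumulated or `timeout` has passed since the previous flush. The toy model below illustrates that size-or-timeout rule in Python. It is a simplified sketch, not the collector's Go implementation; `BatchBuffer` and its methods are made-up names, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Callable, List


class BatchBuffer:
    """Toy model of size-or-timeout batching; not the collector's real code."""

    def __init__(self, send_batch_size: int, timeout_s: float,
                 export: Callable[[List[str]], None]) -> None:
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.export = export
        self.items: List[str] = []
        self.last_flush = 0.0

    def add(self, span: str, now: float) -> None:
        """Buffer one span and flush immediately once the batch is full."""
        self.items.append(span)
        if len(self.items) >= self.send_batch_size:
            self.flush(now)

    def tick(self, now: float) -> None:
        """Called periodically; flush a partial batch once the timeout passes."""
        if now - self.last_flush < self.timeout_s:
            return
        if self.items:
            self.flush(now)
        else:
            self.last_flush = now  # nothing to send, restart the timer

    def flush(self, now: float) -> None:
        # Hand the current batch to the exporter and start a new one.
        batch, self.items = self.items, []
        self.last_flush = now
        self.export(batch)
```

A short timeout such as `1s` keeps traces close to real time in the UI; a larger `send_batch_size` trades a little latency for fewer export calls.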
Instrumenting Applications
Node.js
Install the OTel packages:
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-grpc
Create a tracing initialization file:
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4317'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});
sdk.start();
// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown().then(() => console.log('Tracing terminated'));
});
Load it before your application:
node --require ./tracing.js server.js
Auto-instrumentation covers Express, HTTP, gRPC, Redis, PostgreSQL, MongoDB, and many other libraries without any changes to your application code.
Python
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
Configure via environment variables:
export OTEL_SERVICE_NAME=cart-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_TRACES_EXPORTER=otlp
opentelemetry-instrument python app.py
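Because the exporter target comes from the environment, pointing a service at a different collector is a deployment change rather than a code change. The sketch below shows the lookup order the OTel specification defines for the trace exporter endpoint: the signal-specific variable first, then the generic one, then the gRPC default. `resolve_traces_endpoint` is a hypothetical helper written for illustration; the SDK performs this lookup internally.

```python
import os
from typing import Mapping, Optional

DEFAULT_OTLP_GRPC_ENDPOINT = "http://localhost:4317"


def resolve_traces_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the OTLP endpoint for traces using the spec's precedence."""
    env = os.environ if env is None else env
    for name in ("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
                 "OTEL_EXPORTER_OTLP_ENDPOINT"):
        value = env.get(name, "").strip()
        if value:
            return value
    # Neither variable is set: fall back to the local gRPC default.
    return DEFAULT_OTLP_GRPC_ENDPOINT
```

With only `OTEL_EXPORTER_OTLP_ENDPOINT` exported, as in the commands above, traces go to that address.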
Manual Instrumentation
For custom business logic, create spans manually:
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('checkout-service');
async function processCheckout(cartId) {
  return tracer.startActiveSpan('process-checkout', async (span) => {
    span.setAttribute('cart.id', cartId);
    try {
      const result = await chargeCustomer(cartId);
      span.setAttribute('checkout.success', true);
      return result;
    } catch (err) {
      span.setAttribute('checkout.error', err.message);
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
Exporting to Backends
Jaeger (Local/Dev)
Already configured in our collector above. Access at http://localhost:16686.
Grafana Tempo (Production)
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
In Grafana, add Tempo as a data source pointing to http://tempo:3200. Once exemplars are enabled (see the Prometheus Exemplars section later in this guide), you can drill from metrics (Prometheus) into traces (Tempo) with one click.
Sampling in Production
100% tracing in production is expensive. Use probabilistic sampling:
processors:
  probabilistic_sampler:
    sampling_percentage: 10 # Trace 10% of requests
Or tail-based sampling to keep traces with errors:
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: latency
        type: latency
        latency: {threshold_ms: 1000}
Sampling Strategies for Production
Tracing 100% of requests in production is expensive and unnecessary. Smart sampling reduces costs while preserving signal.
Probabilistic (Head-Based) Sampling
Decides whether to sample a trace when it starts. Fixed percentage, easy to implement:
# In OTel Collector config
processors:
  probabilistic_sampler:
    sampling_percentage: 5 # Sample 5% of all requests
Best for high-volume services where errors are uniformly distributed. Set higher percentages for critical services (payment, auth) and lower for non-critical (logging, analytics).
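Head-based samplers typically derive the decision from the trace ID, so every service that sees the same trace reaches the same verdict without coordinating. The sketch below illustrates that idea in the spirit of the SDKs' trace-ID-ratio sampler. It is a simplification: the collector's `probabilistic_sampler` uses its own hashing scheme, and `should_sample` is an illustrative name.

```python
def should_sample(trace_id_hex: str, sampling_percentage: float) -> bool:
    """Deterministically keep roughly `sampling_percentage` percent of trace IDs."""
    if not 0.0 <= sampling_percentage <= 100.0:
        raise ValueError("sampling_percentage must be between 0 and 100")
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random number.
    low_64_bits = int(trace_id_hex[-16:], 16)
    # Keep the trace if that number falls in the bottom N percent of the range.
    threshold = int(sampling_percentage / 100.0 * (1 << 64))
    return low_64_bits < threshold
```

With `sampling_percentage` set to 5, about one in twenty randomly generated trace IDs falls below the threshold, and a given trace ID always gets the same answer.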
Tail-Based Sampling
Keeps all traces temporarily, then decides which to retain based on properties:
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: high-latency
        type: latency
        latency: {threshold_ms: 2000}
      - name: critical-endpoints
        type: and
        and:
          and_sub_policy:
            - name: http-path-match
              type: string_attribute
              string_attribute: {key: http.route, values: ["/api/checkout", "/api/payment"]}
This keeps every error trace and every slow trace, plus all traces from critical endpoints. Everything else is dropped.
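Conceptually, the collector buffers each trace's spans for `decision_wait`, then keeps the trace if any policy matches. The rough Python model below mirrors the three policies above. The `Span` type, `keep_trace`, and the constants are hypothetical names for illustration; the real processor is configured in YAML, and details such as how it treats a trace that sits exactly on the threshold may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List

CRITICAL_ROUTES = {"/api/checkout", "/api/payment"}
LATENCY_THRESHOLD_MS = 2000


@dataclass
class Span:
    start_ms: int
    end_ms: int
    status: str = "OK"  # "OK", "ERROR", or "UNSET"
    attributes: Dict[str, str] = field(default_factory=dict)


def keep_trace(spans: List[Span]) -> bool:
    """Return True if any of the three policies matches the buffered trace."""
    if not spans:
        return False
    # Policy 1: any span ended with an error status.
    has_error = any(s.status == "ERROR" for s in spans)
    # Policy 2: trace duration, from the earliest start to the latest end.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    is_slow = duration_ms > LATENCY_THRESHOLD_MS
    # Policy 3: any span belongs to a critical route.
    is_critical = any(
        s.attributes.get("http.route") in CRITICAL_ROUTES for s in spans
    )
    return has_error or is_slow or is_critical
```

The decision needs the whole trace, which is why tail-based sampling costs memory: every span has to be held until the wait period ends.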
Sampling Comparison
| Strategy | Cost | Error Capture | Slow Trace Capture | Complexity |
|---|---|---|---|---|
| 100% sampling | Highest | Perfect | Perfect | None |
| Probabilistic (5%) | Low | Misses 95% of errors | Misses 95% of slow | Low |
| Tail-based (errors+slow) | Medium | All errors | All slow traces | Medium |
| Rate limiting (10 traces/sec) | Low | Depends | Depends | Low |
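Recommendation: Use probabilistic sampling at 5-10% as the baseline, and use tail-based sampling for services where you need error and latency guarantees. Tail-based sampling can only keep traces it actually receives, so for those services send all spans to the tail sampler rather than placing a head-based sampler in front of it.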
Manual Instrumentation: Beyond Auto-Instrumentation
Auto-instrumentation covers HTTP, databases, and queues. For business logic, add manual spans.
Node.js: Creating Custom Spans
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('payment-service');
async function processRefund(orderId, amount) {
  // startActiveSpan parents the new span under whatever span is currently active.
  // The callback must be async because it awaits the refund provider.
  return tracer.startActiveSpan('process-refund', async (span) => {
    span.setAttribute('order.id', orderId);
    span.setAttribute('refund.amount', amount);
    try {
      const result = await refundProvider(amount);
      span.setAttribute('refund.success', true);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.setAttribute('refund.error', err.message);
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
Python: Adding Attributes and Events
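from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer("checkout-service")
# The function wrapper and its parameters are assumed here so the early return is valid.
def apply_coupon(coupon_code, cart_total, discount_amount, has_expired):
    with tracer.start_as_current_span("apply-coupon") as span:
        span.set_attribute("coupon.code", coupon_code)
        span.set_attribute("cart.total", cart_total)
        span.add_event("coupon-validated", {"discount": discount_amount})
        if has_expired:
            # Mark the span as failed so the trace shows up in error queries.
            span.set_status(Status(StatusCode.ERROR, "coupon expired"))
            return {"error": "Coupon expired"}
        return {"discount": discount_amount}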
Connecting Traces to Logs and Metrics
The real power of OpenTelemetry comes from trace-log-metric correlation. When a Prometheus alert fires, you should jump directly to the relevant traces.
Log Correlation
Inject trace context into your logs:
// Node.js: Add trace ID to all log entries
const { trace, context } = require('@opentelemetry/api');
function createLogger(service) {
  return {
    info: (msg, attrs = {}) => {
      // Undefined when no span is active; the optional chaining below handles that.
      const span = trace.getSpan(context.active());
      console.log(JSON.stringify({
        service,
        level: 'info',
        message: msg,
        trace_id: span?.spanContext().traceId,
        span_id: span?.spanContext().spanId,
        ...attrs
      }));
    }
  };
}
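The same pattern can be sketched in Python with a `logging.Filter`. To keep the example free of third-party imports, the active span context is modelled by a `contextvars` variable holding integer IDs; in a real service you would read the IDs from the OTel API instead. That substitution, and the names `current_span_ids`, `TraceContextFilter`, and `JsonFormatter`, are assumptions of this sketch.

```python
import contextvars
import json
import logging

# Stand-in for the active span context: a (trace_id, span_id) tuple of ints,
# or None when no span is active.
current_span_ids = contextvars.ContextVar("current_span_ids", default=None)


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        ids = current_span_ids.get()
        # Hex encoding: 32 characters for trace IDs, 16 for span IDs.
        record.trace_id = format(ids[0], "032x") if ids else None
        record.span_id = format(ids[1], "016x") if ids else None
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })
```

Attach the filter and formatter to a handler once at startup, and every log line emitted inside a span carries IDs that a log backend can link back to the trace.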
Prometheus Exemplars
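Exemplars link metrics to traces. Exemplar storage is a Prometheus feature flag, so start Prometheus with --enable-feature=exemplar-storage and size the storage in its config:
# prometheus.yml
storage:
  exemplars:
    max_exemplars: 100000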
Then configure the OTel Collector to export exemplars:
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    add_metric_suffixes: false
    enable_open_metrics: true
    resource_to_telemetry_conversion:
      enabled: true
In Grafana, click a data point on a latency graph to open the corresponding trace in Tempo. This is the observability workflow every SRE team should have.
Deploying the OTel Collector in Production
A single collector instance is fine for development, but production requires a robust deployment topology.
Collector as a Sidecar
Run a collector alongside each application pod in Kubernetes:
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          image: myapp:latest
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel-config.yaml"]
          ports:
            - containerPort: 4317
            - containerPort: 4318
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
Each sidecar collector handles one application's telemetry, providing isolation and independent scaling.
Collector as a Gateway
For centralized processing, deploy a standalone collector deployment:
# collector-gateway.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel-gateway.yaml"]
          ports:
            - containerPort: 4317
            - containerPort: 4318
          resources:
            requests:
              cpu: 1
              memory: 2Gi
Sidecar collectors forward to the gateway, which handles batching, filtering, and export. This topology scales to thousands of pods.
Common Production Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| No memory limits | Collector OOM during traffic spikes | Add memory_limiter processor |
| Missing attributes | Cannot filter traces by service | Enforce attribute requirements in collector |
| Too many spans | $10k+ monthly vendor bill | Implement tail-based sampling |
| Queues not configured | Trace data loss during network issues | Configure the exporter's sending_queue and retry_on_failure (for example max_elapsed_time: 60s) |
| No health checks | Silent collector failures | Enable the health_check extension and probe it; add pprof for debugging |
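# Production collector config with safeguards
processors:
  memory_limiter: # list this first in every pipeline
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 1s
    send_batch_size: 8192
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
    # Queueing and retry are exporter settings in current collector releases;
    # the old queued_retry processor no longer exists.
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 5000
    retry_on_failure:
      enabled: true
      max_elapsed_time: 60s
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
service:
  extensions: [health_check, pprof]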
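The two `memory_limiter` numbers work together. As the processor's documentation describes it, data is refused once usage passes a soft limit of `limit_mib - spike_limit_mib`, and garbage collection is additionally forced once usage passes the hard limit `limit_mib`. The function below is a simplified model of that check; its name and return labels are illustrative, and it is not the collector's code.

```python
def memory_limiter_action(used_mib: int, limit_mib: int = 512,
                          spike_limit_mib: int = 128) -> str:
    """Classify current memory usage against the soft and hard limits."""
    if spike_limit_mib >= limit_mib:
        raise ValueError("spike_limit_mib must be smaller than limit_mib")
    # Soft limit: 512 - 128 = 384 MiB with the default arguments.
    soft_limit_mib = limit_mib - spike_limit_mib
    if used_mib >= limit_mib:
        return "refuse_and_force_gc"
    if used_mib >= soft_limit_mib:
        return "refuse"
    return "accept"
```

With `limit_mib: 512` and `spike_limit_mib: 128`, the collector starts pushing back at roughly 384 MiB, which leaves headroom for a burst that arrives between two `check_interval` ticks.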
Choosing an Observability Backend
| Backend | Strengths | Best For |
|---|---|---|
| Jaeger | Simple setup, UI focused on traces | Small teams, dev/staging |
| Grafana Tempo | Scales to petabyte traces, integrates with Grafana | Production, multi-service |
| Honeycomb | High cardinality, SLO-based alerting | Teams doing production debugging |
| Datadog | Full APM with traces, metrics, logs | Enterprise, compliance-heavy |
| SigNoz | Open-source full stack, built on OpenTelemetry | Cost-conscious teams, self-hosted |
All accept OTLP, so switching backends requires zero application code changes. This is the OpenTelemetry advantage.
Conclusion
OpenTelemetry eliminates vendor lock-in for observability. Instrument once, export anywhere—Jaeger today, Honeycomb tomorrow, no code changes.
Start with auto-instrumentation and the OTel Collector as a sidecar. It gives you 80% of the value (HTTP, database, queue tracing) with nearly zero code. Add manual instrumentation for critical business flows.
The real power kicks in when you connect traces with logs and metrics in Grafana. One click from a latency spike to the exact slow database query causing it.