Introduction
SLI/SLO theory is easy. Implementation is hard. Most teams define SLOs on a whiteboard — "99.9% uptime!" — and never wire them into their monitoring. When an incident hits, they scramble through dashboards, grep logs, and guess whether to escalate.
This guide builds a complete SLI/SLO implementation with Prometheus recording rules, Grafana dashboards, and burn rate alerts. By the end, you will have SLOs that wake you up only when it matters.
If you need the conceptual foundation first, start with our SLI vs SLO vs SLA guide. Then return here for the implementation.
Architecture Overview
The implementation pipeline has four stages:
- Step 1: SLI measurement. Prometheus recording rules compute latency P99, error rate, and throughput from raw application metrics.
- Step 2: SLO calculation. Prometheus rules compute error budget remaining from SLI data.
- Step 3: Grafana visualization. SLO dashboards show budget burn rate, remaining budget, and compliance over time.
- Step 4: Alerting. Prometheus alert rules trigger when burn rate exceeds threshold.
All of this lives in a single slo-rules.yml file and a single Grafana dashboard JSON, with no external services and no SaaS dependency.
Step 1: SLI Recording Rules
Create slo-rules.yml in your Prometheus config directory. Start with four request-driven SLIs (P99 latency, error rate, availability, and throughput):
groups:
  - name: sli.rules
    interval: 30s
    rules:
      - record: job:latency_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      - record: job:error_rate:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
      - record: job:availability:ratio5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            /
            sum(rate(http_requests_total[5m])) by (job)
          )
      - record: job:throughput:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
Key design decisions:
- by (job) ensures per-service SLIs. One set of rules serves all your services. Change job to service or namespace to match your label topology.
- histogram_quantile(0.99, ...), not average. Users at P99 experience your worst performance. Average latency hides the tail.
- availability is error_rate inverted. "99.9% available" is clearer for stakeholders than "0.1% error rate."
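To make the quantile-versus-average point concrete, here is a minimal Python sketch of the kind of bucket interpolation that histogram_quantile performs. It is an illustration rather than Prometheus's actual implementation: the function name mirrors the PromQL one only for readability, the bucket data is hypothetical, and the lowest bucket is assumed to start at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # The rank lands in the +Inf bucket: the best available
                # answer is the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical data: 1000 requests, most of them fast, a few very slow.
example_buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
p50 = histogram_quantile(0.50, example_buckets)
p99 = histogram_quantile(0.99, example_buckets)
print(f"p50 ~ {p50:.3f}s, p99 ~ {p99:.3f}s")
```

With this sample data the median comes out at roughly 0.06 s while the P99 is roughly 0.83 s, which is the tail that an average would hide. The estimate is only as precise as the bucket boundaries, so bucket layout matters for latency SLIs.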
These rules produce four clean SLI metrics per service. Apply them:
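# Plain Prometheus: list the file under rule_files, validate it, then reload the server
promtool check rules slo-rules.yml
# Prometheus Operator: wrap the groups in a PrometheusRule resource first, then apply it
kubectl apply -f slo-rules.yml -n monitoring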
Verify on the rules page of the Prometheus UI (under the Status menu) that all rules are evaluating.
Step 2: SLO Calculation Rules
SLOs answer: "How much of our error budget is left this month?" Add these rules to the same file:
  - name: slo.rules
    interval: 60s
    rules:
      # 0.999 is the SLO target (slo_target), written as a literal because PromQL has no named constants.
      - record: job:slo:error_budget_remaining_ratio
        expr: |
          1 - (
            (
              sum(rate(http_requests_total{status=~"5.."}[30d])) by (job)
              /
              sum(rate(http_requests_total[30d])) by (job)
            )
            /
            (1 - 0.999)
          )
      - record: job:slo:burn_rate_1h
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) by (job)
            /
            sum(rate(http_requests_total[1h])) by (job)
          )
          /
          (1 - 0.999)
      - record: job:slo:burn_rate_6h
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h])) by (job)
            /
            sum(rate(http_requests_total[6h])) by (job)
          )
          /
          (1 - 0.999)
Explanation of the math:
- slo_target stands for your SLO as a ratio (e.g., 0.999 for 99.9%). PromQL has no named constants, so the value has to appear in each expression as a literal number.
- Error budget remaining ratio = 1 - (actual error ratio / allowed error ratio), where the allowed error ratio is 1 - slo_target. When this hits 0, the budget is exhausted. Below 0, it is overspent.
- Burn rate = actual error consumption rate / allowed error consumption rate. A burn rate of 1 means consuming budget exactly on schedule. A burn rate of 14.4 means consuming budget 14.4x too fast: a full 30-day budget would be gone in about 50 hours.
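The arithmetic behind these definitions can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names and sample error ratios are hypothetical, the SLO window is taken to be 30 days, and the remaining budget is computed as one minus the ratio of the observed error ratio to the allowed one.

```python
def burn_rate(error_ratio, slo_target):
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1 - slo_target)


def budget_remaining(error_ratio_30d, slo_target):
    """Fraction of the 30-day error budget still unspent (negative if overspent)."""
    return 1 - error_ratio_30d / (1 - slo_target)


def hours_to_exhaustion(rate, window_days=30):
    """Hours until a full budget is gone if the burn rate stays constant."""
    return window_days * 24 / rate


slo_target = 0.999  # 99.9% SLO, so 0.1% of requests may fail

# Hypothetical observations: 0.04% errors over 30 days, 1.44% over the last hour.
print("budget remaining:", round(budget_remaining(0.0004, slo_target), 3))
print("1h burn rate:", round(burn_rate(0.0144, slo_target), 2))
print("hours to exhaustion at 14.4x:", round(hours_to_exhaustion(14.4), 1))
```

In this example 60% of the budget is left, the last hour burned at 14.4x, and that pace would use up a full budget in about 50 hours.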
The burn rate metrics are the foundation of meaningful alerting. Instead of "error rate > 1%," you alert on "burn rate > 14.4 over the last hour," a condition that means "this is consuming budget fast enough to matter."
For the complete error budget methodology, our error budgets SRE guide explains the theory behind these numbers and how to negotiate SLOs with product teams.
Step 3: Grafana Dashboard
Create a Grafana dashboard with these panels:
Panel 1: SLO Compliance Gauge
PromQL queries for a gauge showing "99.9% SLO: 87% of error budget remaining this month":
# Current error budget remaining (as percentage)
100 * job:slo:error_budget_remaining_ratio{job="payment-service"}
# SLO target line (constant)
100 * 0.999
Configure the gauge to show green above 50%, yellow 25-50%, red below 25%.
Panel 2: Burn Rate Over Time
# 1-hour burn rate (spike detection)
job:slo:burn_rate_1h{job="payment-service"}
# 6-hour burn rate (sustained detection)
job:slo:burn_rate_6h{job="payment-service"}
Add horizontal thresholds at 1 (expected), 6 (significant), and 14.4 (critical). When the 1h line crosses 14.4, an alert should fire.
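The guide does not derive the 14.4 and 6 thresholds. They appear to match the figures commonly cited from Google's SRE Workbook, where a threshold is the burn rate that would spend a chosen fraction of the budget within the alert window. The sketch below reproduces them under that assumption. The function name and the budget fractions (2% in 1 hour, 5% in 6 hours) are not taken from this guide.

```python
def burn_rate_threshold(budget_fraction, alert_window_hours, slo_window_days=30):
    """Burn rate that spends `budget_fraction` of the budget in `alert_window_hours`."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours


# Assumed budget fractions per alert window, for a 30-day SLO.
fast = burn_rate_threshold(0.02, 1)   # 2% of the budget in 1 hour
slow = burn_rate_threshold(0.05, 6)   # 5% of the budget in 6 hours
print(round(fast, 2), round(slow, 2))
```

If your SLO window is not 30 days, or you want to page on a different budget fraction, recompute the thresholds rather than reusing 14.4 and 6.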
Panel 3: Error Budget Timeline
# Error budget remaining over the last 30 days
job:slo:error_budget_remaining_ratio{job="payment-service"}
# Projected budget remaining 30 days from now, extrapolated from the last 30 days
predict_linear(
  job:slo:error_budget_remaining_ratio{job="payment-service"}[30d],
  30 * 86400
)
The predict_linear line shows where the budget will be in 30 days if the current trend continues. If the projected value is below 0, the budget will run out within that horizon and you have a problem.
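predict_linear fits a straight line to the samples in the range by simple linear regression and extrapolates it. The Python sketch below shows that idea on hypothetical daily samples. It is a simplification: it extrapolates from the timestamp of the last sample, whereas Prometheus extrapolates from the evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, extrapolated forward."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / len(samples)
    mean_v = sum(v for _, v in samples) / len(samples)
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / spread
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


DAY = 86400
# Hypothetical history: budget falls by 3 percentage points per day for 10 days.
history = [(day * DAY, 1.0 - 0.03 * day) for day in range(11)]

projected = predict_linear(history, 30 * DAY)
print("projected budget in 30 days:", round(projected, 3))
print("on track" if projected > 0 else "budget will run out within 30 days")
```

Here the projection is negative, so the panel would show the line dipping below zero before the horizon is reached.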
Panel 4: SLI Detail (Error Rate, Latency, Throughput)
Three individual panels showing the raw SLIs:
# Error rate
job:error_rate:ratio5m{job="payment-service"}
# P99 latency
job:latency_seconds:p99{job="payment-service"}
# Throughput
job:throughput:rate5m{job="payment-service"}
Export the dashboard as JSON from Grafana and commit it to your infrastructure repo. Infrastructure as code includes dashboards.
Step 4: Burn Rate Alerting Rules
Prometheus alert rules trigger when burn rate exceeds critical thresholds:
groups:
  - name: slo.alerts
    rules:
      - alert: SLOErrorBudgetCriticalBurn
        expr: |
          job:slo:burn_rate_1h > 14.4
          and
          job:slo:burn_rate_6h > 6.0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "SLO error budget burning critically for {{ $labels.job }}"
          # Other series are not available through $labels, so look them up with query.
          description: |
            1h burn rate: {{ $value | humanize }}
            6h burn rate: {{ with printf "job:slo:burn_rate_6h{job='%s'}" $labels.job | query }}{{ . | first | value | humanize }}{{ end }}
            Error budget remaining: {{ with printf "job:slo:error_budget_remaining_ratio{job='%s'}" $labels.job | query }}{{ . | first | value | humanizePercentage }}{{ end }}
      - alert: SLOErrorBudgetWarning
        expr: |
          job:slo:burn_rate_6h > 1.0
          and
          job:slo:error_budget_remaining_ratio < 0.90
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "SLO error budget burning above expected for {{ $labels.job }}"
Why multi-window alerts?
The and clause between 1h and 6h burn rates eliminates false positives. A single spike at 15x burn rate for 2 minutes — which could be a transient network blip — will NOT trigger the alert because the 6h window smooths it out. Only sustained budget consumption triggers pages.
This is the core insight from Google's SRE alerting methodology: alert on symptoms, not causes, and only when budget is at risk.
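A small simulation can show how the two windows interact. The Python sketch below is a simplification under stated assumptions: traffic is constant, each list entry is the burn rate during one minute, the helper names and scenarios are hypothetical, and the "for: 5m" hold time is ignored.

```python
def window_burn_rate(per_minute_burn, window_minutes):
    """Mean burn rate over the most recent window (constant traffic assumed)."""
    recent = per_minute_burn[-window_minutes:]
    return sum(recent) / len(recent)


def should_page(per_minute_burn, fast=14.4, slow=6.0):
    """True only when both the 1h and the 6h burn rates exceed their thresholds."""
    return (window_burn_rate(per_minute_burn, 60) > fast
            and window_burn_rate(per_minute_burn, 360) > slow)


# Six hours of history, one value per minute.
blip = [0.0] * 358 + [15.0] * 2          # 2 minutes at 15x
short_burst = [0.0] * 330 + [40.0] * 30  # 30 minutes at 40x
outage = [0.0] * 180 + [20.0] * 180      # 3 hours at 20x

for name, series in [("blip", blip), ("short_burst", short_burst), ("outage", outage)]:
    print(name,
          round(window_burn_rate(series, 60), 2),
          round(window_burn_rate(series, 360), 2),
          should_page(series))
```

The two-minute blip never lifts either window past its threshold. The 30-minute burst pushes the 1h rate to 20x but the 6h rate stays near 3.3x, so the and clause suppresses it. Only the sustained outage pages.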
Step 5: Error Budget Policy
With SLOs instrumented, define what happens when budget burns:
# error-budget-policy.yml — stored in git alongside slo-rules
policies:
  - name: production-critical
    slo_target: 0.999
    windows: [30d]
    actions:
      - condition: "error_budget_remaining > 0.50"
        action: "feature_velocity_normal"
        description: "Ship features. Reliability on track."
      - condition: "error_budget_remaining > 0.25"
        action: "feature_freeze"
        description: "All eng time to reliability improvements. Escalate to EM."
      - condition: "error_budget_remaining <= 0.25"
        action: "incident_mode"
        description: "Halt all changes. Incident commander on-call."
This is not automation — it is a social contract stored in git, next to the SLO definitions. When the Grafana dashboard shows red, the team knows what to do without a meeting.
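Although the policy is a human agreement rather than automation, its tiers can be written as a small lookup, for example to label a dashboard or a weekly report. The Python sketch below assumes the conditions are evaluated top to bottom and the first match wins. The function name is hypothetical, and the action strings are copied from the policy file.

```python
def policy_action(budget_remaining):
    """Map the remaining error budget (as a ratio) to the agreed policy action."""
    if budget_remaining > 0.50:
        return "feature_velocity_normal"
    if budget_remaining > 0.25:
        return "feature_freeze"
    return "incident_mode"


for remaining in (0.87, 0.40, 0.25, -0.10):
    print(f"{remaining:+.2f} -> {policy_action(remaining)}")
```

Note that exactly 0.25 falls into incident_mode, matching the "<= 0.25" condition, and an overspent (negative) budget does too.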
Production Considerations
- Run SLO rules on a dedicated Prometheus. SLO queries span 30-day windows, which can be expensive. Use a separate Prometheus instance with long retention, or a long-term storage layer such as Thanos, Cortex, or Mimir, for SLO data.
- Downsampling for long windows. Querying 30-day raw data for 50 services creates load. Use recording rules with 6h resolution for the 30-day window and combine with 5m resolution for the 1h window.
- Multi-tenancy. If you run SLOs for multiple teams, use label-based tenancy in Grafana. Each team sees only their job labels.
- GitOps everything. Prometheus rules, Grafana dashboards, and error budget policies live in git. Changes go through PR review, with no manual edits in the UI.
For teams implementing distributed tracing alongside SLOs, our OpenTelemetry setup guide covers instrumenting applications with traces that correlate with these SLO metrics. When latency SLO burns, follow the trace to find the slow service.
SLO Implementation Checklist
- SLI recording rules applied and evaluating (verify in Prometheus UI)
- SLO calculation rules producing error budget metrics
- Grafana dashboard showing compliance gauge, burn rate, and error budget timeline
- Multi-window burn rate alerts configured and tested (trigger a canary alert)
- Error budget policy documented in git
- On-call team understands: "when SLO alert fires, stop features and fix reliability"
SLOs are not a monitoring feature — they are an engineering discipline. When every service has a measured, alertable SLO, the question "should we wake up at 3 AM for this?" has a data-driven answer.