Introduction
SLOs (Service Level Objectives) are the core reliability contract between an SRE team and their users. Without SLOs, every incident feels urgent. With SLOs, you know whether to wake up at 3 AM — or to fix it Monday morning.
This guide covers defining SLOs in Prometheus, setting up burn rate alerts, and building error budget policies that prevent both over-alerting and under-investing in reliability.
SLIs: How to Measure the Right Things
An SLI (Service Level Indicator) is a metric that reflects what users actually experience. The four golden signals are the usual starting point:
- Latency. P99 response time of HTTP requests, not the average. Averages hide the tail, and users in the tail are the ones writing angry tweets.
- Error rate. Percentage of requests returning 5xx. Exclude 4xx (client errors) unless they represent a service-side problem.
- Throughput. Requests per second. Drops in throughput are early warning signals.
- Saturation. CPU, memory, connection pool exhaustion. The "how full is the bucket" metrics.
Prometheus recording rules for SLIs:
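groups:
  - name: sli.rules
    rules:
      # P99 latency per job; buckets are summed by le before taking the quantile
      - record: job:request_latency_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      # Fraction of requests returning 5xx over the last 5 minutes, per job
      - record: job:error_rate:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))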
These recording rules produce clean SLI metrics — ready for SLO calculations.
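To make the error-rate SLI concrete, here is a minimal Python sketch of the same ratio computed from two readings of a request counter and an error counter. The function name and the sample numbers are illustrative assumptions, not part of Prometheus.

```python
def error_ratio(total_start, total_end, errors_start, errors_end):
    """Fraction of requests that failed between two counter readings."""
    total = total_end - total_start
    errors = errors_end - errors_start
    if total <= 0:
        # Assumption: treat a window with no traffic as error-free.
        # PromQL would not return 0 here; a 0/0 division gives NaN.
        return 0.0
    return errors / total


# Hypothetical counter readings taken 5 minutes apart.
ratio = error_ratio(total_start=120_000, total_end=150_000,
                    errors_start=300, errors_end=345)
print(f"{ratio:.4%}")  # prints 0.1500%
```

In this made-up window, 45 of 30,000 requests failed, which is above the 0.1% that a 99.9% target allows.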
SLOs: Turning SLIs Into Commitments
An SLO is a target value for an SLI over a window. The canonical SRE SLO: "99.9% of requests succeed over 30 days." Define SLOs as Prometheus recording rules:
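# Remaining budget = 1 - (observed error ratio / allowed error ratio)
- record: job:slo_error_budget_remaining:ratio30d
  expr: |
    1 - (
      sum(rate(http_requests_total{status=~"5.."}[30d]))
      /
      sum(rate(http_requests_total[30d]))
    ) / (1 - 0.999)
This ratio starts at 1.0 when there have been no errors. When it drops below 0.5, half the budget is burned. When it reaches 0, the error budget for the 30-day window is exhausted, and a negative value means the SLO has been missed.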
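The budget arithmetic is easy to check by hand: the fraction of budget remaining is 1 minus the observed error ratio divided by the error ratio the SLO allows. The Python sketch below works through that calculation. The function names and sample values are assumptions chosen for illustration.

```python
def budget_remaining(error_ratio, slo_target):
    """Fraction of the error budget still unspent (1.0 = untouched, 0 = exhausted)."""
    allowed = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return 1.0 - error_ratio / allowed


def allowed_bad_minutes(slo_target, window_days=30):
    """Minutes of total outage the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


# Hypothetical 30-day error ratios against a 99.9% target.
print(round(budget_remaining(0.0005, 0.999), 3))   # half the budget left
print(round(budget_remaining(0.0020, 0.999), 3))   # negative: budget overspent
print(round(allowed_bad_minutes(0.999), 1))        # about 43.2 minutes per 30 days
```

A 99.9% target over 30 days leaves roughly 43 minutes of full outage, which is a useful sanity check when choosing the target.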
Burn Rate Alerts: Stop Alerting on Every Error
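Raw error rate alerts are noise. A brief spike may resolve in 30 seconds. Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. Burn rate alerts use a rolling window to alert only when error budget is being consumed fast enough to matter: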
groups:
  - name: slo.alerts
    rules:
      - alert: HighErrorBurnRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))
          > 14.4 * (1 - 0.999)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burn rate critical"
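The 14.4 multiplier follows the alerting methodology from the Google SRE workbook: a burn rate of 14.4 consumes 2% of a 30-day error budget in 1 hour (0.02 × 720 hours / 1 hour = 14.4). At that pace the whole budget is gone in about 50 hours. For comparison, exhausting the entire 0.1% budget within a single hour would be a burn rate of 720. This threshold ensures only budget-threatening events trigger pages.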
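The burn rate thresholds can be derived rather than memorized. The sketch below assumes the usual relationship from the SRE workbook's alerting chapter: threshold = (fraction of budget you are willing to spend) × (SLO window) / (alert window). The function names are hypothetical.

```python
def burn_rate_threshold(budget_fraction, alert_window_hours, slo_window_days=30):
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def hours_to_exhaustion(burn_rate, slo_window_days=30):
    """Hours until a full budget is gone at a constant burn rate."""
    return slo_window_days * 24 / burn_rate


print(round(burn_rate_threshold(0.02, 1), 2))   # 2% of budget in 1 hour
print(round(burn_rate_threshold(0.05, 6), 2))   # 5% of budget in 6 hours
print(round(hours_to_exhaustion(14.4), 1))      # time left at the paging threshold
```

Multiplying a threshold by the allowed error ratio (0.001 for a 99.9% SLO) gives the error ratio that the alert expression compares against.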
For slower-burning incidents, add a secondary alert:
- alert: MediumErrorBurnRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[6h]))
    /
    sum(rate(http_requests_total[6h]))
    > 6.0 * (1 - 0.999)
  for: 15m
  labels:
    severity: ticket
This opens a ticket rather than paging. A burn rate of 6 sustained over 6 hours consumes 5% of the 30-day budget and would exhaust it in 5 days if the trend continued, so the team can respond during working hours before the budget runs out.
Error Budget Policies
Error budget policies are pre-agreed rules that govern team behavior:
- Budget above 50%: Ship features at full speed. Reliability is within target.
- Budget 25-50%: Freeze feature work. All engineering time goes to reliability improvements.
- Budget below 25%: Halt all changes. Escalate to leadership. Incident response mode.
These policies prevent the "we will fix reliability next sprint" trap that burns out SRE teams.
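A policy like this is simple enough to encode, which helps when the decision should appear on a dashboard or in a deploy gate. The following is a minimal sketch. The function name, the returned labels, and the handling of the exact 25% and 50% boundaries are assumptions, since the tiers above do not specify them.

```python
def budget_policy(remaining):
    """Map the fraction of error budget remaining to a policy tier.

    Assumption: exactly 50% and exactly 25% both fall into the
    middle tier, matching a literal reading of "25-50%".
    """
    if remaining > 0.50:
        return "ship-features"
    if remaining >= 0.25:
        return "freeze-features"
    return "halt-changes"


for remaining in (0.80, 0.40, 0.10):
    print(f"{remaining:.0%} budget left -> {budget_policy(remaining)}")
```

Whatever the exact boundaries, the point is that they are agreed in advance, so the function's output is not up for debate during an incident.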
Multi-Window Burn Rate Alerts
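Google's SRE workbook recommends multi-window alerts for high-severity SLOs: the alert fires only when the burn rate exceeds its threshold over two windows at once. The workbook pairs each long window with a short one a twelfth of its length (1h with 5m, 6h with 30m). The variant below instead combines the 1h window for detection speed with the 6h window for noise reduction: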
- alert: CriticalErrorBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
      > 14.4 * (1 - 0.999)
    )
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      / sum(rate(http_requests_total[6h]))
      > 6.0 * (1 - 0.999)
    )
  for: 5m
  labels:
    severity: page
Both conditions must be true: the 1-hour burn rate is critical AND the 6-hour trend confirms it. This filters out short spikes that would trip the 1-hour window on its own. Because of the for: 5m clause, the page goes out once both conditions have held for 5 minutes.
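The two-window logic can be sanity-checked offline against sample error ratios. The sketch below mirrors the thresholds in the rule above (14.4 for 1h, 6.0 for 6h). The function name and the example ratios are made up for illustration.

```python
def should_page(ratio_1h, ratio_6h, slo_target=0.999):
    """True when both the 1h and 6h error ratios exceed their burn-rate thresholds."""
    allowed = 1.0 - slo_target                 # allowed error ratio, e.g. 0.001
    fast_burn = ratio_1h > 14.4 * allowed      # 1h window: is it burning fast right now?
    sustained = ratio_6h > 6.0 * allowed       # 6h window: has it been going on for a while?
    return fast_burn and sustained


# Hypothetical scenarios (error ratios as fractions of requests).
print(should_page(ratio_1h=0.020, ratio_6h=0.001))  # False: the 6h window does not confirm the spike
print(should_page(ratio_1h=0.020, ratio_6h=0.008))  # True: both windows exceed their thresholds
print(should_page(ratio_1h=0.010, ratio_6h=0.008))  # False: 1h burn is below the 14.4x threshold
```

Running a handful of past incidents through a check like this is a cheap way to see whether the thresholds would have paged at the right moments.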
For the full SLO framework — from SLI definition through error budget governance — see our SLI vs SLO vs SLA guide. For the incident management process that activates when budget burns, our SRE incident management runbook provides templates and escalation paths.
SLOs are not about perfection. They are about making data-driven decisions on when to ship features and when to fix production.