Introduction
SLO, SLI, SLA. Three acronyms that form the foundation of Site Reliability Engineering. Yet most engineers confuse them—or worse, set them arbitrarily without understanding the consequences.
Here's the one-sentence version: SLIs are what you measure, SLOs are the target you promise, and SLAs are the contractual consequences of missing that target.
This guide makes these concepts concrete with real examples and shows you how to implement SLO-based alerting.
SLI: Service Level Indicator
An SLI is a quantitative measure of some aspect of your service. It answers: "What are we measuring?"
Common SLIs:
| Category | SLI Example | How to Measure |
|----------|------------|----------------|
| Availability | Proportion of successful requests | 200-499 responses / total requests |
| Latency | P99 request duration | Histogram of response times |
| Throughput | Requests per second | Counter of requests |
| Error rate | Proportion of failed requests | 500+ responses / total requests |
| Freshness | Data staleness | now() - last_update_timestamp |
| Durability | Data retention | Successful writes / total writes |
Choosing Good SLIs
Bad SLI: "The system should be fast."
Good SLI: "The proportion of checkout API requests that complete in under 500ms, measured over a 5-minute rolling window."
Every SLI needs:
- A metric — What to measure (latency, error rate, etc.)
- A threshold — What counts as "good" (under 500ms)
- A window — Over what time period (5 minutes)
Implementation with Prometheus:
# SLI: Proportion of fast requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
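To make the ratio concrete, here is a minimal Python sketch that computes the same "fast requests / all requests" proportion from a list of raw request durations. The function name `fast_request_ratio` and the sample data are illustrative assumptions, not part of any library, and treating an empty window as fully good is a policy choice you may want to change.

```python
from typing import Sequence


def fast_request_ratio(durations_s: Sequence[float], threshold_s: float = 0.5) -> float:
    """Return the fraction of requests that finished within threshold_s seconds."""
    if not durations_s:
        # Assumption: a window with no traffic counts as fully good.
        return 1.0
    good = sum(1 for d in durations_s if d <= threshold_s)
    return good / len(durations_s)


# Hypothetical durations (seconds) observed in one 5-minute window.
window = [0.12, 0.48, 0.51, 0.30]
print(fast_request_ratio(window))  # 0.75
```

The `<=` comparison mirrors the `le="0.5"` bucket in the PromQL query, which counts requests that took at most 0.5 seconds.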
SLO: Service Level Objective
An SLO is a target value for an SLI over a time period. It answers: "How good does it need to be?"
Example: "99.9% of checkout API requests complete in under 500ms, measured over a 30-day rolling window."
Setting SLOs: Don't Start at 99.999%
Set SLOs based on user expectations, not engineering ambition:
| Service | Typical SLO | Why |
|---------|------------|-----|
| User-facing web app | 99.9% | Users notice but don't abandon at 99.9% |
| Payment processing | 99.99% | Revenue directly impacted |
| Internal admin dashboard | 99% | Only employees affected |
| Batch processing pipeline | 99.5% | Delayed data acceptable |
| CI/CD pipeline | 99% | Developers can retry |
The golden rule: Set your SLO slightly stricter than the level at which users start to notice problems. If users complain at 99% availability, set the SLO at 99.5%. The gap between the SLO and 100% is your error budget, which you can spend on innovation.
Error Budget
Error budget = 100% - SLO. For a 99.9% SLO over 30 days:
Allowed downtime = 30 days × 24 hours × 60 minutes × 0.1%
= 43.2 minutes per month
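The same arithmetic can be sketched in Python for any SLO and window length. The helper name `allowed_downtime_minutes` is an assumption made for illustration, and the calculation treats the SLO as time-based (minutes of full outage).

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime a time-based SLO permits over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO: {allowed_downtime_minutes(slo):.1f} minutes per 30 days")
```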
Track how much of your error budget remains:
# Fraction of error budget remaining over the last 30 days (99.9% SLO)
1 - (
  (
    sum(rate(http_requests_total{status=~"5.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
  ) / (1 - 0.999)
)
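As a rough illustration of what that query returns, the sketch below does the same calculation in Python. `budget_remaining` is a hypothetical helper, and the 0.04% error ratio is made-up sample data.

```python
def budget_remaining(error_ratio: float, slo: float) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return 1.0 - error_ratio / budget


# Hypothetical: 0.04% of requests failed over the window, against a 99.9% SLO.
print(f"{budget_remaining(0.0004, 0.999):.0%} of the budget is left")  # 60%
```

A result of 0 means the budget is exactly spent, and anything below 0 means the SLO has been missed for the window.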
SLO-Based Alerting
Traditional alerting: "CPU > 80% → page on-call"
SLO-based alerting: "We're burning error budget 10x faster than normal → investigate during business hours. Burn rate is 100x → page immediately."
# Prometheus alert rule
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (10 * 0.001)  # 10x burn rate for a 99.9% SLO (budget = 0.001)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning fast"
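Burn rate is simply the observed error ratio divided by the error budget. The Python sketch below shows that relationship and maps it onto the 10x and 100x responses described above. The function names and the "ticket" and "page" labels are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def severity(rate: float, ticket_at: float = 10.0, page_at: float = 100.0) -> str:
    """Map a burn rate to a response; thresholds follow the 10x / 100x example."""
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "ok"


# Hypothetical error ratios against a 99.9% SLO.
for ratio in (0.0005, 0.02, 0.5):
    rate = burn_rate(ratio, 0.999)
    print(f"error ratio {ratio}: {rate:.1f}x -> {severity(rate)}")
```

A burn rate of 1x means the budget would be spent exactly at the end of the SLO window.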
SLA: Service Level Agreement
An SLA is a contract with your users or customers. It answers: "What happens if we miss our SLO?"
SLAs include:
- The promise — Service will be available 99.9% of the time
- The measurement — Measured monthly, excluding planned maintenance
- The consequence — 5% service credit for each 0.1% below target
Key distinction: SLOs are internal targets. SLAs are external contracts. Don't publish your SLOs as SLAs unless you're willing to pay the consequences.
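To see how quickly the consequence adds up, here is a small Python sketch of the example credit clause (5% credit per 0.1% below a 99.9% target). The function name, the choice to count only whole 0.1-point steps, and the 100% cap are assumptions; a real contract defines its own rounding and limits.

```python
from decimal import Decimal


def service_credit_percent(
    measured_pct: str,
    target_pct: str = "99.9",
    credit_per_step: int = 5,
    step_pct: str = "0.1",
    cap: int = 100,
) -> int:
    """Credit owed (as a percentage of the bill) for a measured availability."""
    # Decimal avoids float rounding when comparing values like 99.9 and 99.7.
    shortfall = Decimal(target_pct) - Decimal(measured_pct)
    if shortfall <= 0:
        return 0
    # Assumption: only whole 0.1-point steps below target earn credit.
    full_steps = int(shortfall / Decimal(step_pct))
    return min(full_steps * credit_per_step, cap)


print(service_credit_percent("99.7"))   # 10
print(service_credit_percent("99.95"))  # 0
```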
SLA vs SLO Example
| | SLO (Internal) | SLA (Customer-Facing) |
|---|---|---|
| Target | 99.9% availability | 99.5% availability |
| Audience | Engineering team | Paying customers |
| Consequence | Feature freeze if missed | Service credits issued |
| Measurement | All requests | Only authenticated requests |
Always set your SLO stricter than your SLA. This gives you a buffer: you breach the internal target, and can react, before the contractual threshold is reached.
Defining Good SLIs
Not every metric deserves to be an SLI. Good SLIs share three properties:
- Customer-facing — They measure what users actually experience, not internal system details
- Actionable — When the SLI breaches, the owning team knows what to fix
- Measurable — The metric can be collected reliably from production traffic
Common SLI Categories
| Category | SLI Examples | Measurement Method |
|----------|-------------|-------------------|
| Availability | Request success rate, uptime percentage | HTTP status codes, health check probes |
| Latency | p50/p95/p99 response time | Request duration histograms |
| Throughput | Requests per second, transactions per minute | Rate counters |
| Durability | Data loss rate, replication lag | Event counters, database metrics |
| Freshness | Time since last data update | Timestamp comparisons |
The Four Golden Signals
Google's SRE book defines four signals that cover most service health:
# 1. Latency: p99 response time (last 5 minutes)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# 2. Traffic: requests per second
sum(rate(http_requests_total[5m]))
# 3. Errors: error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# 4. Saturation: CPU utilization (1 minus the idle fraction, averaged across cores)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
Every service should have at least one SLI from each signal. Start with latency and errors, add traffic and saturation as you mature.
Setting Realistic SLO Targets
SLOs that are too tight cause alert fatigue. SLOs that are too loose hide real problems.
The 4-Step SLO Process
- Measure current performance for 30 days to establish a baseline
- Set a target that is achievable but requires improvement (e.g., if current p99 is 500ms, target 300ms)
- Define error budget based on the target: (1 - SLO) * total requests for the window
- Review quarterly: tighten or relax based on user impact data
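The error budget formula in the third step can be sketched in Python as follows. The helper names `allowed_errors` and `errors_left` and the traffic numbers are hypothetical. The sketch rounds the budget down to a whole request, which is a conservative assumption.

```python
import math
from fractions import Fraction


def allowed_errors(slo: float, total_requests: int) -> int:
    """Whole number of failed requests the SLO tolerates in the window."""
    # Fraction(str(...)) keeps values like 0.999 exact and avoids float rounding.
    budget_fraction = 1 - Fraction(str(slo))
    return math.floor(budget_fraction * total_requests)


def errors_left(slo: float, total_requests: int, errors_so_far: int) -> int:
    """Remaining request-based budget; negative means the SLO is already missed."""
    return allowed_errors(slo, total_requests) - errors_so_far


# Hypothetical: 10 million requests in the window, 2,500 failures so far.
print(allowed_errors(0.999, 10_000_000))      # 10000
print(errors_left(0.999, 10_000_000, 2_500))  # 7500
```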
SLO Target Examples by Service Criticality
| Criticality | Availability SLO | Latency SLO (p99) | Error Budget (30 days) |
|-------------|-----------------|-------------------|----------------------|
| Critical (payment, auth) | 99.99% | 100ms | 4.3 minutes downtime |
| High (core API, cart) | 99.9% | 200ms | 43 minutes downtime |
| Medium (recommendations) | 99.5% | 500ms | 3.6 hours downtime |
| Low (admin dashboard) | 99.0% | 2s | 7.2 hours downtime |
Critical services leave almost no room for error. That is intentional — it forces investment in redundancy, auto-scaling, and multi-region architecture.
Error Budget Math in Practice
An error budget is the number of failures your service can have in a period before violating the SLO.
Calculating Error Budget
# Error budget for a 99.9% SLO over 30 days
# (Pseudocode: each right-hand side is a PromQL expression; the names on the
# left are labels for readability, since PromQL has no variable assignment.)
# Total requests in 30 days
total_requests = sum(increase(http_requests_total[30d]))
# Maximum allowed errors
max_errors = total_requests * (1 - 0.999) # 0.1% of all requests
# Current errors
current_errors = sum(increase(http_requests_total{status=~"5.."}[30d]))
# Budget remaining (%)
budget_remaining = (1 - current_errors / max_errors) * 100
When to Stop Deploying
A common error budget policy is to stop deployments when the remaining budget drops below 50% for the current month. A fast-burn alert gives an earlier signal, before that much budget is gone:
# Single-window burn-rate alert rule (1h window, 14.4x burn rate)
groups:
  - name: slo_alerts
    rules:
      - alert: ErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (0.001 * 14.4)  # 14.4x burn rate for a 99.9% SLO
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burn rate is too high"
When this alert fires, the on-call engineer freezes deployments until the burn rate normalizes.
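The 50% budget policy itself is easy to express as a check that a deploy pipeline could call. The sketch below is one way to write it in Python. `deploys_allowed` is a hypothetical helper, and the error counts would come from your own metrics.

```python
def deploys_allowed(
    errors_so_far: int, max_errors: int, freeze_below_pct: float = 50.0
) -> bool:
    """Return False once the remaining error budget drops below the threshold."""
    if max_errors <= 0:
        raise ValueError("max_errors must be positive")
    remaining_pct = (1 - errors_so_far / max_errors) * 100
    return remaining_pct >= freeze_below_pct


# Hypothetical month with a budget of 10,000 failed requests.
print(deploys_allowed(4_000, 10_000))  # True  (60% of budget left)
print(deploys_allowed(6_000, 10_000))  # False (40% of budget left)
```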
Burn Rate Alerts: The Modern SRE Approach
Burn rate alerts tell you how fast you are consuming your error budget:
| Burn Rate | Time to Burn All Budget | Alert Severity |
|-----------|------------------------|----------------|
| 1x | 30 days | Warning (monitor) |
| 2x | 15 days | Warning (investigate) |
| 6x | 5 days | Critical (page) |
| 14.4x | 2 days | Critical (page + freeze) |
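The relationship behind the table is a simple division, sketched below in Python. The helper names are assumptions. The second function shows where a figure like 14.4x comes from: sustained for one hour, it consumes about 2% of a 30-day budget.

```python
def days_to_exhaust(burn_rate: float, window_days: int = 30) -> float:
    """Days until the whole budget is gone if the burn rate is sustained."""
    return window_days / burn_rate


def budget_consumed(burn_rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent after `hours` at this burn rate."""
    return burn_rate * hours / (window_days * 24)


for rate in (1, 2, 6, 14.4):
    print(f"{rate}x -> {days_to_exhaust(rate):.1f} days")

print(f"{budget_consumed(14.4, hours=1):.1%} of the budget in one hour")  # 2.0%
```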
Implementing Burn Rate Alerts in Prometheus
# 1-hour window, 14.4x burn rate (consumes 30-day budget in 2 days)
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5..",service="payment"}[1h]))
      /
      sum(rate(http_requests_total{service="payment"}[1h]))
    ) > 0.001 * 14.4
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate exceeds SLO burn rate"
# 5-minute window, 14.4x burn rate (catches very fast burns)
- alert: CriticalErrorSpike
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.001 * 14.4
  for: 1m
  labels:
    severity: critical
Use multiple windows (for example, a 1h long window paired with a 5m short window) to catch both fast and slow error budget consumption.
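One common way to combine windows, described in Google's SRE Workbook, is to page only when both the long and the short window exceed the burn-rate threshold, so the alert stops firing soon after the problem is fixed. The rules above evaluate each window on its own; the Python sketch below shows the combined condition instead. `should_page` and the sample error ratios are hypothetical.

```python
def should_page(
    long_window_ratio: float,
    short_window_ratio: float,
    slo: float = 0.999,
    burn_rate: float = 14.4,
) -> bool:
    """Page only if both windows are burning budget faster than the threshold."""
    threshold = (1.0 - slo) * burn_rate  # about 0.0144 for 99.9% at 14.4x
    return long_window_ratio > threshold and short_window_ratio > threshold


# Hypothetical error ratios for the 1h (long) and 5m (short) windows.
print(should_page(0.02, 0.03))    # True: sustained and still happening
print(should_page(0.02, 0.001))   # False: the last 5 minutes have recovered
print(should_page(0.001, 0.05))   # False: a brief spike, not yet sustained
```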
Conclusion
Start with one SLI for your most critical user journey—usually availability or latency of your main API. Set a realistic SLO based on historical data, not wishful thinking. Track your error budget and use it: if you're consistently within budget, ship features. If you're burning budget, freeze releases and fix reliability.
The goal isn't 100% uptime—it's making conscious trade-offs between reliability and feature velocity.
SLOs are not static — revisit them quarterly as your service and user expectations evolve. A service that started with a 99% SLO may need 99.9% as it becomes business-critical.