Introduction
Every engineering team talks about uptime. "We need five nines." "Our SLA is 99.9%." "We hit our SLO this quarter."
But ask most engineers what an SLO actually means — mathematically, operationally, legally — and the confidence drops fast. Ask them what SLI their SLO is based on, and you will get a blank stare or a hand-wavy "uh, latency, I guess."
This confusion has real costs:
- Teams define SLOs against meaningless SLIs — like "overall uptime" of a system that has 47 microservices, five of which are critical and the rest are decorative.
- They set unrealistic targets because "five nines sounds good" without understanding the reliability budget.
- They confuse SLOs (internal reliability targets) with SLAs (external, often legal commitments) and end up over-engineering for contracts that don't require it.
This guide fixes that. You will learn:
- The precise definition of SLI, SLO, and SLA — and the differences that matter.
- How to choose the right SLIs for real production services.
- How to set achievable SLOs with math that works.
- How SLAs relate to SLOs (spoiler: they are not the same).
- Real Prometheus examples for measuring SLIs and burning down SLOs.
- The common mistakes teams make — and how to avoid them.
Let's start with a single analogy that makes everything click.
The Analogy: Speedometer, Speed Limit, Traffic Ticket
Think of a car journey.
- SLI is the speedometer. It measures something — how fast you are going right now. It is raw data. "The 95th percentile latency of the checkout endpoint over the last 5 minutes was 342 ms." That is an SLI.
- SLO is the speed limit. It says: "95th percentile latency should be under 500 ms over a 30-day rolling window." That is your target. You can choose to drive faster (higher risk) or slower (more cautious).
- SLA is the traffic ticket. If you violate the speed limit for too long, you pay a penalty. An SLA says: "If 95th percentile latency exceeds 500 ms for more than 0.1% of the month, we credit the customer 5% of their bill."
The speedometer tells you the current value. The speed limit tells you where you want to be. The ticket tells you what happens if you fail.
SLI: The Raw Measurement
Definition
A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the service you care about.
The key words are "carefully defined." A bad SLI definition leads to bad SLOs. A good SLI definition is specific, measurable, and meaningful to users.
The Four Golden Signals
Google's SRE literature defines four golden signals. Every service should have SLIs in at least these categories:
| Signal | What it measures | Example SLI |
|---|---|---|
| Latency | How long it takes to respond | "95th percentile HTTP response time for GET /api/orders" |
| Traffic | How much demand is placed on the system | "Requests per second to the web frontend" |
| Errors | How many requests fail | "Ratio of HTTP 500 responses to total requests" |
| Saturation | How "full" the system is | "CPU utilization percentage across the cluster" |
Most teams stop at latency and errors. That is a good start but incomplete. Saturation, in particular, is a leading indicator — if you only measure it when errors spike, you are always reacting.
Choosing Good SLIs
A good SLI has three properties:
- User-visible. Measure what the user experiences, not what the infrastructure is doing. If the database is having replication lag but users are not affected, that is an ops concern, not an SLI. If users are affected (stale data, timeouts), then it becomes an SLI.
- Measurable consistently. You need to collect the same measurement the same way every time. "Latency" is not an SLI. "p95 of the last 30 seconds of HTTP request duration measured server-side" is an SLI.
- Actionable. If the SLI goes bad, someone should know what to do about it. "Number of times the database is restarted" is measurable but, alone, tells you nothing about what to fix.
Examples of Good vs Bad SLIs
| ❌ Bad SLI | Why it's bad | ✅ Good SLI | Why it's better |
|---|---|---|---|
| "System uptime" | A monolith in a VM is up? That tells you nothing about responsiveness. | "Ratio of successful HTTP requests (2xx) to total requests over 1-minute windows" | Measures what users actually experience. |
| "Average latency" | Averages hide outliers. 99% of requests in 10ms, 1% in 30s — average is still ~300ms, looks fine. | "p99 HTTP latency over 5-minute rolling windows" | Captures the tail, which is what users feel. |
| "CPU usage" | CPU at 100% does not necessarily mean poor user experience. | "p99 latency when CPU > 80% vs p99 latency when CPU < 80%" | Ties infrastructure to user experience. |
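To make the "averages hide outliers" row concrete, here is a minimal Python sketch. The sample is hypothetical (985 fast requests and 15 slow ones, slightly more than the 1% tail in the table so that the p99 clearly lands in the slow group), and `percentile` is an illustrative nearest-rank helper, not a library function.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 985 fast requests (10 ms) and 15 slow ones (30 s).
latencies_ms = [10.0] * 985 + [30_000.0] * 15

mean_ms = statistics.mean(latencies_ms)  # dominated by neither group, so it looks "fine"
p50_ms = percentile(latencies_ms, 50)    # the typical request
p99_ms = percentile(latencies_ms, 99)    # the tail that users actually notice

print(f"mean={mean_ms:.2f} ms  p50={p50_ms:.0f} ms  p99={p99_ms:.0f} ms")
```

The mean sits under half a second while the p99 is 30 seconds. Note that with exactly 1% slow requests a nearest-rank p99 falls right on the boundary and can still report the fast value, which is one reason to look at more than one percentile.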
Defining SLIs in Prometheus
Assume you have a service exposing metrics via /metrics. To measure request duration:
# Request duration histogram — already exposed by your instrumentation
http_request_duration_seconds_bucket{job="checkout-service", le="0.1"}
http_request_duration_seconds_bucket{job="checkout-service", le="0.25"}
http_request_duration_seconds_bucket{job="checkout-service", le="0.5"}
http_request_duration_seconds_bucket{job="checkout-service", le="1.0"}
http_request_duration_seconds_bucket{job="checkout-service", le="+Inf"}
http_request_duration_seconds_count{job="checkout-service"}
http_request_duration_seconds_sum{job="checkout-service"}
Your latency SLI at p99 over the last 5 minutes:
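# Aggregate buckets across instances by "le" so you get one quantile for the whole service
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout-service"}[5m]))
)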
Your error ratio SLI over the last 5 minutes — the fraction of requests that returned HTTP 5xx:
(
sum(rate(http_requests_total{job="checkout-service", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="checkout-service"}[5m]))
)
Your availability SLI — the fraction of 1-minute windows where error ratio was under a threshold (e.g., under 1%):
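# "< bool" returns 1 for a good window and 0 for a bad one; a bare "<" would only filter the series
avg_over_time(
  (
    (
      sum(rate(http_requests_total{job="checkout-service", status=~"5.."}[1m]))
      /
      sum(rate(http_requests_total{job="checkout-service"}[1m]))
    ) < bool 0.01
  )[5m:1m]
)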
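This last one is important. Measuring availability as "number of good windows / total windows" is a common windows-based approach (Google Cloud's SLO monitoring, for example, distinguishes windows-based SLIs from request-based ones). It is not the same thing as burn rate, which describes how fast the error budget is being consumed and is covered in the alerting section below.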
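The windows-based idea can be sketched outside Prometheus as well. The following Python snippet is an illustration only: `window_availability` and the sample ratios are hypothetical, and the 1% threshold mirrors the example above.

```python
def window_availability(error_ratios, threshold=0.01):
    """Fraction of windows whose error ratio is strictly below the threshold."""
    if not error_ratios:
        raise ValueError("need at least one window")
    good = sum(1 for ratio in error_ratios if ratio < threshold)
    return good / len(error_ratios)


# Hypothetical per-minute error ratios for a five-minute span.
ratios = [0.0, 0.002, 0.05, 0.0, 0.009]

availability = window_availability(ratios)  # only the 0.05 window counts as bad
print(f"{availability:.0%} of windows were good")
```

A window is either good or bad, regardless of how bad it was. That is the design trade-off: a minute at 5% errors and a minute at 100% errors cost the same one window.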
SLO: The Target
Definition
A Service Level Objective (SLO) is a target value or range for an SLI over a specified measurement window.
Example: "p99 latency of the checkout service stays under 500 ms for 99.9% of 1-minute windows in any rolling 30-day period."
That sentence contains:
- The SLI (p99 latency)
- The threshold (under 500 ms)
- The measurement window (1-minute windows)
- The compliance period (30 rolling days)
- The target (99.9% of windows good)
The Error Budget
The most important concept in SRE. If your SLO says "99.9% good," then 0.1% of measurement windows can be bad. That 0.1% is your error budget.
For a 30-day period:
- Total 1-minute windows: 30 days × 1440 minutes/day = 43,200 windows
- Allowed bad windows at 99.9%: 43,200 × 0.001 = 43 windows
- Allowed bad windows at 99.99%: 43,200 × 0.0001 = 4 windows
- Allowed bad windows at 99.999% (five nines): 43,200 × 0.00001 = 0.4 windows (meaning you can barely afford any outage)
| SLO Target | Minutes you can be down per month | Realistic? |
|---|---|---|
| 99% ("two nines") | 432 min (7.2 hours) | Easy for most services |
| 99.9% ("three nines") | 43 min | Achievable |
| 99.95% | 22 min | Good for critical services |
| 99.99% ("four nines") | 4.3 min | Hard — requires automation and redundancy |
| 99.999% ("five nines") | 26 seconds | Almost impossible without multi-region active-active |
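The figures in the table follow directly from the window arithmetic above. A small Python sketch, assuming a 30-day period and a hypothetical helper name `allowed_bad_minutes`:

```python
def allowed_bad_minutes(slo_percent, days=30):
    """Minutes of error budget in the period for a given SLO target (in percent)."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_bad_minutes(target)
    print(f"{target}% -> {minutes:.2f} min ({minutes * 60:.0f} s)")
```

Changing `days` shows how the same target buys a different absolute budget over a week or a quarter.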
The error budget changes team behaviour. When the budget is healthy, teams deploy with confidence. When it is running low, teams become conservative — they throttle deployments, add testing, strengthen canary checks. This is the error budget policy.
Setting SLOs: A Practical Approach
Do not start with 99.9% because "it sounds right." Start with data.
Step 1: Collect your SLIs for at least 2–4 weeks.
Before you set a target, you need to know where you are. Run Prometheus, instrument your services, and let the data accumulate.
# Add OpenTelemetry instrumentation to your app
# Example: Python with OpenTelemetry
pip install opentelemetry-distro opentelemetry-exporter-prometheus
Step 2: Determine the worst acceptable performance.
Ask product owners: "What is the slowest response time that would make you consider the service broken?" Not the ideal speed — the worst acceptable.
For an API: "If p99 latency exceeds 1 second for more than 5 minutes, users complain." For a payment service: "Any failed transaction is unacceptable — 100% of requests must succeed." For a background job: "If it doesn't complete within 2 hours, the morning report is late."
Step 3: Add headroom.
Your SLO should be stricter than the absolute worst acceptable. If "p99 under 1 second" is the hard limit, set your SLO at p99 under 800 ms. If zero failed transactions is the ideal, set your SLO at 99.95% success rate (giving you a small error budget to handle bad deployments).
Step 4: Run it for a month and adjust.
The first SLO you set will be wrong. That is normal. Track it for 30 days, see how often you burn through the budget, and adjust.
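One way to ground Steps 1 to 4 in data is to replay a few candidate thresholds against the history you collected and see how often each would have been met. This is a rough sketch: the per-window p99 values are invented, and `historical_compliance` is a hypothetical helper.

```python
def historical_compliance(window_p99s_ms, thresholds_ms):
    """For each candidate threshold, the fraction of past windows that would have been good."""
    total = len(window_p99s_ms)
    return {
        threshold: sum(1 for p99 in window_p99s_ms if p99 <= threshold) / total
        for threshold in thresholds_ms
    }


# Hypothetical p99 latency (ms) observed in ten past windows.
history = [220, 310, 450, 900, 380, 290, 510, 330, 270, 1200]

results = historical_compliance(history, thresholds_ms=[500, 800, 1000])
for threshold, fraction in results.items():
    print(f"p99 <= {threshold} ms held in {fraction:.0%} of windows")
```

If the threshold product owners asked for was only met 70% of the time historically, a 99.9% target on it is a wish rather than an SLO; either the service or the target has to move first.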
Monitoring SLOs in Prometheus
You need to track your burn rate — how quickly you are consuming the error budget.
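# SLO compliance for p99 latency under 500 ms
# 30-day period, evaluated over 1-minute windows
# (a [1m] rate needs at least two samples per minute, so this assumes a scrape interval of 30s or less)

# Step 1: Which 1-minute windows are "bad"? Returns a series only while p99 > 500 ms
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout-service"}[1m]))
) > 0.5

# Step 2: Fraction of bad 1-minute windows over the last 30 days
# "<= bool" yields 1 for a good window and 0 for a bad one, so the average is the good fraction
1 - avg_over_time(
  (
    histogram_quantile(
      0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout-service"}[1m]))
    ) <= bool 0.5
  )[30d:1m]
)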
A value of 0.002 means that 0.2% of the 1-minute windows in the last 30 days were bad. If your SLO is 99.9% (a 0.1% budget), you have used 200% of your error budget, so the SLO is already missed.
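The conversion from "fraction of bad windows" to "share of budget used" is a single division. A minimal sketch, with `budget_consumed` as a hypothetical helper name:

```python
def budget_consumed(bad_fraction, slo):
    """Share of the error budget used, where 1.0 means the budget is exactly spent."""
    budget = 1 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return bad_fraction / budget


used = budget_consumed(bad_fraction=0.002, slo=0.999)
print(f"{used:.0%} of the error budget consumed")
```

Anything above 100% means the SLO for the period has been missed, no matter what happens for the rest of the window.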
Alerting on Error Budget Burn Rate
Do not alert on raw latency spikes. Alert on the rate at which you are burning through your error budget.
| Burn rate | What it means | Action |
|---|---|---|
| < 0.5x | Budget is being consumed slowly — normal operations. | No alert. |
| 1x | Exactly on target. | Monitor but no action. |
| 2x | Consuming budget twice as fast as planned. | Investigate within 24 hours. |
| 5x | Serious degradation. | Page the on-call engineer within 30 minutes. |
| 10x+ | Critical incident. | War room. Immediate response. |
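Burn rate is the observed bad ratio divided by the budgeted bad ratio. The Python sketch below shows that relationship, how long a 30-day budget lasts at a given rate, and the error-ratio threshold that corresponds to a chosen burn rate. The function names are illustrative and the 99.9% SLO is an assumption.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than planned the error budget is being spent."""
    return error_ratio / (1 - slo)


def hours_to_exhaustion(rate, period_days=30):
    """Hours until a full budget is gone if the burn rate stays constant."""
    return period_days * 24 / rate


def error_ratio_threshold(rate, slo):
    """Error ratio that corresponds to a given burn rate, usable as an alert threshold."""
    return rate * (1 - slo)


slo = 0.999  # assumed target
print(burn_rate(0.005, slo))           # about 5: five times faster than planned
print(hours_to_exhaustion(5.0))        # 144 hours, i.e. six days
print(error_ratio_threshold(5, slo))   # about 0.005
```

At 1x the budget lasts exactly the compliance period; at 10x it is gone in three days, which is why the higher rows of the table call for faster responses.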
Example Prometheus alert rule for a 5x burn rate sustained over 30 minutes:
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorBudgetBurnRate
        # Assumes a 99.9% SLO: the budget is 0.001, so a 5x burn rate means an error ratio above 0.005
        expr: |
          (
            sum(rate(http_requests_total{job="checkout-service", status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{job="checkout-service"}[30m]))
          ) > (5 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at >5x rate"
          description: "Error ratio {{ $value | humanizePercentage }} over last 30 minutes"
SLA: The Contract
Definition
A Service Level Agreement (SLA) is a formal, legally enforceable contract between a service provider and a customer. It specifies:
- The SLIs and SLOs the provider commits to.
- The measurement methodology.
- The penalties or credits if the SLO is not met.
SLAs are external. SLOs are internal. That distinction is critical.
SLA vs SLO: The Key Differences
| | SLO | SLA |
|---|---|---|
| Audience | Internal engineering team | External customers |
| Purpose | Guide operational decisions | Contractual commitment |
| Consequence | Process changes, deployment throttle | Financial penalties, legal liability |
| Strictness | You can miss an SLO temporarily | Missing an SLA costs real money |
| Flexibility | Can be adjusted weekly | Hard to change; written into contracts |
| Target | Usually tighter than the SLA | Usually looser than the SLO |
The SLO Margin
Smart teams set their internal SLO stricter than their external SLA. The gap is your safety margin.
SLA to customer:  99.9%  availability  -> 43 min of downtime allowed per 30 days
Internal SLO:     99.95% availability  -> 22 min of downtime allowed per 30 days
Buffer:           0.05%                -> about 22 min between missing the SLO and breaching the SLA
This margin means:
- You will miss the internal SLO long before you miss the external SLA.
- You have time to react before customers experience a contract violation.
- You can keep your infrastructure simpler (and cheaper) than if you had to guarantee the strictest target.
When SLAs Go Wrong
The most common SLA mistake: committing to something you cannot measure.
Example: "We guarantee p99 latency under 100ms." Sounds great. But if you measure latency from your load balancer (inside the data centre) and the customer measures it from their browser in rural Australia — those are different numbers. Your SLA needs to specify exactly where and how latency is measured.
Second mistake: committing to an SLO that is too tight for your architecture. A single-region deployment cannot realistically offer five nines. A database with a single primary cannot survive a failover without a brief blip. Your SLA must reflect your architecture's actual failure modes.
Putting It All Together: A Real Example
Let us walk through a real scenario. You run a payment service called payment-svc that processes credit card transactions.
Step 1: Define Your SLIs
# Payment service SLIs
sli_latency_p99: "p99 latency of POST /api/charge over 5-minute windows"
sli_error_rate: "Ratio of HTTP 5xx responses to total requests over 5-minute windows"
sli_throughput: "Successful transactions per second"
sli_saturation: "gRPC connection pool utilization percentage"
Step 2: Set Your Internal SLOs
Based on historical data and product requirements:
# Month-rolling SLOs for payment-svc
slo_latency_p99:
  target: 99.9%      # 0.1% bad windows allowed
  threshold: 300ms   # p99 under 300ms
slo_error_rate:
  target: 99.99%     # 0.01% bad windows allowed
  threshold: 0.001   # less than 0.1% error rate per window
slo_uptime:
  target: 99.95%     # based on simple request success count over 30d
Step 3: Define Your External SLA
Based on business requirements and what competitors offer:
# Customer SLA — intentionally looser than internal SLOs
sla_uptime: 99.9% # 43 minutes downtime per month
sla_error_rate: 99.9% # 0.1% error rate
penalty: "5% monthly credit per 0.1% below SLA, max 50% credit"
Notice the gap: internal SLO for error rate is 99.99%, external SLA is 99.9%. That gives the team a 10× margin to absorb incidents before customers are impacted.
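The penalty clause can be turned into a small calculation. This Python sketch rests on one assumption the clause does not spell out: a partial 0.1-point shortfall is rounded up to a full step. `sla_credit_percent` is a hypothetical name, and a real contract would define the rounding explicitly.

```python
import math


def sla_credit_percent(measured_percent, sla_percent=99.9,
                       step=0.1, credit_per_step=5.0, cap=50.0):
    """Monthly credit (%) under '5% per 0.1% below SLA, max 50%'.

    Assumption: any partial step below the SLA counts as a full step.
    """
    shortfall = sla_percent - measured_percent
    if shortfall <= 0:
        return 0.0
    # round() absorbs floating-point noise before taking the ceiling
    steps = math.ceil(round(shortfall / step, 6))
    return min(cap, steps * credit_per_step)


for measured in (99.95, 99.8, 99.75, 98.0):
    print(f"measured {measured}% -> credit {sla_credit_percent(measured)}%")
```

Running the numbers this way before signing makes the cost of a bad month explicit, which is the point of keeping the internal SLO well inside the SLA.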
Step 4: Measure and Alert
Your monitoring dashboard shows:
| Time period | Good windows | Total windows | Compliance |
|---|---|---|---|
| Last 24h | 1,439 | 1,440 | 99.93% |
| Last 7d | 10,067 | 10,080 | 99.87% |
| Last 30d | 43,156 | 43,200 | 99.90% |
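The 7-day window is at 99.87%, below the 99.9% target, so the team knows they need to investigate. The 30-day figure displays as 99.90% only because of rounding: 43,156 / 43,200 is 99.898%, which is 44 bad windows against a budget of 43.2. Whether that already triggers a customer credit depends on how the contract measures and rounds, but there is no margin left, and one more incident will push it clearly under.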
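The dashboard figures can be re-derived from the good and total window counts. The sketch below uses the numbers from the table and a 99.9% target; the helper names are illustrative.

```python
def compliance_percent(good, total):
    """Share of good windows, in percent."""
    return 100 * good / total


def bad_windows_allowed(total, target_percent):
    """Number of bad windows the target tolerates over `total` windows."""
    return total * (100 - target_percent) / 100


dashboard = {"24h": (1_439, 1_440), "7d": (10_067, 10_080), "30d": (43_156, 43_200)}

for period, (good, total) in dashboard.items():
    bad = total - good
    allowed = bad_windows_allowed(total, 99.9)
    status = "within budget" if bad <= allowed else "over budget"
    print(f"{period}: {compliance_percent(good, total):.3f}%  "
          f"bad={bad} allowed={allowed:.2f}  {status}")
```

Printing three decimals instead of two makes it visible that the 30-day row is fractionally under 99.9% rather than exactly on it.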
The alert rule catches this:
# Alert if more than 0.1% of 1-minute windows in the last 7 days were bad
# "> bool" yields 1 for a bad window and 0 for a good one, so the average is the bad fraction
avg_over_time(
  (
    (
      sum(rate(http_requests_total{job="payment-svc", status=~"5.."}[1m]))
      /
      sum(rate(http_requests_total{job="payment-svc"}[1m]))
    ) > bool 0.001
  )[7d:1m]
)
> 0.001
The alert pages the team. They investigate, find a newly deployed service that does not handle database connection timeouts properly, and roll back the deployment. The budget stops burning, and compliance recovers as the bad windows age out of the rolling period.
Common Mistakes and How to Avoid Them
Mistake 1: Too Many SLIs
Teams measure everything — p50, p90, p95, p99, p99.9 of every endpoint, error rates by status code, by region, by instance. The result: alert fatigue and no clear picture.
Fix: Pick 3–5 SLIs per critical service. The golden signals are a good starting point. Add more only when you find a specific gap.
Mistake 2: SLOs Based on Averages
"Average latency" and "average availability" hide the real story. A service can have 99.9% "average uptime" over a month while being completely down for individual users.
Fix: Use percentiles (p95, p99) for latency. Use the "good windows" approach for availability — count windows where the service was good, not the average value.
Mistake 3: Identical SLOs for Every Service
A critical payment service and a background report generator should not share the same target. If you set all services at 99.99%, you are either over-engineering the background job or under-engineering the payment service.
Fix: Classify services by criticality — Tier 1 (customer-facing, revenue-critical), Tier 2 (important but not urgent), Tier 3 (internal tools, batch jobs). Set different SLOs per tier.
| Tier | Example | SLO target | On-call response |
|---|---|---|---|
| 1 | Payment service, API gateway | 99.95% | 15-minute page |
| 2 | Reporting service, admin dashboard | 99.9% | 1-hour page |
| 3 | Internal data sync, ETL | 99% | Next business day |
Mistake 4: Setting SLOs Without Error Budget Policy
An SLO without an error budget policy is just a dashboard number. If you miss it, nothing happens differently. That defeats the entire purpose.
Fix: Write a one-page error budget policy:
- Who decides when to stop deployments (typically the SRE lead or on-call).
- At what budget level deployments stop (e.g., "deployments frozen when budget < 10% remaining").
- How the budget resets (e.g., at the start of each calendar month).
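A policy like this is easier to enforce when the freeze rule is executable. The following is a sketch under stated assumptions: the 10% freeze level comes from the example above, the window counts are invented, and both function names are hypothetical.

```python
def budget_remaining(bad_windows, total_windows, slo):
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed = total_windows * (1 - slo)
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1 - bad_windows / allowed


def deployments_frozen(remaining, freeze_below=0.10):
    """Apply the policy rule: freeze when less than 10% of the budget is left."""
    return remaining < freeze_below


# Hypothetical month so far: 40 bad 1-minute windows against a 99.9% SLO.
remaining = budget_remaining(bad_windows=40, total_windows=43_200, slo=0.999)
print(f"{remaining:.1%} of budget left, frozen={deployments_frozen(remaining)}")
```

A check like this could run in a deployment pipeline so that the freeze is applied consistently, while the policy document still names the person who can override it.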
Mistake 5: Confusing SLA with SLO
Committing the same target to customers that you use internally means zero margin for error. One incident = one missed SLA = financial penalties.
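Fix: Always set your internal SLO tighter than your external SLA. Aim for at least a 2× smaller error budget (99.95% internally against a 99.9% SLA), and up to 10× for revenue-critical paths. The cost of running slightly better infrastructure is almost always less than the cost of paying SLA penalties.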
Actionable Takeaways
- Start with 3–5 SLIs per critical service. Use the golden signals (latency, traffic, errors, saturation). Add more only when you find gaps.
- Use percentiles, not averages. p99 latency and the "good windows" approach for availability. Averages will lie to you.
- Set SLOs based on data, not intuition. Collect SLIs for at least 2 weeks before defining targets. The first SLO you set will be wrong, so adjust it.
- Always set internal SLOs tighter than external SLAs. The gap is your safety margin. At least 2× on error budgets.
- Alert on error budget burn rate, not raw metrics. A latency spike that lasts 30 seconds is noise. A 10× burn rate sustained for 30 minutes is an incident.
- Write an error budget policy. Define explicitly: when deployments stop, who decides, and how the budget resets.
- Classify services by criticality. Tier 1 (99.95%), Tier 2 (99.9%), Tier 3 (99%). Do not apply one SLO to everything.