SLI vs SLO vs SLA: Real SRE Guide with Examples

Introduction

Every engineering team talks about uptime. "We need five nines." "Our SLA is 99.9%." "We hit our SLO this quarter."

But ask most engineers what an SLO actually means — mathematically, operationally, legally — and the confidence drops fast. Ask them what SLI their SLO is based on, and you will get a blank stare or a hand-wavy "uh, latency, I guess."

This confusion has real costs:

Teams define SLOs against meaningless SLIs — like "overall uptime" of a system that has 47 microservices, five of which are critical and the rest are decorative.
They set unrealistic targets because "five nines sounds good" without understanding the error budget that target implies.
They confuse SLOs (internal reliability targets) with SLAs (external, often legal commitments) and end up over-engineering for contracts that don't require it.

This guide fixes that. You will learn:

The precise definition of SLI, SLO, and SLA — and the differences that matter.
How to choose the right SLIs for real production services.
How to set achievable SLOs with math that works.
How SLAs relate to SLOs (spoiler: they are not the same).
Real Prometheus examples for measuring SLIs and burning down SLOs.
The common mistakes teams make — and how to avoid them.

Let's start with a single analogy that makes everything click.

The Analogy: Speedometer, Speed Limit, Traffic Ticket

Think of a car journey.

SLI is the speedometer. It measures something — how fast you are going right now. It is raw data. "The 95th percentile latency of the checkout endpoint over the last 5 minutes was 342 ms." That is an SLI.
SLO is the speed limit. It says: "95th percentile latency should be under 500 ms over a 30-day rolling window." That is your target. You can choose to drive faster (higher risk) or slower (more cautious).
SLA is the traffic ticket. If you violate the speed limit for too long, you pay a penalty. An SLA says: "If 95th percentile latency exceeds 500 ms for more than 0.1% of the month, we credit the customer 5% of their bill."

The speedometer tells you the current value. The speed limit tells you where you want to be. The ticket tells you what happens if you fail.

SLI: The Raw Measurement

Definition

A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the service you care about.

The key words are "carefully defined." A bad SLI definition leads to bad SLOs. A good SLI definition is specific, measurable, and meaningful to users.

The Four Golden Signals

Google's SRE literature defines four golden signals. Every service should have SLIs in at least these categories:

Signal	What it measures	Example SLI
Latency	How long it takes to respond	"95th percentile HTTP response time for GET /api/orders"
Traffic	How much demand is placed on the system	"Requests per second to the web frontend"
Errors	How many requests fail	"Ratio of HTTP 500 responses to total requests"
Saturation	How "full" the system is	"CPU utilization percentage across the cluster"

Most teams stop at latency and errors. That is a good start but incomplete. Saturation, in particular, is a leading indicator — if you only measure it when errors spike, you are always reacting.

Choosing Good SLIs

A good SLI has three properties:

User-visible. Measure what the user experiences, not what the infrastructure is doing. If the database is having replication lag but users are not affected, that is an ops concern, not an SLI. If users are affected (stale data, timeouts), then it becomes an SLI.
Measurable consistently. You need to collect the same measurement the same way every time. "Latency" is not an SLI. "p95 of the last 30 seconds of HTTP request duration measured server-side" is an SLI.
Actionable. If the SLI goes bad, someone should know what to do about it. "Number of times the database is restarted" is measurable but, alone, tells you nothing about what to fix.

Examples of Good vs Bad SLIs

❌ Bad SLI	Why it's bad	✅ Good SLI	Why it's better
"System uptime"	A monolith in a VM is up? That tells you nothing about responsiveness.	"Ratio of successful HTTP requests (2xx) to total requests over 1-minute windows"	Measures what users actually experience.
"Average latency"	Averages hide outliers. 99% of requests in 10ms, 1% in 30s — average is still ~300ms, looks fine.	"p99 HTTP latency over 5-minute rolling windows"	Captures the tail, which is what users feel.
"CPU usage"	CPU at 100% does not necessarily mean poor user experience.	"p99 latency when CPU > 80% vs p99 latency when CPU < 80%"	Ties infrastructure to user experience.

Defining SLIs in Prometheus

Assume you have a service exposing metrics via /metrics. To measure request duration:

# Request duration histogram — already exposed by your instrumentation
http_request_duration_seconds_bucket{job="checkout-service", le="0.1"}
http_request_duration_seconds_bucket{job="checkout-service", le="0.25"}
http_request_duration_seconds_bucket{job="checkout-service", le="0.5"}
http_request_duration_seconds_bucket{job="checkout-service", le="1.0"}
http_request_duration_seconds_bucket{job="checkout-service", le="+Inf"}
http_request_duration_seconds_count{job="checkout-service"}
http_request_duration_seconds_sum{job="checkout-service"}

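Your latency SLI at p99 over the last 5 minutes, aggregated across all instances of the job (without sum by (le) you get one quantile per instance):

histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout-service"}[5m]))
)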
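To make the estimate less of a black box, here is a minimal Python sketch of the linear interpolation that the Prometheus documentation describes for histogram_quantile. It is a simplified illustration for 0 < q < 1, not Prometheus's actual implementation; the function name estimate_quantile and the bucket counts are hypothetical.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile (0 < q < 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the +Inf bucket (math.inf). Simplified sketch only.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)

# Hypothetical cumulative counts for the bucket bounds listed above.
buckets = [(0.1, 900), (0.25, 960), (0.5, 990), (1.0, 998), (math.inf, 1000)]
p95 = estimate_quantile(0.95, buckets)
p999 = estimate_quantile(0.999, buckets)
print(p95, p999)
```

With these counts the p99.9 estimate is capped at the 1.0 s bound because the slowest requests sit in the +Inf bucket. That is why bucket boundaries should bracket the latency threshold you plan to use in your SLO.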

Your error ratio SLI over the last 5 minutes — the fraction of requests that returned HTTP 5xx:

(
  sum(rate(http_requests_total{job="checkout-service", status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="checkout-service"}[5m]))
)

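Your availability SLI: the fraction of 1-minute windows in which the error ratio was under a threshold (e.g., under 1%). The bool modifier makes the comparison return 1 or 0 instead of filtering out series, so avg_over_time yields the fraction of good windows:

avg_over_time(
  (
    (
      sum(rate(http_requests_total{job="checkout-service", status=~"5.."}[1m]))
      /
      sum(rate(http_requests_total{job="checkout-service"}[1m]))
    ) < bool 0.01
  )[5m:1m]
)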

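This last one is important. "Availability" measured as "number of good windows / total windows" is a windows-based (time-slice) SLI, an approach that Google Cloud's SLO monitoring also supports alongside request-based SLIs. Do not confuse it with burn rate, which measures how fast the error budget is being consumed and is covered in the alerting section below.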
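The same windows-based calculation can be sketched offline in Python. This is a minimal illustration under assumed inputs: the per-minute (errors, total) counts are made up, and good_window_fraction is a hypothetical helper, not part of any library.

```python
def good_window_fraction(windows, max_error_ratio=0.01):
    """Return the fraction of windows whose error ratio is under the threshold.

    windows: list of (error_count, total_count) pairs, one per 1-minute window.
    Windows with no traffic are skipped here; whether to count them as good
    is a policy choice this sketch does not settle.
    """
    measured = [(errors, total) for errors, total in windows if total > 0]
    if not measured:
        return None
    good = sum(1 for errors, total in measured if errors / total < max_error_ratio)
    return good / len(measured)

# Five hypothetical 1-minute windows: (5xx responses, all responses).
windows = [(0, 1200), (3, 1100), (40, 1000), (0, 0), (1, 900)]
availability = good_window_fraction(windows)
print(availability)
```

One bad minute out of four measured minutes gives 75% availability, however many requests failed inside that minute. That property is what makes the windows-based view easy to translate into "minutes of downtime".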

SLO: The Target

Definition

A Service Level Objective (SLO) is a target value or range for an SLI over a specified measurement window.

Example: "p99 latency of the checkout service stays under 500 ms for 99.9% of 1-minute windows in any rolling 30-day period."

That sentence contains:

The SLI (p99 latency)
The threshold (under 500 ms)
The measurement window (1-minute windows)
The compliance period (30 rolling days)
The target (99.9% of windows good)

The Error Budget

The most important concept in SRE. If your SLO says "99.9% good," then 0.1% of measurement windows can be bad. That 0.1% is your error budget.

For a 30-day period:

Total 1-minute windows: 30 days × 1440 minutes/day = 43,200 windows
Allowed bad windows at 99.9%: 43,200 × 0.001 = 43 windows
Allowed bad windows at 99.99%: 43,200 × 0.0001 = 4 windows
Allowed bad windows at 99.999% (five nines): 43,200 × 0.00001 = 0.4 windows (meaning you can barely afford any outage)

SLO Target	Minutes you can be down per month	Realistic?
99% ("two nines")	432 min (7.2 hours)	Easy for most services
99.9% ("three nines")	43 min	Achievable
99.95%	22 min	Good for critical services
99.99% ("four nines")	4.3 min	Hard — requires automation and redundancy
99.999% ("five nines")	26 seconds	Almost impossible without multi-region active-active
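The figures in this table can be reproduced with a few lines of Python. This sketch assumes a 30-day month, as the table does; allowed_downtime_minutes is a hypothetical helper name. Decimal is used so that the percentages are not distorted by floating-point rounding.

```python
from decimal import Decimal

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200

def allowed_downtime_minutes(slo_percent, period_minutes=MINUTES_PER_30_DAYS):
    """Error budget for the period, in minutes, given an SLO such as '99.9'."""
    budget_fraction = (Decimal(100) - Decimal(slo_percent)) / Decimal(100)
    return period_minutes * budget_fraction

for target in ["99", "99.9", "99.95", "99.99", "99.999"]:
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} min ({minutes * 60:.0f} s)")
```

Each extra nine divides the budget by ten, which is why the jump from 99.9% to 99.99% usually demands architectural changes rather than just more careful operations.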

The error budget changes team behaviour. When the budget is healthy, teams deploy with confidence. When it is running low, teams become conservative — they throttle deployments, add testing, strengthen canary checks. This is the error budget policy.

Setting SLOs: A Practical Approach

Do not start with 99.9% because "it sounds right." Start with data.

Step 1: Collect your SLIs for at least 2–4 weeks.

Before you set a target, you need to know where you are. Run Prometheus, instrument your services, and let the data accumulate.

# Add OpenTelemetry instrumentation to your app
# Example: Python with OpenTelemetry
pip install opentelemetry-distro opentelemetry-exporter-prometheus

Step 2: Determine the worst acceptable performance.

Ask product owners: "What is the slowest response time that would make you consider the service broken?" Not the ideal speed — the worst acceptable.

For an API: "If p99 latency exceeds 1 second for more than 5 minutes, users complain." For a payment service: "Any failed transaction is unacceptable — 100% of requests must succeed." For a background job: "If it doesn't complete within 2 hours, the morning report is late."

Step 3: Add headroom.

Your SLO should be stricter than the absolute worst acceptable. If "p99 under 1 second" is the hard limit, set your SLO at p99 under 800 ms. The one exception is a demand for 100%: no real system achieves it, and it leaves no error budget at all. If zero failed transactions is the ideal, set your SLO at a 99.95% success rate, which gives you a small error budget to handle bad deployments.

Step 4: Run it for a month and adjust.

The first SLO you set will be wrong. That is normal. Track it for 30 days, see how often you burn through the budget, and adjust.
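Steps 1 to 3 can be approximated with a short script once you have a few weeks of data. The sketch below is one possible approach under stated assumptions: history is a hypothetical list of per-window p99 latencies (in seconds) exported from your monitoring system, and both helper names are made up for illustration.

```python
def compliance(history, threshold_s):
    """Fraction of windows whose p99 latency was at or under the threshold."""
    return sum(1 for p99 in history if p99 <= threshold_s) / len(history)

def achievable_targets(history, threshold_s, candidates=(0.99, 0.999, 0.9999)):
    """Return the candidate targets that past performance would have met."""
    observed = compliance(history, threshold_s)
    return [target for target in candidates if observed >= target]

# Hypothetical history: 2,000 windows, 3 of them slower than 800 ms.
history = [0.35] * 1997 + [0.95, 1.4, 2.1]
observed = compliance(history, threshold_s=0.8)
targets = achievable_targets(history, threshold_s=0.8)
print(f"observed compliance: {observed:.4f}, achievable targets: {targets}")
```

In this made-up data set the service met an 800 ms threshold in 99.85% of windows, so a 99% target is supported but 99.9% is not. Starting at 99.9% would mean missing the SLO from day one.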

Monitoring SLOs in Prometheus

You need to track your burn rate — how quickly you are consuming the error budget.

# SLO compliance for p99 latency under 500ms
# Good for 30-day period, evaluating over 1-minute windows

# Step 1: Which 1-minute windows are "bad"?
(
  histogram_quantile(
    0.99,
    sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout-service"}[1m]))
  )
  > 0.5  # 500ms
)

# Step 2: Fraction of bad 1-minute windows over the last 30 days
# "bool" turns the comparison into 1 (good) or 0 (bad) so it can be averaged.
# A 30d subquery is expensive; consider a recording rule for the inner expression.
1 - (
  avg_over_time(
    (
      histogram_quantile(
        0.99,
        sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout-service"}[1m]))
      )
      <= bool 0.5
    )[30d:1m]
  )
)

A value of 0.002 means that 0.2% of the 1-minute windows in the last 30 days were bad. If your SLO is 99.9% (a 0.1% budget), you have consumed 200% of your error budget and are in trouble.
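The arithmetic behind that statement, as a small Python sketch. budget_consumed is a hypothetical helper; it takes the fraction of bad windows returned by the query and the SLO target expressed as a fraction.

```python
def budget_consumed(bad_fraction, slo_target):
    """Share of the error budget used, where 1.0 means fully spent."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return bad_fraction / error_budget

used = budget_consumed(bad_fraction=0.002, slo_target=0.999)
print(f"{used:.0%} of the error budget consumed")
```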

Alerting on Error Budget Burn Rate

Do not alert on raw latency spikes. Alert on the rate at which you are burning through your error budget.

Burn rate	What it means	Action
< 0.5x	Budget is being consumed slowly — normal operations.	No alert.
1x	Exactly on target.	Monitor but no action.
2x	Consuming budget twice as fast as planned.	Investigate within 24 hours.
5x	Serious degradation.	Page the on-call engineer within 30 minutes.
10x+	Critical incident.	War room. Immediate response.
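Burn rate is the observed error ratio divided by the error ratio the SLO allows. A minimal Python sketch of that calculation and of the thresholds in the table follows. The function names are hypothetical, and the table does not say how to treat values between its rows, so the cut-offs at 2, 5, and 10 are an assumption.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)

def days_until_exhausted(rate, period_days=30):
    """Days a full budget would last if this burn rate were sustained."""
    return float("inf") if rate <= 0 else period_days / rate

def action_for(rate):
    """Map a burn rate to the actions in the table (cut-offs are assumed)."""
    if rate >= 10:
        return "war room"
    if rate >= 5:
        return "page on-call"
    if rate >= 2:
        return "investigate within 24 hours"
    return "no alert"

rate = burn_rate(error_ratio=0.006, slo_target=0.999)  # 0.6% errors, 99.9% SLO
print(f"{rate:.1f}x, budget gone in {days_until_exhausted(rate):.1f} days: {action_for(rate)}")
```

A 0.6% error ratio looks small in isolation, but against a 99.9% SLO it spends a 30-day budget in about five days, which is why the alert threshold is expressed as a multiple of the budget rather than as a raw error percentage.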

Example Prometheus alert rule for a 5x burn rate, assuming a 99.9% SLO (0.1% error budget). A 5x burn rate means the error ratio exceeds 5 × 0.001 = 0.005, measured here over the last 30 minutes:

groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            sum(rate(http_requests_total{job="checkout-service", status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{job="checkout-service"}[30m]))
          ) > (5 * 0.001)  # 5x burn rate against a 99.9% SLO
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at >5x rate"
          description: "Error rate {{ $value | humanizePercentage }} over last 30 minutes"

SLA: The Contract

Definition

A Service Level Agreement (SLA) is a formal, legally enforceable contract between a service provider and a customer. It specifies:

The SLIs and SLOs the provider commits to.
The measurement methodology.
The penalties or credits if the SLO is not met.

SLAs are external. SLOs are internal. That distinction is critical.

SLA vs SLO: The Key Differences

	SLO	SLA
Audience	Internal engineering team	External customers
Purpose	Guide operational decisions	Contractual commitment
Consequence	Process changes, deployment throttle	Financial penalties, legal liability
Strictness	You can miss an SLO temporarily	Missing an SLA costs real money
Flexibility	Can be adjusted weekly	Hard to change — written into contracts
Target	Usually tighter than the SLA	Usually looser than the SLO

The SLO Margin

Smart teams set their internal SLO stricter than their external SLA. The gap is your safety margin.

SLA to customer:  99.9% availability
Internal SLO:     99.95% availability <-- buffer of 0.05%
                  ^^^^^^^^
                  You have 22 min of downtime allowance

This margin means:

You will miss the internal SLO long before you miss the external SLA.
You have time to react before customers experience a contract violation.
You only promise customers a target you can comfortably beat, so you do not have to build (and pay for) infrastructure that contractually guarantees the stricter number.

When SLAs Go Wrong

The most common SLA mistake: committing to something you cannot measure.

Example: "We guarantee p99 latency under 100ms." Sounds great. But if you measure latency from your load balancer (inside the data centre) and the customer measures it from their browser in rural Australia — those are different numbers. Your SLA needs to specify exactly where and how latency is measured.

Second mistake: committing to an SLA that is too tight for your architecture. A single-region deployment cannot realistically offer five nines. A database with a single primary cannot survive a failover without a brief blip. Your SLA must reflect your architecture's actual failure modes.

Putting It All Together: A Real Example

Let us walk through a real scenario. You run a payment service called payment-svc that processes credit card transactions.

Step 1: Define Your SLIs

# Payment service SLIs
sli_latency_p99:  "p99 latency of POST /api/charge over 5-minute windows"
sli_error_rate:   "Ratio of HTTP 5xx responses to total requests over 5-minute windows"
sli_throughput:   "Successful transactions per second"
sli_saturation:   "gRPC connection pool utilization percentage"

Step 2: Set Your Internal SLOs

Based on historical data and product requirements:

# Month-rolling SLOs for payment-svc
slo_latency_p99:
  target: 99.9%             # 0.1% bad windows allowed
  threshold: 300ms          # p99 under 300ms

slo_error_rate:
  target: 99.99%            # 0.01% bad windows allowed
  threshold: 0.001          # less than 0.1% error rate per window

slo_uptime:
  target: 99.95%            # based on simple request success count over 30d

Step 3: Define Your External SLA

Based on business requirements and what competitors offer:

# Customer SLA — intentionally looser than internal SLOs
sla_uptime:        99.9%    # 43 minutes downtime per month
sla_error_rate:    99.9%    # 0.1% error rate
penalty:           "5% monthly credit per 0.1% below SLA, max 50% credit"

Notice the gap: internal SLO for error rate is 99.99%, external SLA is 99.9%. That gives the team a 10× margin to absorb incidents before customers are impacted.
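The penalty clause can be expressed as a small function. This is a sketch of one reading of the clause: it assumes every started 0.1 percentage point below the SLA earns a 5% credit, which the clause itself does not specify, and sla_credit_percent is a hypothetical name. Real contracts define the rounding explicitly.

```python
from decimal import Decimal, ROUND_CEILING

def sla_credit_percent(measured, sla="99.9", step="0.1", credit_per_step=5, cap=50):
    """Monthly credit (percent of bill) for a measured availability percentage."""
    shortfall = Decimal(sla) - Decimal(measured)
    if shortfall <= 0:
        return 0  # SLA met, no credit owed
    # Assumption: a partial step counts as a full step.
    steps = int((shortfall / Decimal(step)).to_integral_value(rounding=ROUND_CEILING))
    return min(cap, steps * credit_per_step)

for measured in ["99.95", "99.85", "99.75", "98.0"]:
    print(measured, sla_credit_percent(measured))
```

Percentages are passed as strings and handled with Decimal so that a value such as 99.8 is compared exactly against the 0.1-point steps instead of through binary floating point.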

Step 4: Measure and Alert

Your monitoring dashboard shows:

Time period	Good windows	Total windows	Compliance
Last 24h	1,439	1,440	99.93%
Last 7d	10,067	10,080	99.87%
Last 30d	43,157	43,200	99.90%

The 7-day window is 99.87% — below the 99.9% SLO. The team knows they need to investigate. But the 30-day SLA target (99.9%) is barely being met. No customer penalty yet, but one more incident will push it under.

The alert rule catches this:

# Alert if 7-day compliance drops below 99.9%
# A window is "bad" when its error ratio exceeds 0.1%; "bool" yields 1 (bad) or 0 (good)
avg_over_time(
  (
    (
      sum(rate(http_requests_total{job="payment-svc", status=~"5.."}[1m]))
      /
      sum(rate(http_requests_total{job="payment-svc"}[1m]))
    ) > bool 0.001
  )[7d:1m]
)
> 0.001  # More than 0.1% bad windows

The alert pages the on-call engineer, who investigates, finds a newly deployed service that is not properly handling database connection timeouts, and rolls back the deployment. The error budget stops burning and recovers as the bad windows age out of the rolling period.

Common Mistakes and How to Avoid Them

Mistake 1: Too Many SLIs

Teams measure everything — p50, p90, p95, p99, p99.9 of every endpoint, error rates by status code, by region, by instance. The result: alert fatigue and no clear picture.

Fix: Pick 3–5 SLIs per critical service. The golden signals are a good starting point. Add more only when you find a specific gap.

Mistake 2: SLOs Based on Averages

"Average latency" and "average availability" hide the real story. A service can have 99.9% "average uptime" over a month while being completely down for individual users.

Fix: Use percentiles (p95, p99) for latency. Use the "good windows" approach for availability — count windows where the service was good, not the average value.

Mistake 3: Identical SLOs for Every Service

A critical payment service and a background report generator should not share the same target. If you set all services at 99.99%, you are either over-engineering the background job or under-engineering the payment service.

Fix: Classify services by criticality — Tier 1 (customer-facing, revenue-critical), Tier 2 (important but not urgent), Tier 3 (internal tools, batch jobs). Set different SLOs per tier.

Tier	Example	SLO target	On-call response
1	Payment service, API gateway	99.95%	15-minute page
2	Reporting service, admin dashboard	99.9%	1-hour page
3	Internal data sync, ETL	99%	Next business day

Mistake 4: Setting SLOs Without Error Budget Policy

An SLO without an error budget policy is just a dashboard number. If you miss it, nothing happens differently. That defeats the entire purpose.

Fix: Write a one-page error budget policy:

Who decides when to stop deployments (typically the SRE lead or on-call).
At what budget level deployments stop (e.g., "deployments frozen when budget < 10% remaining").
How the budget resets (e.g., at the start of each calendar month).
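Such a policy is easiest to enforce when it is encoded somewhere a deployment pipeline can check it. A minimal sketch, assuming the 10% freeze threshold from the example above and hypothetical helper names; how the window counts are obtained (for example, from a Prometheus query) is left out.

```python
def remaining_budget(bad_windows, total_windows, slo_target):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = total_windows * (1.0 - slo_target)
    return 1.0 - bad_windows / allowed_bad

def deployments_allowed(remaining, freeze_below=0.10):
    """Apply the policy: freeze deployments when less than 10% of budget remains."""
    return remaining >= freeze_below

# Hypothetical month so far: 30 bad windows out of 43,200 against a 99.9% SLO.
remaining = remaining_budget(bad_windows=30, total_windows=43_200, slo_target=0.999)
print(f"{remaining:.0%} of budget left, deploy: {deployments_allowed(remaining)}")
```

A CI job could call a check like this before each release and fail the pipeline when it returns False, which turns the written policy into something that actually changes behaviour.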

Mistake 5: Confusing SLA with SLO

Committing the same target to customers that you use internally means zero margin for error. One incident = one missed SLA = financial penalties.

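Fix: Always set your internal SLO tighter than your external SLA. Aim for an internal error budget at least 2× smaller (99.95% internally against a 99.9% SLA), and up to 10× smaller for revenue-critical services such as the payment example above. The cost of running slightly better infrastructure is almost always less than the cost of paying SLA penalties.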

Actionable Takeaways

Start with 3–5 SLIs per critical service. Use the golden signals (latency, traffic, errors, saturation). Add more only when you find gaps.
Use percentiles, not averages. p99 latency and the "good windows" approach for availability. Averages will lie to you.
Set SLOs based on data, not intuition. Collect SLIs for at least 2 weeks before defining targets. The first SLO you set will be wrong — adjust it.
Always set internal SLOs tighter than external SLAs. The gap is your safety margin. Aim for an internal error budget at least 2× smaller than the one the SLA allows.
Alert on error budget burn rate, not raw metrics. A latency spike that lasts 30 seconds is noise. A 10× burn rate sustained for 30 minutes is an incident.
Write an error budget policy. Define explicitly: when deployments stop, who decides, and how the budget resets.
Classify services by criticality. Tier 1 (99.95%), Tier 2 (99.9%), Tier 3 (99%). Do not apply one SLO to everything.

Need to implement SLIs in your infrastructure? Check out our Prometheus Monitoring Setup Guide and OpenTelemetry Tutorial for production-ready instrumentation.
