
SLI vs SLO vs SLA: The Real SRE Guide with Examples in 2026

Practical SLO definition with Prometheus recording rules, error budget policies, and SLO-based alerting. Stop alerting on raw latency and start alerting on burn rate with real YAML examples.

June 28, 2026 · 4 min read
#slo #sli #sla #sre #error-budget #prometheus #reliability

Introduction

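SLOs (Service Level Objectives) are the reliability targets an SRE team commits to on behalf of its users. An SLA (Service Level Agreement) is the external contract built on top of them, usually with a looser target and consequences for missing it. Without SLOs, every incident feels urgent. With SLOs, you know whether to wake up at 3 AM or to fix it Monday morning.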

This guide covers defining SLOs in Prometheus, setting up burn rate alerts, and building error budget policies that prevent both over-alerting and under-investing in reliability.

SLIs: How to Measure the Right Things

An SLI (Service Level Indicator) is a metric that matters to users. The four golden signals are a good starting point:

  • Latency. P99 response time of HTTP requests, not the average. The users in the slowest 1% are the ones writing angry tweets.
  • Error rate. Percentage of requests returning 5xx. Exclude 4xx (client errors) unless they represent a service-side problem.
  • Throughput (traffic). Requests per second. Unexpected drops in throughput are early warning signals.
  • Saturation. CPU, memory, connection pool exhaustion. The "how full is the bucket" metrics.

Prometheus recording rules for SLIs:

groups:
  - name: sli.rules
    rules:
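      # P99 latency per job. Buckets are summed by le before taking the quantile.
      - record: job:request_latency_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

      # Share of requests answered with a 5xx status, per job.
      - record: job:error_rate:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))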

These recording rules produce clean SLI metrics — ready for SLO calculations.
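To make the error-rate rule concrete, here is a minimal Python sketch of the same ratio computed from two counter samples. The function name and sample values are hypothetical, and the sketch ignores counter resets and extrapolation, both of which Prometheus's rate() handles for you.

```python
def error_ratio(errors_start, errors_end, total_start, total_end):
    """Fraction of requests that failed between two counter samples.

    Mirrors sum(rate(errors)) / sum(rate(total)): both rates cover the same
    window, so the window length cancels and only the increases matter.
    """
    total_delta = total_end - total_start
    if total_delta <= 0:
        # No traffic in the window. Returning 0.0 is a simplifying choice
        # for this sketch; PromQL would produce NaN for 0/0.
        return 0.0
    return (errors_end - errors_start) / total_delta


# Hypothetical counter samples taken 5 minutes apart.
ratio = error_ratio(errors_start=120, errors_end=150,
                    total_start=100_000, total_end=130_000)
print(f"{ratio:.4%}")
```

Thirty failed requests out of 30,000 is an error ratio of 0.1%, which is exactly the full allowance of a 99.9% SLO.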

SLOs: Turning SLIs Into Commitments

An SLO is a target value for an SLI over a window. The canonical SRE SLO: "99.9% of requests succeed over 30 days." Define SLOs as Prometheus recording rules:

- record: job:slo_error_budget_remaining:ratio30d
  expr: |
    1 - (
      (
        sum(rate(http_requests_total{status=~"5.."}[30d]))
        /
        sum(rate(http_requests_total[30d]))
      )
      / (1 - 0.999)
    )

This ratio starts at 1.0 with no errors. When it drops to 0.5, half the budget is burned. When it reaches 0, the error budget for the 30-day window is exhausted, and a negative value means the SLO has been missed.
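The budget arithmetic is easy to check by hand. The sketch below assumes budget remaining is defined as 1 minus the observed error ratio divided by the error ratio the SLO allows. The function name and the sample error ratios are illustrative, not part of any Prometheus API.

```python
SLO_TARGET = 0.999  # 99.9% of requests succeed


def budget_remaining(error_ratio, slo_target=SLO_TARGET):
    """Share of the error budget still unspent: 1.0 = untouched, 0.0 = exhausted."""
    allowed = 1 - slo_target  # 0.001 for a 99.9% SLO
    return 1 - error_ratio / allowed


# Hypothetical 30-day error ratios, from a perfect month to a missed SLO.
for observed in (0.0, 0.0005, 0.001, 0.002):
    remaining = budget_remaining(observed)
    print(f"error ratio {observed:.4f} -> budget remaining {remaining:+.2f}")
```

An error ratio of 0.05% leaves half the budget, 0.1% leaves none, and anything above that goes negative.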

Burn Rate Alerts: Stop Alerting on Every Error

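Raw error rate alerts are noise. A brief spike may resolve in 30 seconds. Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the 30-day window. Burn rate alerts use a rolling window to alert only when error budget is being consumed fast enough to matter: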

groups:
  - name: slo.alerts
    rules:
      - alert: HighErrorBurnRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))
          > 14.4 * (1 - 0.999)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burn rate critical"

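The 14.4 multiplier: a burn rate of 14.4 sustained for one hour consumes 2% of the 30-day budget (14.4 × 1 h / 720 h = 0.02) and would exhaust it entirely in about 50 hours. For comparison, burning the whole 0.1% budget in a single hour would be a 720x burn rate. The threshold ensures only budget-threatening events trigger pages, following the alerting methodology from the Google SRE workbook.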
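The relationship between a burn rate multiplier and the budget it consumes is simple arithmetic. Here is a small sketch, assuming a 30-day (720-hour) SLO window and a constant burn rate. The helper names are hypothetical.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window = 720 hours


def budget_consumed(burn_rate, alert_hours, window_hours=WINDOW_HOURS):
    """Fraction of the total error budget spent by burning at burn_rate for alert_hours."""
    return burn_rate * alert_hours / window_hours


def hours_to_exhaustion(burn_rate, window_hours=WINDOW_HOURS):
    """Hours until a full budget is gone at a constant burn rate."""
    return window_hours / burn_rate


print(f"14.4x for 1h spends {budget_consumed(14.4, 1):.1%} of the budget")
print(f"14.4x exhausts the budget in {hours_to_exhaustion(14.4):.0f} hours")
print(f"720x exhausts the budget in {hours_to_exhaustion(720):.0f} hour")
```

Read the other way around, the multiplier is the share of budget you are willing to lose before paging, scaled by the window: 2% of the budget in 1 hour is 0.02 × 720 / 1 = 14.4.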

For slower burns, add a secondary alert:

- alert: MediumErrorBurnRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[6h]))
    /
    sum(rate(http_requests_total[6h]))
    > 6.0 * (1 - 0.999)
  for: 15m
  labels:
    severity: ticket

This opens a ticket when the 6-hour trend would exhaust the budget in about 5 days (30 days / 6), giving the team time to respond during working hours before the SLO is breached.

Error Budget Policies

Error budget policies are pre-agreed rules that govern team behavior:

  • Budget above 50%: Ship features at full speed. Reliability is within target.
  • Budget 25-50%: Freeze feature work. All engineering time goes to reliability improvements.
  • Budget below 25%: Halt all changes. Escalate to leadership. Incident response mode.

These policies prevent the "we will fix reliability next sprint" trap that burns out SRE teams.
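A policy like this is easy to encode so that dashboards or deploy tooling can apply it consistently. The sketch below is one possible mapping, assuming budget remaining is a ratio where 1.0 is a full budget. The function name and the returned labels are hypothetical, and the boundary values (exactly 50% and exactly 25%) are assigned to the middle tier as an assumption.

```python
def budget_policy(budget_remaining):
    """Map remaining error budget (1.0 = full, 0.0 = exhausted) to the agreed action."""
    if budget_remaining > 0.50:
        return "ship"    # above 50%: full feature velocity
    if budget_remaining >= 0.25:
        return "freeze"  # 25-50%: feature freeze, reliability work only
    return "halt"        # below 25%: stop all changes and escalate


for remaining in (0.80, 0.40, 0.10, -0.20):
    print(f"{remaining:+.2f} -> {budget_policy(remaining)}")
```

A negative budget (the SLO is already missed) falls into the same "halt" tier as anything below 25%.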

Multi-Window Burn Rate Alerts

Google's SRE workbook recommends multi-window alerts for high-severity SLOs. Pair a long window (1h), which shows that a significant share of the budget has been spent, with a short window one twelfth its length (5m), which confirms the burn is still happening. Both use the same threshold:

- alert: CriticalErrorBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
      > 14.4 * (1 - 0.999)
    )
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
      > 14.4 * (1 - 0.999)
    )
  for: 5m
  labels:
    severity: page

Both conditions must be true: the 1-hour burn rate is critical AND the 5-minute window confirms errors are still occurring. This removes pages for spikes that have already ended, and the alert clears within minutes of recovery instead of lingering until the errors age out of the 1-hour window.
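The "and" logic can be sketched outside Prometheus to see why a second window suppresses stale alerts. This example assumes a 99.9% SLO and takes each window as a pair of observed error ratio and burn rate threshold. The function names and the sample ratios are hypothetical.

```python
SLO_TARGET = 0.999


def burn_rate(error_ratio, slo_target=SLO_TARGET):
    """Observed error ratio expressed as a multiple of the allowed error ratio."""
    return error_ratio / (1 - slo_target)


def multi_window_fires(windows, slo_target=SLO_TARGET):
    """windows: (error_ratio, burn_rate_threshold) pairs, one per window.

    Fires only if every window exceeds its own threshold, like PromQL's "and".
    """
    return all(burn_rate(ratio, slo_target) > threshold
               for ratio, threshold in windows)


# Spike already over: one window still shows 2% errors, the other has recovered.
spike_over = multi_window_fires([(0.02, 14.4), (0.0002, 14.4)])
# Ongoing incident: both windows are well above the threshold.
ongoing = multi_window_fires([(0.02, 14.4), (0.03, 14.4)])
print(spike_over, ongoing)
```

Only the ongoing incident fires. The recovered spike still looks bad in one window but is vetoed by the other.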

For the full SLO framework — from SLI definition through error budget governance — see our SLI vs SLO vs SLA guide. For the incident management process that activates when budget burns, our SRE incident management runbook provides templates and escalation paths.

SLOs are not about perfection. They are about making data-driven decisions on when to ship features and when to fix production.

DevToCash · Author

Senior DevOps/SRE Engineer · 10+ years · Professional Trader (IDX, Crypto, US Equities)

I write about real infrastructure patterns and trading strategies I use in production and in live markets. No courses, no affiliate hype — just documentation of what actually works.
