
SLI/SLO Implementation Guide with Prometheus & Grafana

SLI/SLO implementation with Prometheus recording rules, Grafana dashboards, burn rate alerts, and error budget policies — real YAML, real dashboards.

June 29, 2026 · 7 min read
#slo#sli#sre#prometheus#grafana#error-budget#observability

Introduction

SLI/SLO theory is easy. Implementation is hard. Most teams define SLOs on a whiteboard — "99.9% uptime!" — and never wire them into their monitoring. When an incident hits, they scramble through dashboards, grep logs, and guess whether to escalate.

This guide builds a complete SLI/SLO implementation with Prometheus recording rules, Grafana dashboards, and burn rate alerts. By the end, you will have SLOs that wake you up only when it matters.

If you need the conceptual foundation first, start with our SLI vs SLO vs SLA guide. Then return here for the implementation.

Architecture Overview

The implementation pipeline has four stages:

  • Step 1: SLI measurement. Prometheus recording rules compute latency P99, error rate, and throughput from raw application metrics.
  • Step 2: SLO calculation. Prometheus rules compute error budget remaining from SLI data.
  • Step 3: Grafana visualization. SLO dashboards show budget burn rate, remaining budget, and compliance over time.
  • Step 4: Alerting. Prometheus alert rules trigger when burn rate exceeds threshold.

All of this lives in a single slo-rules.yml file and a single Grafana dashboard JSON, with no external services and no SaaS dependency.

Step 1: SLI Recording Rules

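Create slo-rules.yml in your Prometheus config directory. Start with four request-based SLIs: latency, error rate, availability, and throughput. These cover three of the four golden signals (latency, errors, traffic); saturation needs resource metrics and is not covered here.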

groups:
  - name: sli.rules
    interval: 30s
    rules:
      - record: job:latency_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))

      - record: job:error_rate:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      - record: job:availability:ratio5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            /
            sum(rate(http_requests_total[5m])) by (job)
          )

      - record: job:throughput:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

Key design decisions:

  • by (job) ensures per-service SLIs. One set of rules serves all your services. Change job to service or namespace to match your label topology.
  • histogram_quantile(0.99, ...) not average. Users at P99 experience your worst performance. Average latency hides the tail.
  • availability is error_rate inverted. "99.9% available" is clearer for stakeholders than "0.1% error rate."
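Because the P99 rule works on histogram buckets, the value it records is an estimate interpolated inside one bucket, not an exact percentile. The following is a simplified sketch of that linear interpolation, assuming sorted cumulative buckets ending in +Inf and no negative bounds; the bucket counts are invented for illustration and the function is not Prometheus's actual implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified sketch: buckets are sorted by upper bound, the last bucket
    is +Inf, and the lowest bucket is assumed to start at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for http_request_duration_seconds_bucket.
buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
p50 = histogram_quantile(0.50, buckets)
p99 = histogram_quantile(0.99, buckets)
print(f"p50 ~ {p50:.3f}s, p99 ~ {p99:.3f}s")
```

The practical consequence is that bucket boundaries should sit close to your latency target: with only 0.5 and 1.0 as neighbouring bounds, any P99 between them is a straight-line guess.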

These rules produce four clean SLI metrics per service. Apply them:

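# Plain Prometheus: list slo-rules.yml under rule_files in prometheus.yml, then validate and reload
promtool check rules slo-rules.yml
curl -X POST http://localhost:9090/-/reload   # requires --web.enable-lifecycle
# Prometheus Operator: a raw rules file is not a Kubernetes manifest, so wrap the
# groups in a PrometheusRule resource first (the file name below is only an example)
kubectl apply -f slo-prometheusrule.yml -n monitoring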

Verify on the rules page of the Prometheus UI (under the Status menu) that all rules are evaluating without errors.
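The same check can be scripted against the Prometheus HTTP API, which exposes rule state at /api/v1/rules. The sketch below assumes Prometheus is reachable at http://localhost:9090 (an assumption, adjust to your setup); the sample payload is invented and only mirrors the fields the helper reads.

```python
import json
import urllib.request


def fetch_rules(base_url="http://localhost:9090"):
    """GET /api/v1/rules and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules", timeout=10) as resp:
        return json.load(resp)


def unhealthy_rules(payload):
    """Return (group, rule, health) for every rule whose health is not 'ok'."""
    problems = []
    for group in payload["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("health") != "ok":
                problems.append((group["name"], rule["name"], rule.get("health")))
    return problems


# Offline sample shaped like the API response (values are made up).
sample = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "sli.rules",
                "rules": [
                    {"name": "job:latency_seconds:p99", "health": "ok"},
                    {"name": "job:error_rate:ratio5m", "health": "ok"},
                ],
            },
            {
                "name": "slo.rules",
                "rules": [{"name": "job:slo:burn_rate_1h", "health": "err"}],
            },
        ]
    },
}

print(unhealthy_rules(sample))
```

In CI or a deploy hook you would call unhealthy_rules(fetch_rules()) and fail the step if the list is not empty.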

Step 2: SLO Calculation Rules

SLOs answer: "How much of our error budget is left this month?" Add these rules to the same file:

  - name: slo.rules
    interval: 60s
    rules:
      # PromQL has no named constants, so publish the target as a series.
      # Add one slo_target rule per job; payment-service matches the dashboards below.
      - record: slo_target
        expr: vector(0.999)
        labels:
          job: payment-service

      - record: job:slo:error_budget_remaining_ratio
        expr: |
          1 - (
            (
              sum(rate(http_requests_total{status=~"5.."}[30d])) by (job)
              /
              sum(rate(http_requests_total[30d])) by (job)
            )
            /
            (1 - slo_target)
          )

      - record: job:slo:burn_rate_1h
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) by (job)
            /
            sum(rate(http_requests_total[1h])) by (job)
          )
          /
          (1 - slo_target)

      - record: job:slo:burn_rate_6h
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h])) by (job)
            /
            sum(rate(http_requests_total[6h])) by (job)
          )
          /
          (1 - slo_target)

Explanation of the math:

  • slo_target holds your SLO (e.g., 0.999 for 99.9%). PromQL has no named constants, so it is recorded as a series with a job label; that label is what lets it match the per-job error ratios.
  • Error budget remaining ratio = 1 - (actual error ratio / allowed error ratio), where the allowed error ratio is 1 - slo_target. It starts at 1, hits 0 when the budget is exhausted, and goes negative once the budget is overspent.
  • Burn rate = actual error ratio / allowed error ratio. A burn rate of 1 means consuming budget exactly on schedule, so it runs out at the end of the 30-day window. A burn rate of 14.4 means consuming 14.4x too fast: one hour at that rate uses 2% of the monthly budget, and the whole budget is gone in about 50 hours (30 days / 14.4).
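As a worked check of the arithmetic, the sketch below takes burn rate as the observed error ratio divided by the allowed error ratio (1 - target) and budget remaining as 1 minus the same quotient over the full window. The function names and sample numbers are illustrative assumptions, not part of the rules file.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(error_ratio, slo_target):
    """How many times faster than 'on schedule' the budget is being spent."""
    return error_ratio / (1 - slo_target)


def budget_remaining(window_error_ratio, slo_target):
    """Fraction of the error budget left, given the error ratio over the whole window."""
    return 1 - window_error_ratio / (1 - slo_target)


def hours_to_exhaustion(rate):
    """Hours until a full budget is gone if this burn rate is sustained."""
    return WINDOW_HOURS / rate


def budget_consumed(rate, hours):
    """Fraction of the window's budget spent by burning at `rate` for `hours`."""
    return rate * hours / WINDOW_HOURS


target = 0.999
print(burn_rate(0.0144, target))         # 1.44% errors against a 0.1% budget
print(budget_remaining(0.0004, target))  # 0.04% errors over the 30-day window
print(hours_to_exhaustion(14.4))
print(budget_consumed(14.4, 1))
```

This is where the 14.4 threshold comes from: it is the burn rate that spends 2% of a 30-day budget in a single hour.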

The burn rate metrics are the foundation of meaningful alerting. Instead of "error rate > 1%," you alert on "burn rate > 14.4 over the last hour": a rate and a window that together mean "this is consuming budget fast enough to matter."

For the complete error budget methodology, our error budgets SRE guide explains the theory behind these numbers and how to negotiate SLOs with product teams.

Step 3: Grafana Dashboard

Create a Grafana dashboard with these panels:

Panel 1: SLO Compliance Gauge

PromQL queries for a gauge that reads, for example, "99.9% SLO: 87% of error budget remaining this month":

# Current error budget remaining (as percentage)
100 * job:slo:error_budget_remaining_ratio{job="payment-service"}

# SLO target line (constant)
100 * 0.999

Configure the gauge to show green above 50%, yellow 25-50%, red below 25%.

Panel 2: Burn Rate Over Time

# 1-hour burn rate (spike detection)
job:slo:burn_rate_1h{job="payment-service"}

# 6-hour burn rate (sustained detection)
job:slo:burn_rate_6h{job="payment-service"}

Add horizontal thresholds at 1 (expected), 6 (significant), and 14.4 (critical). When the 1h line crosses 14.4 while the 6h line is above 6, the page alert from Step 4 should fire.

Panel 3: Error Budget Timeline

# Error budget remaining over the last 30 days
job:slo:error_budget_remaining_ratio{job="payment-service"}

# Projected budget 30 days from now (linear extrapolation of the last 30 days)
predict_linear(
  job:slo:error_budget_remaining_ratio{job="payment-service"}[30d],
  30 * 86400
)

The predict_linear query fits a straight line to the last 30 days of budget data and reports where the budget would be 30 days from now if that trend continues. If the projected value is below 0, the budget is on course to run out within that period and you have a problem.
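To make the projection concrete, here is a minimal sketch of the idea behind predict_linear: a least-squares line through the samples, evaluated some seconds past the last one. It approximates the function's behaviour rather than reproducing Prometheus's code, and the sample series (budget falling 5 points per day) is invented.

```python
DAY = 86400


def fit_line(samples):
    """Least-squares slope and intercept for (timestamp, value) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    return slope, mean_v - slope * mean_t


def predict_linear(samples, seconds_ahead):
    """Value of the fitted line seconds_ahead after the last sample."""
    slope, intercept = fit_line(samples)
    return slope * (samples[-1][0] + seconds_ahead) + intercept


def seconds_until_zero(samples):
    """Seconds after the last sample at which the fitted line reaches 0."""
    slope, intercept = fit_line(samples)
    return -intercept / slope - samples[-1][0]


# Hypothetical budget series: 100% on day 0, dropping 5 points per day for 10 days.
samples = [(day * DAY, 1.0 - 0.05 * day) for day in range(11)]

projected = predict_linear(samples, 30 * DAY)
days_left = seconds_until_zero(samples) / DAY
print(f"projected budget in 30 days: {projected:.2f}, days until empty: {days_left:.1f}")
```

A projected value of -1.0 does not mean much as a number; the useful signal is its sign and how soon the fitted line reaches zero.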

Panel 4: SLI Detail (Error Rate, Latency, Throughput)

Three individual panels showing the raw SLIs:

# Error rate
job:error_rate:ratio5m{job="payment-service"}

# P99 latency
job:latency_seconds:p99{job="payment-service"}

# Throughput
job:throughput:rate5m{job="payment-service"}

Export the dashboard as JSON from Grafana and commit it to your infrastructure repo. Infrastructure as code includes dashboards.

Step 4: Burn Rate Alerting Rules

Prometheus alert rules trigger when burn rate exceeds critical thresholds:

groups:
  - name: slo.alerts
    rules:
      - alert: SLOErrorBudgetCriticalBurn
        expr: |
          job:slo:burn_rate_1h > 14.4
          and
          job:slo:burn_rate_6h > 6.0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "SLO error budget burning critically for {{ $labels.job }}"
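          # $labels only holds label values; other series must be fetched with the query function.
          description: |
            1h burn rate: {{ $value | humanize }}
            6h burn rate: {{ with printf "job:slo:burn_rate_6h{job='%s'}" $labels.job | query }}{{ . | first | value | humanize }}{{ end }}
            Error budget remaining: {{ with printf "job:slo:error_budget_remaining_ratio{job='%s'}" $labels.job | query }}{{ . | first | value | humanizePercentage }}{{ end }}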

      - alert: SLOErrorBudgetWarning
        expr: |
          job:slo:burn_rate_6h > 1.0
          and
          job:slo:error_budget_remaining_ratio < 0.90
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "SLO error budget burning above expected for {{ $labels.job }}"

Why multi-window alerts?

The and clause requires both the 1h and 6h burn rates to be elevated, which cuts false positives. A 2-minute spike at 15x burn rate, which could be a transient network blip, will NOT trigger the alert: under steady traffic it averages out to a 1h burn rate of about 0.5, and the 6h average is lower still. Only sustained budget consumption triggers pages.

This is the core insight from Google's SRE alerting methodology: alert on symptoms, not causes, and only when budget is at risk.
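The effect of the two windows can be checked with a small simulation. The sketch below assumes steady traffic, one error-ratio sample per minute, a 99.9% target, and the 14.4 / 6.0 thresholds from the rule above; it ignores the for: 5m hold time, and the three scenarios are invented.

```python
BUDGET = 1 - 0.999  # allowed error ratio for a 99.9% SLO


def burn_rate(per_minute_error_ratio, window_minutes):
    """Average error ratio over the last window_minutes, divided by the budget."""
    recent = per_minute_error_ratio[-window_minutes:]
    return (sum(recent) / len(recent)) / BUDGET


def should_page(series):
    """Mirror of the page condition: 1h burn > 14.4 AND 6h burn > 6.0."""
    return burn_rate(series, 60) > 14.4 and burn_rate(series, 360) > 6.0


# Six hours of per-minute error ratios for three hypothetical scenarios.
blip = [0.0] * 358 + [0.015] * 2        # 2 minutes at 15x burn
burst = [0.0] * 350 + [0.10] * 10       # 10 minutes at 10% errors
sustained = [0.0] * 180 + [0.02] * 180  # 3 hours at 2% errors

for name, series in [("blip", blip), ("burst", burst), ("sustained", sustained)]:
    print(name, round(burn_rate(series, 60), 2), round(burn_rate(series, 360), 2),
          should_page(series))
```

The burst case is the interesting one: its 1h burn rate is above 14.4, but the 6h window has not yet caught up, so the page is held back until the problem proves to be sustained.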

Step 5: Error Budget Policy

With SLOs instrumented, define what happens when budget burns:

# error-budget-policy.yml — stored in git alongside slo-rules
policies:
  - name: production-critical
    slo_target: 0.999
    windows: [30d]
    actions:
      - condition: "error_budget_remaining > 0.50"
        action: "feature_velocity_normal"
        description: "Ship features. Reliability on track."
      - condition: "error_budget_remaining > 0.25 and error_budget_remaining <= 0.50"
        action: "feature_freeze"
        description: "All eng time to reliability improvements. Escalate to EM."
      - condition: "error_budget_remaining <= 0.25"
        action: "incident_mode"
        description: "Halt all changes. Incident commander on-call."

This is not automation — it is a social contract stored in git, next to the SLO definitions. When the Grafana dashboard shows red, the team knows what to do without a meeting.
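Even as a human agreement, the tier boundaries should be unambiguous. The helper below is a hypothetical sketch (not part of the policy file) that maps a remaining-budget ratio to the action names used above, evaluating the tiers from healthiest to worst; it could back a dashboard annotation or a chat reminder.

```python
def policy_action(error_budget_remaining):
    """Return the policy action for a remaining-budget ratio (1.0 = untouched)."""
    if error_budget_remaining > 0.50:
        return "feature_velocity_normal"
    if error_budget_remaining > 0.25:
        return "feature_freeze"
    # Covers 0.25 and below, including a negative (overspent) budget.
    return "incident_mode"


for remaining in (0.87, 0.50, 0.30, 0.25, -0.10):
    print(f"{remaining:+.2f} -> {policy_action(remaining)}")
```

Note that exactly 50% remaining already falls into the feature freeze tier, and exactly 25% into incident mode, matching the strict comparisons in the policy.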

Production Considerations

  • Run SLO rules on a dedicated Prometheus. SLO queries span 30-day windows, which can be expensive. Use a separate Prometheus instance with at least 30 days of retention, or back it with long-term storage such as Thanos, Cortex, or Mimir.
  • Downsampling for long windows. Querying 30-day raw data for 50 services creates load. Use recording rules with 6h resolution for the 30-day window and combine with 5m resolution for the 1h window.
  • Multi-tenancy. If you run SLOs for multiple teams, use label-based tenancy in Grafana. Each team sees only their job labels.
  • GitOps everything. Prometheus rules, Grafana dashboards, and error budget policies live in git. Changes go through PR review — no manual edits in the UI.

For teams implementing distributed tracing alongside SLOs, our OpenTelemetry setup guide covers instrumenting applications with traces that correlate with these SLO metrics. When latency SLO burns, follow the trace to find the slow service.

SLO Implementation Checklist

  • SLI recording rules applied and evaluating (verify in Prometheus UI)
  • SLO calculation rules producing error budget metrics
  • Grafana dashboard showing compliance gauge, burn rate, and error budget timeline
  • Multi-window burn rate alerts configured and tested (trigger a canary alert)
  • Error budget policy documented in git
  • On-call team understands: "when SLO alert fires, stop features and fix reliability"

SLOs are not a monitoring feature — they are an engineering discipline. When every service has a measured, alertable SLO, the question "should we wake up at 3 AM for this?" has a data-driven answer.

DevToCash · Author

Senior DevOps/SRE Engineer · 10+ years · Professional Trader (IDX, Crypto, US Equities)

I write about real infrastructure patterns and trading strategies I use in production and in live markets. No courses, no affiliate hype — just documentation of what actually works.
