Error Budgets: Stop Wasting Your SRE Team's Time

Learn error budget calculation, implementation, and policies for SRE teams. Practical guide with Prometheus burn rate alerts and real examples.

June 24, 2026 · 9 min read
#sre #error-budgets #slo #prometheus #site-reliability

Introduction

You have a 99.9 percent SLO target. Your team is on-call, paged for every 500 error, and deployments freeze because error rates are slightly elevated. Meanwhile, your feature velocity has ground to a halt.

This is the opposite of what SRE should achieve.

Error budgets exist to break this cycle. They provide a clear, data-driven framework for deciding when to prioritize reliability and when to prioritize feature development. This guide covers what error budgets are, how to calculate them, how to implement them with Prometheus, and how to build an error budget policy your team will actually use.

What is an Error Budget?

An error budget is the maximum amount of time or number of failures your service can experience in a given period before violating its SLO.

The formula is simple:

Error Budget = 1 - SLO Target

For a 99.9 percent SLO over 30 days:

  • Total time: 30 days x 24 hours x 60 minutes = 43,200 minutes
  • Error budget: 43,200 x (1 - 0.999) = 43.2 minutes of downtime

That is only 43 minutes of allowed downtime per month.
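
The same arithmetic can be wrapped in a small helper. This is a minimal sketch: the function name `error_budget_minutes` and the default 30-day window are illustrative choices, not part of any library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    `slo` is a fraction (0.999 for 99.9 percent), not a percentage.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


if __name__ == "__main__":
    for target in (0.999, 0.9995, 0.9999):
        minutes = error_budget_minutes(target)
        print(f"{target:.2%} over 30 days: {minutes:.1f} minutes")
```

Passing `window_days=7` gives the budget for a weekly window instead.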

Why Error Budgets Matter

Without an error budget, every error feels critical. Teams burn out from alert fatigue. Deployments grind to a halt at the slightest degradation.

With an error budget:

  • You know exactly how much unreliability is acceptable
  • You stop paging for problems within the budget
  • You make data-driven decisions about deployment freezes
  • Engineering leadership has a clear metric for reliability investment

Calculating Error Budgets

Availability-Based Budgets

For availability SLOs, calculate the error budget in time:

30-day window:
  Error budget (minutes) = (30 x 24 x 60) x (1 - SLO)
  99.9 percent SLO = 43.2 minutes
  99.95 percent SLO = 21.6 minutes
  99.99 percent SLO = 4.3 minutes

Weekly window (rolling):
  Error budget (minutes) = (7 x 24 x 60) x (1 - SLO)
  99.9 percent SLO = 10.1 minutes
  99.99 percent SLO = 1.0 minutes

Request-Based Budgets

For request-driven services, calculate based on total requests:

Error budget = Total requests x (1 - SLO)

At 10,000 requests per minute:
  Monthly requests = 10,000 x 60 x 24 x 30 = 432,000,000
  Max errors at 99.9 percent = 432,000,000 x 0.001 = 432,000 errors
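
For a request-based budget, the useful number day to day is how much of the allowance is left. The sketch below assumes you already have the total and failed request counts for the window. The function names are hypothetical.

```python
def allowed_errors(total_requests: int, slo: float) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return total_requests * (1.0 - slo)


def budget_remaining_fraction(total_requests: int, failed_requests: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = allowed_errors(total_requests, slo)
    if budget <= 0:
        raise ValueError("no budget: check total_requests and slo")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    monthly_requests = 10_000 * 60 * 24 * 30  # 10,000 requests per minute
    print(round(allowed_errors(monthly_requests, 0.999)))
    print(budget_remaining_fraction(monthly_requests, 108_000, 0.999))
```

With 108,000 failed requests against a budget of 432,000, three quarters of the budget is still available.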

Error Budget Reference Table

| SLO Target | Downtime per Month (30 days) | Downtime per Week | Downtime per Day |
|------------|------------------------------|-------------------|------------------|
| 99% | 7h 12m | 1h 40m | 14m 24s |
| 99.5% | 3h 36m | 50m 24s | 7m 12s |
| 99.9% | 43m 12s | 10m 4s | 1m 26s |
| 99.95% | 21m 36s | 5m 2s | 43s |
| 99.99% | 4m 19s | 1m 0s | 8.6s |
| 99.999% | 26s | 6s | 0.86s |

Keep this table handy. When someone asks how much downtime you can have, you have the answer.

Implementing Error Budgets with Prometheus

Measuring Error Budget Consumption

Track error budget in Prometheus using recording rules:

# prometheus-rules.yaml
groups:
  - name: error_budget
    interval: 30s
    rules:
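      # 30-day rate of failed (5xx) requests. The "or" branch falls back to 0
      # when no 5xx series exist yet, so the ratios below do not go missing.
      - record: job:slo_errors_total:rate30d
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[30d]))
          or
          sum(rate(http_requests_total[30d])) * 0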

      - record: job:slo_requests_total:rate30d
        expr: |
          sum(rate(http_requests_total[30d]))

      - record: job:error_budget_remaining
        expr: |
          (1 - job:slo_errors_total:rate30d / job:slo_requests_total:rate30d / 0.001) * 100
        labels:
          slo: "99.9%"

      - record: job:error_budget_consumed
        expr: |
          (job:slo_errors_total:rate30d / job:slo_requests_total:rate30d) / 0.001 * 100
        labels:
          slo: "99.9%"

Visualizing Error Budget in Grafana

Query for a simple budget remaining gauge panel:

(1 - sum(rate(http_requests_total{status=~"5.."}[30d])) /
sum(rate(http_requests_total[30d]))
/ 0.001) * 100

Color the panel:

  • Green when budget is above 50 percent
  • Yellow between 0 and 50 percent
  • Red when budget is exhausted (0 percent or negative)
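
To sanity-check the panel thresholds before wiring them into Grafana, the gauge formula and the color rules can be reproduced offline. This is a sketch only: `panel_color` is a hypothetical helper, and treating exactly 50 percent as yellow is an assumption about the boundary.

```python
def budget_remaining_percent(error_ratio: float, slo: float = 0.999) -> float:
    """Mirror of the gauge query: (1 - error_ratio / (1 - SLO)) * 100."""
    return (1.0 - error_ratio / (1.0 - slo)) * 100.0


def panel_color(remaining_percent: float) -> str:
    """Map remaining budget to the panel colors listed above."""
    if remaining_percent <= 0:
        return "red"      # budget exhausted or overspent
    if remaining_percent <= 50:
        return "yellow"   # between 0 and 50 percent (boundary is an assumption)
    return "green"


if __name__ == "__main__":
    for ratio in (0.0002, 0.0008, 0.002):
        remaining = budget_remaining_percent(ratio)
        print(f"error ratio {ratio}: {remaining:.0f}% left, {panel_color(remaining)}")
```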

Add a second panel showing burn rate:

# Current burn rate (1 = on pace to spend exactly the full budget in 30 days,
# 14.4 = on pace to spend it in about 2 days)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / 0.001
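
The relationship between burn rate and time to exhaustion is a simple division. The sketch below makes it explicit, assuming the error ratio stays constant from now on. The function names are illustrative.

```python
def burn_rate(error_ratio: float, slo: float = 0.999) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate holds steady."""
    if rate <= 0:
        return float("inf")  # no errors, the budget is never spent
    return window_days / rate


if __name__ == "__main__":
    for ratio in (0.001, 0.002, 0.006, 0.0144):
        rate = burn_rate(ratio)
        print(f"error ratio {ratio}: burn rate {rate:.1f}x, "
              f"budget lasts {days_to_exhaustion(rate):.1f} days")
```

Note that this starts from a full budget. If part of the budget is already spent, the remaining time shrinks proportionally.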

Burn Rate Alerts

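Burn rate alerts tell you how fast you are consuming your error budget. Google's SRE Workbook recommends this approach over static error-rate thresholds, because the alert severity follows the actual risk to the SLO rather than an arbitrary error percentage.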

Multi-Window Burn Rate Alerting

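The Google SRE Workbook recommends alerting on several burn rates so that both fast and slow budget consumption are caught. The rules below are a simplified version: each one evaluates a single window and relies on the "for" clause to filter out short spikes, whereas the Workbook pairs every long window with a shorter one (for example 1h with 5m) so that alerts reset quickly once the error rate recovers: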

groups:
  - name: burn_rate
    rules:
      # Fast burn: 14.4x over 1 hour (consumes 30-day budget in 2 days)
      - alert: ErrorBudgetCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.001 * 14.4
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error budget critical burn rate"
          description: "Error rate is {{ $value | humanizePercentage }} (SLO: 99.9%)"

      # Slow burn: 2x over 6 hours (consumes 30-day budget in 15 days)
      - alert: ErrorBudgetWarning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > 0.001 * 2
        for: 30m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Error budget warning burn rate"
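
For comparison, the Workbook-style condition requires both a long and a short window to exceed the same burn-rate threshold. The sketch below models only that decision logic. In practice the two error ratios would come from Prometheus queries over, say, 1h and 5m; the function name is hypothetical.

```python
def should_alert(long_window_ratio: float, short_window_ratio: float,
                 burn_rate_threshold: float, slo: float = 0.999) -> bool:
    """Fire only if BOTH windows burn faster than the threshold.

    The long window shows that a meaningful share of budget is gone.
    The short window confirms that the problem is still happening.
    """
    limit = burn_rate_threshold * (1.0 - slo)
    return long_window_ratio > limit and short_window_ratio > limit


if __name__ == "__main__":
    # Ongoing incident: both windows are above 14.4x.
    print(should_alert(0.02, 0.03, 14.4))
    # Incident already over: the long window is still high, the short one has recovered.
    print(should_alert(0.02, 0.0005, 14.4))
```

The second case is where the single-window rules keep firing until the long window drains, which is the gap the short window closes.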

Burn Rate Reference

| Burn Rate | Time to Exhaust Budget | Alert Severity | Action Required |
|-----------|------------------------|----------------|-----------------|
| 1x | 30 days | None | No action |
| 2x | 15 days | Warning | Investigate during business hours |
| 6x | 5 days | Critical | Page on-call, investigate immediately |
| 14.4x | 2 days | Critical | Page on-call, freeze deploys if sustained over 30 min |

Building an Error Budget Policy

An error budget policy defines what happens when the budget is depleted. Without a policy, error budgets are just numbers on a dashboard.

Sample Policy

policy:
  service: payment-api
  slo: 99.9%
  window: 30 days (rolling)

  thresholds:
    - budget_remaining: 50%
      action: "Notify team in Slack. Review next sprint priorities."
      severity: info

    - budget_remaining: 25%
      action: "Create a reliability ticket. Assign to current sprint."
      severity: warning

    - budget_remaining: 0%
      action: "Freeze all feature deployments. Team works on reliability."
      requires: "CTO approval to unfreeze"
      severity: critical

  exceptions:
    - "Planned maintenance with 48h notice"
    - "Third-party dependency outages (tracked separately)"
    - "New feature launches: 2-week grace period with 50% SLO target"
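
A policy like this is easy to evaluate in code, which helps when you want the same thresholds to drive chat notifications and CI checks. The sketch below hard-codes the three thresholds from the sample policy. The `POLICY` structure and the `policy_action` name are assumptions made for illustration.

```python
from typing import Optional, Tuple

# (budget_remaining threshold in percent, severity, action), most severe first.
POLICY = [
    (0.0, "critical", "Freeze all feature deployments. Team works on reliability."),
    (25.0, "warning", "Create a reliability ticket. Assign to current sprint."),
    (50.0, "info", "Notify team in Slack. Review next sprint priorities."),
]


def policy_action(budget_remaining_percent: float) -> Optional[Tuple[str, str]]:
    """Return (severity, action) for the most severe threshold reached, if any."""
    for threshold, severity, action in POLICY:
        if budget_remaining_percent <= threshold:
            return severity, action
    return None  # more than 50 percent left: nothing to do


if __name__ == "__main__":
    for remaining in (80.0, 40.0, 20.0, -5.0):
        print(remaining, policy_action(remaining))
```

The exceptions in the sample policy are not modeled here. They usually need human judgment or a separate annotation on the incident.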

Automating the Deploy Freeze

Automate the freeze based on budget status:

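#!/usr/bin/env bash
# check-error-budget.sh: exits non-zero (and freezes deploys) when the budget is spent.
set -euo pipefail

# jq prints nothing if the recording rule has no data yet.
ERROR_BUDGET=$(curl -sf http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=job:error_budget_remaining' \
  | jq -r '.data.result[0].value[1] // empty')

# Fail closed: block the deploy if the budget cannot be read.
if [[ -z "$ERROR_BUDGET" ]]; then
  echo "Could not read error budget from Prometheus." >&2
  exit 2
fi

if (( $(echo "$ERROR_BUDGET <= 0" | bc -l) )); then
  echo "Error budget exhausted. Freezing deployments."
  # Switch the app to manual sync ("none" disables automated sync).
  argocd app set payment-api --sync-policy none
  curl -sf -X POST -H "Content-Type: application/json" \
    --data '{"text":"Deploy freeze: error budget exhausted for payment-api"}' \
    "$SLACK_WEBHOOK_URL"
  exit 1
fi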

In your CI/CD pipeline:

# .gitlab-ci.yml
deploy-production:
  stage: deploy
  before_script:
    - ./check-error-budget.sh
  script:
    - helm upgrade --install payment-api ./chart
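
If the gate grows beyond a few lines of shell, the decision can be moved into a small testable function. The sketch below parses the JSON body returned by the Prometheus instant-query endpoint and decides whether a deploy may proceed. Fetching the body is left out, and the function names and the fail-closed default are assumptions.

```python
import json
from typing import Optional


def budget_from_response(body: str) -> Optional[float]:
    """Extract the first sample value from an instant-query response, if present."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        return None
    results = payload.get("data", {}).get("result", [])
    if not results:
        return None
    # Instant vector samples look like: "value": [<unix timestamp>, "<value as string>"]
    return float(results[0]["value"][1])


def deploy_allowed(body: str, fail_closed: bool = True) -> bool:
    """Allow deploys only while budget remains; missing data blocks by default."""
    remaining = budget_from_response(body)
    if remaining is None:
        return not fail_closed
    return remaining > 0


if __name__ == "__main__":
    sample = ('{"status":"success","data":{"resultType":"vector","result":'
              '[{"metric":{"slo":"99.9%"},"value":[1719230000,"37.5"]}]}}')
    print(deploy_allowed(sample))
```

Failing closed means a Prometheus outage also blocks deploys. Whether that is acceptable is a policy decision, which is why it is exposed as a parameter.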

Error Budgets in Multi-Service Architectures

In a microservice architecture, error budgets become more complex. Downstream failures cascade.

Dependency Budgets

Each service has its own SLO and error budget. A downstream service burning its budget does not mean the upstream service is unhealthy:

User Service (99.9% SLO)
  |-- calls to --> Payment Service (99.95% SLO)
  |-- calls to --> Notification Service (99.5% SLO)

Track dependency budgets separately:

# User Service errors caused by Payment
sum(rate(user_service_errors{error_type="payment_timeout"}[30d]))
/
sum(rate(user_service_requests_total[30d]))

Composite Error Budgets

For customer-facing features that span multiple services, create a composite SLO:

# Checkout flow spans cart, payment, and inventory services
(
  sum(rate(checkout_errors_total[30d]))
  /
  sum(rate(checkout_requests_total[30d]))
) > 0.001
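
When choosing the target for a composite SLO, it helps to know the ceiling implied by the component services. If a flow succeeds only when every service in the chain succeeds, and failures are independent, the expected availability is the product of the individual availabilities. The sketch below uses made-up SLO values for cart, payment, and inventory; both the values and the independence assumption are illustrative.

```python
from math import prod
from typing import Iterable


def serial_availability(slos: Iterable[float]) -> float:
    """Availability of a flow that needs every listed service to succeed.

    Assumes independent failures and no retries, fallbacks, or caching.
    """
    return prod(slos)


if __name__ == "__main__":
    # Hypothetical component SLOs: cart, payment, inventory.
    components = [0.999, 0.9995, 0.995]
    composite = serial_availability(components)
    print(f"composite availability: {composite:.4%}")
```

The composite is always lower than the weakest component, so a 99.9 percent checkout target is not reachable with these inputs unless the flow tolerates some component failures.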

Common Error Budget Mistakes

Mistake 1: Setting Budgets Too Tight

A 99.99 percent SLO for a non-critical internal dashboard means 4.3 minutes of downtime per month. Every deploy becomes a stress test. Set SLOs based on user impact:

| Service Type | Suggested SLO | Rationale |
|--------------|---------------|-----------|
| Payment processing | 99.99% | Direct revenue impact |
| Core API | 99.9% | User-facing, moderate impact |
| Internal reporting | 99.5% | No customer impact |
| Background jobs | 99.0% | Retry-based, tolerant of downtime |

Mistake 2: Ignoring the Budget

An error budget that nobody checks is worthless. Put it on a dashboard visible to the whole team. Send weekly summaries to Slack.

Mistake 3: Using Averages

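A 30-day average smooths over short, severe incidents. A 10-minute total outage moves the 30-day error rate by only about 0.02 percentage points (10 of 43,200 minutes, assuming steady traffic), yet it consumes roughly 23 percent of a 99.9 percent budget. Use burn rate alerts with multiple windows instead of long-period averages.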
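
The gap between the two views is easy to quantify. The sketch below models a total outage (every request fails) under evenly distributed traffic, and compares its effect on the 30-day average with the burn rate seen by a 1-hour window that contains the outage. Both simplifications are assumptions, and the function name is hypothetical.

```python
from typing import Tuple


def outage_impact(outage_minutes: float, slo: float = 0.999,
                  window_days: int = 30) -> Tuple[float, float, float]:
    """Return (30-day error ratio, share of budget consumed, 1-hour burn rate)."""
    window_minutes = window_days * 24 * 60
    average_ratio = outage_minutes / window_minutes
    budget_consumed = average_ratio / (1.0 - slo)
    # Error ratio inside the one hour that contains the outage.
    hourly_ratio = min(outage_minutes, 60.0) / 60.0
    one_hour_burn_rate = hourly_ratio / (1.0 - slo)
    return average_ratio, budget_consumed, one_hour_burn_rate


if __name__ == "__main__":
    avg, consumed, burn = outage_impact(10)
    print(f"30-day error ratio: {avg:.4%}")
    print(f"budget consumed:    {consumed:.1%}")
    print(f"1-hour burn rate:   {burn:.0f}x")
```

The long-window average stays far below the 0.1 percent limit, while the short-window burn rate is well past any paging threshold.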

Mistake 4: Single Window for All Services

A 30-day window works for stable services. For new services or services undergoing major changes, use a 7-day window to react faster.

Conclusion

Error budgets transform reliability from an emotional debate into a data-driven decision. Instead of arguing about whether an outage was bad enough to delay a feature, you check a number.

Start small: pick one service, set a realistic SLO, measure the error budget, and put it on a dashboard. Add burn rate alerts. Define a simple policy: deploy freeze when budget hits zero.

The goal is not zero errors. The goal is spending your error budget on the things that matter most to your users.

Action items for this week:

  1. Calculate your current error rate for one critical service
  2. Set a realistic SLO (start with 99.5 percent or 99.9 percent)
  3. Create a Prometheus recording rule for error budget
  4. Add error budget gauge to your dashboard
  5. Define a simple error budget policy with your team
Author: DevToCash

Senior DevOps/SRE Engineer · 10+ years · Professional Trader (IDX, Crypto, US Equities)

I write about real infrastructure patterns and trading strategies I use in production and in live markets. No courses, no affiliate hype — just documentation of what actually works.
