Introduction
You have a 99.9 percent SLO target. Your team is on-call, paged for every 500 error, and deployments freeze because error rates are slightly elevated. Meanwhile, your feature velocity has ground to a halt.
This is the opposite of what SRE should achieve.
Error budgets exist to break this cycle. They provide a clear, data-driven framework for deciding when to prioritize reliability and when to prioritize feature development. This guide covers what error budgets are, how to calculate them, how to implement them with Prometheus, and how to build an error budget policy your team will actually use.
What is an Error Budget?
An error budget is the maximum amount of downtime, or the maximum number of failed requests, that your service can accumulate in a given period without violating its SLO.
The formula is simple:
Error Budget = 1 - SLO Target
For a 99.9 percent SLO over 30 days:
- Total time: 30 days x 24 hours x 60 minutes = 43,200 minutes
- Error budget: 43,200 x (1 - 0.999) = 43.2 minutes of downtime
That is only 43 minutes of allowed downtime per month.
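The same arithmetic can be sketched in a few lines of Python. The function name `error_budget_minutes` and its parameters are illustrative and do not come from any library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability SLO.

    slo is a fraction (0.999 for 99.9 percent); window_days is the SLO window.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


if __name__ == "__main__":
    # Print the monthly budget for a few common targets
    for target in (0.999, 0.9995, 0.9999):
        print(f"{target:.2%} over 30 days: {error_budget_minutes(target):.1f} min")
```

Passing `window_days=7` gives the weekly figure for the same target.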
Why Error Budgets Matter
Without an error budget, every error feels critical. Teams burn out from alert fatigue. Deployments grind to a halt at the slightest degradation.
With an error budget:
- You know exactly how much unreliability is acceptable
- You stop paging for problems within the budget
- You make data-driven decisions about deployment freezes
- Engineering leadership has a clear metric for reliability investment
Calculating Error Budgets
Availability-Based Budgets
For availability SLOs, calculate the error budget in time:
30-day window:
Error budget (minutes) = (30 x 24 x 60) x (1 - SLO)
99.9 percent SLO = 43.2 minutes
99.95 percent SLO = 21.6 minutes
99.99 percent SLO = 4.3 minutes
Weekly window (rolling):
Error budget (minutes) = (7 x 24 x 60) x (1 - SLO)
99.9 percent SLO = 10.1 minutes
99.99 percent SLO = 1.0 minute
Request-Based Budgets
For request-driven services, calculate based on total requests:
Error budget = Total requests x (1 - SLO)
At 10,000 requests per minute:
Monthly requests = 10,000 x 60 x 24 x 30 = 432,000,000
Max errors at 99.9 percent = 432,000,000 x 0.001 = 432,000 errors
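As a rough sketch, the request-based budget can be tracked as a count of failures you are still allowed. The names `allowed_failures` and `failures_left` are made up for this example.

```python
def allowed_failures(total_requests: int, slo: float) -> int:
    """Failed requests the SLO tolerates, rounded to the nearest whole request."""
    return round(total_requests * (1 - slo))


def failures_left(total_requests: int, failed_requests: int, slo: float) -> int:
    """Remaining failures before the SLO is violated (negative once exceeded)."""
    return allowed_failures(total_requests, slo) - failed_requests


if __name__ == "__main__":
    monthly_requests = 10_000 * 60 * 24 * 30  # 10,000 requests per minute
    print(allowed_failures(monthly_requests, 0.999))
    print(failures_left(monthly_requests, 108_000, 0.999))
```

A negative result from `failures_left` means the budget for the window is already overspent.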
Error Budget Reference Table
| SLO Target | Downtime per Month | Downtime per Week | Downtime per Day |
|------------|--------------------|-------------------|------------------|
| 99% | 7h 12m | 1h 40m | 14m 24s |
| 99.5% | 3h 36m | 50m 24s | 7m 12s |
| 99.9% | 43m 12s | 10m 4s | 1m 26s |
| 99.95% | 21m 36s | 5m 2s | 43s |
| 99.99% | 4m 19s | 1m 0s | 8.6s |
| 99.999% | 26s | 6s | 0.86s |
Keep this table handy. When someone asks how much downtime you can have, you have the answer.
Implementing Error Budgets with Prometheus
Measuring Error Budget Consumption
Track error budget in Prometheus using recording rules:
# prometheus-rules.yaml
groups:
  - name: error_budget
    interval: 30s
    rules:
      # Rate of 5xx responses over 30 days; falls back to 0 when no 5xx series exist
      - record: job:slo_errors_total:rate30d
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[30d]))
          or
          vector(0)
      # Rate of all requests over the same window
      - record: job:slo_requests_total:rate30d
        expr: |
          sum(rate(http_requests_total[30d]))
      # Percent of budget left; 0.001 is the allowed error ratio for a 99.9% SLO
      - record: job:error_budget_remaining
        expr: |
          (1 - (job:slo_errors_total:rate30d / job:slo_requests_total:rate30d) / 0.001) * 100
        labels:
          slo: "99.9%"
      # Percent of budget already spent
      - record: job:error_budget_consumed
        expr: |
          (job:slo_errors_total:rate30d / job:slo_requests_total:rate30d) / 0.001 * 100
        labels:
          slo: "99.9%"
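To sanity-check the recording rules before deploying them, it can help to reproduce their arithmetic outside Prometheus. This is a minimal sketch with hypothetical function names; the inputs stand in for the values of `job:slo_errors_total:rate30d` and `job:slo_requests_total:rate30d`.

```python
def budget_consumed_percent(error_rate: float, request_rate: float,
                            slo: float = 0.999) -> float:
    """Percent of the error budget spent, same formula as job:error_budget_consumed."""
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    error_ratio = error_rate / request_rate
    return error_ratio / (1 - slo) * 100


def budget_remaining_percent(error_rate: float, request_rate: float,
                             slo: float = 0.999) -> float:
    """Percent of the error budget left; negative once the budget is overspent."""
    return 100 - budget_consumed_percent(error_rate, request_rate, slo)


if __name__ == "__main__":
    # 0.05 errors/s out of 100 requests/s is an error ratio of 0.05 percent
    print(f"{budget_remaining_percent(0.05, 100):.1f}% remaining")
```

An error ratio equal to half the allowed 0.1 percent leaves half the budget, and a ratio above 0.1 percent drives the remaining value below zero.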
Visualizing Error Budget in Grafana
Query for a simple budget remaining gauge panel:
(1 - sum(rate(http_requests_total{status=~"5.."}[30d])) /
sum(rate(http_requests_total[30d]))
/ 0.001) * 100
Color the panel:
- Green when budget is above 50 percent
- Yellow between 0 and 50 percent
- Red when budget is exhausted (0 percent or negative)
Add a second panel showing burn rate:
# Current burn rate (1 = normal, 14.4 = consuming budget in 2 days)
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) / 0.001
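The burn rate query boils down to one division, sketched here in Python. The function name `burn_rate` is illustrative, and returning 0 when there is no traffic is an assumption of this sketch, not something Prometheus does for you.

```python
def burn_rate(errors: float, requests: float, slo: float = 0.999) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    A value of 1 spends the budget exactly over the SLO window;
    higher values spend it proportionally faster.
    """
    if requests <= 0:
        return 0.0  # assumption: no traffic means no budget is being spent
    return (errors / requests) / (1 - slo)


if __name__ == "__main__":
    # 144 failed requests out of 10,000 in the last hour
    print(f"burn rate: {burn_rate(144, 10_000):.1f}x")
```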
Burn Rate Alerts
Burn rate alerts tell you how fast you are consuming your error budget. They are generally preferred over static error-rate thresholds because the alert is tied directly to how much budget is at risk.
Multi-Window Burn Rate Alerting
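The Google SRE Workbook recommends alerting on more than one burn rate, each evaluated over its own window, to catch both fast and slow budget consumption. The rules below pair a 14.4x threshold over 1 hour with a 2x threshold over 6 hours: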
groups:
  - name: burn_rate
    rules:
      # Fast burn: 14.4x over 1 hour (consumes 30-day budget in 2 days)
      - alert: ErrorBudgetCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.001 * 14.4
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error budget critical burn rate"
          description: "Error rate is {{ $value | humanizePercentage }} (SLO: 99.9%)"
      # Slow burn: 2x over 6 hours (consumes 30-day budget in 15 days)
      - alert: ErrorBudgetWarning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > 0.001 * 2
        for: 30m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Error budget warning burn rate"
Burn Rate Reference
| Burn Rate | Time to Exhaust Budget | Alert Severity | Action Required |
|-----------|------------------------|----------------|-----------------|
| 1x | 30 days | None | No action |
| 2x | 15 days | Warning | Investigate during business hours |
| 6x | 5 days | Critical | Page on-call, investigate immediately |
| 14.4x | 2 days | Critical | Page on-call, freeze deploys if sustained over 30 min |
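The exhaustion times in the table follow from dividing the SLO window by the burn rate. A small sketch, with hypothetical function names, also shows how much of the budget a sustained burn spends during one alert window:

```python
def days_to_exhaust(burn: float, window_days: float = 30) -> float:
    """Days until the whole budget is gone if this burn rate is sustained."""
    if burn <= 0:
        return float("inf")
    return window_days / burn


def budget_spent_in_window(burn: float, alert_window_hours: float,
                           window_days: float = 30) -> float:
    """Fraction of the total budget spent while burning for one alert window."""
    return burn * alert_window_hours / (window_days * 24)


if __name__ == "__main__":
    for rate, hours in ((14.4, 1), (6, 6), (2, 6)):
        print(f"{rate}x: exhausted in {days_to_exhaust(rate):.1f} days, "
              f"{budget_spent_in_window(rate, hours):.1%} spent in {hours}h")
```

This is where the 14.4x figure comes from: burning at 14.4x for one hour spends 14.4 / 720 = 2 percent of a 30-day budget.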
Building an Error Budget Policy
An error budget policy defines what happens as the budget is consumed and when it is depleted. Without a policy, error budgets are just numbers on a dashboard.
Sample Policy
policy:
  service: payment-api
  slo: 99.9%
  window: 30 days (rolling)
  thresholds:
    - budget_remaining: 50%
      action: "Notify team in Slack. Review next sprint priorities."
      severity: info
    - budget_remaining: 25%
      action: "Create a reliability ticket. Assign to current sprint."
      severity: warning
    - budget_remaining: 0%
      action: "Freeze all feature deployments. Team works on reliability."
      requires: "CTO approval to unfreeze"
      severity: critical
  exceptions:
    - "Planned maintenance with 48h notice"
    - "Third-party dependency outages (tracked separately)"
    - "New feature launches: 2-week grace period with 50% SLO target"
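A policy like this is easy to evaluate in code. The sketch below hard-codes the three thresholds from the sample policy; the `Threshold` class and `triggered_threshold` function are hypothetical, and it assumes a threshold is reached once the remaining budget is at or below its value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Threshold:
    budget_remaining: float  # percent of budget left at which the action applies
    severity: str
    action: str


POLICY: List[Threshold] = [
    Threshold(50, "info", "Notify team in Slack. Review next sprint priorities."),
    Threshold(25, "warning", "Create a reliability ticket. Assign to current sprint."),
    Threshold(0, "critical", "Freeze all feature deployments. Team works on reliability."),
]


def triggered_threshold(remaining_percent: float,
                        policy: List[Threshold] = POLICY) -> Optional[Threshold]:
    """Return the strictest threshold that has been reached, or None if none has."""
    reached = [t for t in policy if remaining_percent <= t.budget_remaining]
    # The lowest budget_remaining among reached thresholds is the most severe one
    return min(reached, key=lambda t: t.budget_remaining, default=None)


if __name__ == "__main__":
    for remaining in (80, 40, 10, -5):
        hit = triggered_threshold(remaining)
        print(remaining, hit.severity if hit else "ok")
```

Returning only the strictest threshold keeps the output to one action per evaluation, which is simpler to route to Slack or a ticketing system.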
Automating the Deploy Freeze
Automate the freeze based on budget status:
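#!/usr/bin/env bash
# check-error-budget.sh: block the deploy when the error budget is exhausted
set -euo pipefail

# Read the recorded budget; jq prints "null" when the query returns no series
ERROR_BUDGET=$(curl -sf http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=job:error_budget_remaining' \
  | jq -r '.data.result[0].value[1]')

# NaN appears when the request rate is zero; treat both cases as "unknown"
if [[ "$ERROR_BUDGET" == "null" || "$ERROR_BUDGET" == "NaN" ]]; then
  echo "Could not read error budget from Prometheus." >&2
  exit 1
fi

if (( $(echo "$ERROR_BUDGET <= 0" | bc -l) )); then
  echo "Error budget exhausted. Freezing deployments."
  # Switch the app to manual sync so Argo CD stops deploying automatically
  argocd app set payment-api --sync-policy none
  curl -X POST -H "Content-Type: application/json" \
    --data '{"text":"Deploy freeze: error budget exhausted for payment-api"}' \
    "$SLACK_WEBHOOK_URL"
  exit 1
fi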
In your CI/CD pipeline:
# .gitlab-ci.yml
deploy-production:
  stage: deploy
  before_script:
    - ./check-error-budget.sh
  script:
    - helm upgrade --install payment-api ./chart
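If you prefer to make the freeze decision in Python rather than in shell, the parsing step might look like the sketch below. It expects the JSON body of a Prometheus instant query for `job:error_budget_remaining`. The function names are hypothetical, and freezing when the budget cannot be read (failing closed) is an assumption you may want to reverse.

```python
import json
import math
from typing import Optional


def remaining_budget(body: str) -> Optional[float]:
    """Extract the budget value from a Prometheus instant-query response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        return None
    results = payload.get("data", {}).get("result", [])
    if not results:
        return None
    # Each sample is [timestamp, "value as string"]
    value = float(results[0]["value"][1])
    return None if math.isnan(value) else value


def should_freeze(remaining: Optional[float]) -> bool:
    """Freeze when the budget is exhausted or unknown (assumption: fail closed)."""
    return remaining is None or remaining <= 0


if __name__ == "__main__":
    sample = ('{"status":"success","data":{"resultType":"vector",'
              '"result":[{"metric":{"slo":"99.9%"},"value":[1700000000,"-12.5"]}]}}')
    print(should_freeze(remaining_budget(sample)))
```

Keeping parsing and decision in separate functions makes it easy to test the freeze logic without a running Prometheus.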
Error Budgets in Multi-Service Architectures
In a microservice architecture, error budgets become more complex. Downstream failures cascade.
Dependency Budgets
Each service has its own SLO and error budget. A downstream service burning its budget does not mean the upstream service is unhealthy:
User Service (99.9% SLO)
|-- calls to --> Payment Service (99.95% SLO)
|-- calls to --> Notification Service (99.5% SLO)
Track dependency budgets separately:
# User Service errors caused by Payment
sum(rate(user_service_errors{error_type="payment_timeout"}[30d]))
/
sum(rate(user_service_requests_total[30d]))
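Once you have the dependency-attributed error count, you can express it as a share of the calling service's budget. This is a minimal sketch; `dependency_budget_share` is a hypothetical helper and the figures in the example are invented.

```python
def dependency_budget_share(dependency_errors: float, total_requests: float,
                            slo: float = 0.999) -> float:
    """Fraction of the caller's error budget consumed by one dependency's failures."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_errors = total_requests * (1 - slo)
    return dependency_errors / allowed_errors


if __name__ == "__main__":
    # Invented numbers: 400 payment timeouts out of 1,000,000 User Service requests
    share = dependency_budget_share(400, 1_000_000)
    print(f"Payment timeouts used {share:.0%} of the User Service budget")
```

A dependency that regularly takes a large share is a signal to add timeouts, retries, or fallbacks in the caller, or to renegotiate the dependency's SLO.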
Composite Error Budgets
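For customer-facing features that span multiple services, create a composite SLO measured at the feature level. The expression below returns a result only when the 30-day error ratio of the whole flow exceeds the 0.1 percent budget: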
# Checkout flow spans cart, payment, and inventory services
(
sum(rate(checkout_errors_total[30d]))
/
sum(rate(checkout_requests_total[30d]))
) > 0.001
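When choosing the composite target, it helps to estimate what the underlying services can deliver together. As a rough sketch, if the flow needs every service to succeed and failures are independent (a simplifying assumption that rarely holds exactly), the availabilities multiply. The service targets below are invented for illustration.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Estimated availability of a flow that needs every listed service to succeed.

    Assumes failures are independent, which is only an approximation.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Hypothetical targets for cart, payment, and inventory
    estimate = serial_availability([0.999, 0.9995, 0.999])
    print(f"estimated checkout availability: {estimate:.4%}")
```

With these inputs the estimate lands near 99.75 percent, so a 99.9 percent target for the checkout flow would not be achievable without tightening the individual SLOs or adding fallbacks.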
Common Error Budget Mistakes
Mistake 1: Setting Budgets Too Tight
A 99.99 percent SLO for a non-critical internal dashboard means 4.3 minutes of downtime per month. Every deploy becomes a stress test. Set SLOs based on user impact:
| Service Type | Suggested SLO | Rationale |
|--------------|---------------|-----------|
| Payment processing | 99.99% | Direct revenue impact |
| Core API | 99.9% | User-facing, moderate impact |
| Internal reporting | 99.5% | No customer impact |
| Background jobs | 99.0% | Retry-based, tolerant of downtime |
Mistake 2: Ignoring the Budget
An error budget that nobody checks is worthless. Put it on a dashboard visible to the whole team. Send weekly summaries to Slack.
Mistake 3: Using Averages
An average error rate of 0.1 percent over 30 days can hide a catastrophic 10-minute outage. Use burn rate alerts with multiple windows instead of long-period averages.
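A quick calculation shows how much a long window dilutes a short outage. The sketch assumes uniform traffic and no errors outside the outage; `outage_error_ratio` is a hypothetical helper.

```python
def outage_error_ratio(outage_minutes: float, window_minutes: float) -> float:
    """Error ratio over a window that contains one full outage.

    Assumes uniform traffic and no other errors in the window.
    """
    return min(outage_minutes, window_minutes) / window_minutes


if __name__ == "__main__":
    allowed = 0.001  # 99.9 percent SLO
    for label, minutes in (("30 days", 30 * 24 * 60), ("1 hour", 60)):
        ratio = outage_error_ratio(10, minutes)
        print(f"{label}: error ratio {ratio:.4%}, burn rate {ratio / allowed:.1f}x")
```

Over 30 days the 10-minute outage stays well under the 0.1 percent budget, while over a 1-hour window the same outage shows up as a burn rate far above the 14.4x paging threshold.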
Mistake 4: Single Window for All Services
A 30-day window works for stable services. For new services or services undergoing major changes, use a 7-day window to react faster.
Conclusion
Error budgets transform reliability from an emotional debate into a data-driven decision. Instead of arguing about whether an outage was bad enough to delay a feature, you check a number.
Start small: pick one service, set a realistic SLO, measure the error budget, and put it on a dashboard. Add burn rate alerts. Define a simple policy: deploy freeze when budget hits zero.
The goal is not zero errors. The goal is spending your error budget on the things that matter most to your users.
Action items for this week:
- Calculate your current error rate for one critical service
- Set a realistic SLO (start with 99.5 percent or 99.9 percent)
- Create a Prometheus recording rule for error budget
- Add error budget gauge to your dashboard
- Define a simple error budget policy with your team