Incident Management Runbook: The Complete SRE Template for 2026

A production-ready incident management runbook template for SRE and DevOps teams. Covers severity levels, roles, response lifecycle, automation, and a postmortem template you can copy today.

June 25, 2026 · 18 min read
#incident-management #runbook #sre #incident-response #postmortem #pagerduty #opsgenie #on-call

Introduction

Every minute of downtime costs your company money. For an e-commerce platform, that can be thousands of dollars per minute. For a fintech startup, it is lost trust that takes months to rebuild. Yet when an incident strikes, many teams still scramble through Slack channels, DMs, and scattered Google Docs, and lose the first 15 minutes just figuring out who owns what.

An Incident Management Runbook fixes this. It is a single source of truth that tells everyone — engineer, manager, or new hire awakened at 3 AM — exactly what to do, who to call, and how to escalate. It eliminates guesswork. It compresses time-to-resolution. It saves your team from burnout.

This guide provides a complete, copy-paste-ready incident management runbook built from real-world SRE practices at companies like Google, Netflix, and PagerDuty. By the end, you will have:

  • A severity-level framework your entire org can agree on
  • A role assignment system (Incident Commander, Comms Lead, etc.)
  • A step-by-step response lifecycle from detection to postmortem
  • A runbook template you can deploy to your wiki or runbook tool today
  • Automation patterns using PagerDuty, Opsgenie, and Slack integrations

Whether you are a two-person startup or a 200-engineer platform team, this runbook scales with you. Let's build it.

1. What Is an Incident Management Runbook?

A runbook is a documented, step-by-step procedure for responding to specific types of incidents. Unlike a general "incident response policy" (which says what to do at a high level), a runbook specifies exactly how to do it — which commands to run, which dashboards to check, which people to page.

A good runbook answers these questions before the incident starts:

  • Who is on-call right now, and who is their backup?
  • What constitutes a SEV0 vs SEV1 vs SEV2 vs SEV3?
  • How do we declare an incident and notify stakeholders?
  • Where are the dashboards, logs, and runbooks for each service?
  • When do we escalate to the next tier or wake up the CTO?

The runbook should be stored somewhere that stays reachable during an outage, which means it must not live only behind the company VPN, because the VPN itself may be what is down. Git repositories mirrored to multiple locations, printed copies in the NOC, or tools like PagerDuty Runbook Automation and Rundeck are common solutions.

2. Incident Severity Levels

Before anyone can respond, everyone must agree on what "SEV1" means. Without a shared severity framework, you get arguments during incidents about whether something is "really that bad" — wasting time when seconds matter.

Here is a four-level framework similar to those used by many SRE organizations:

| Severity | Definition | Example | Response SLA | Escalation |
|----------|------------|---------|--------------|------------|
| SEV0 | Complete service outage. All users affected. Revenue impact. | Website down, payment gateway offline, 100% 5xx errors | 5 min acknowledge, 15 min resolve or escalate | Page CTO + VP Eng immediately |
| SEV1 | Major feature broken. Majority of users affected. No workaround. | Login broken, checkout fails, API returning errors for 50%+ users | 10 min acknowledge, 30 min resolve or escalate | Page Engineering Director + on-call manager |
| SEV2 | Partial degradation. Subset of users affected. Workaround exists. | Slow page loads, search results stale, one region degraded | 30 min acknowledge, 2 hours resolve or escalate | Notify team lead, on-call engineer handles |
| SEV3 | Minor issue. Cosmetic or non-critical. | Typo on landing page, broken image in blog, non-critical cron job failure | Next business day | Create ticket, handle during business hours |

Customize the thresholds for your business. An e-commerce site during Black Friday treats a 2% error rate as a SEV0. A SaaS tool on a Sunday afternoon treats it as a SEV2. Define what "revenue impact" means for your specific context.

Key principle: Err on the side of over-declaring. It is always better to downgrade a SEV2 → SEV3 after investigation than to discover a SEV0 was misclassified as a SEV2 for 45 minutes.
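
To make the framework harder to argue about mid-incident, some teams encode it as data plus a small classification helper. The following is a minimal sketch, not a standard: the `Impact` fields, the `classify_severity` function, and the 95% and 50% cut-offs are all illustrative assumptions, while the response targets are copied from the table above.

```python
from dataclasses import dataclass

# Response targets in minutes, copied from the severity table.
# None means "next business day" (no paging SLA).
RESPONSE_SLA_MINUTES = {
    "SEV0": {"acknowledge": 5, "resolve_or_escalate": 15},
    "SEV1": {"acknowledge": 10, "resolve_or_escalate": 30},
    "SEV2": {"acknowledge": 30, "resolve_or_escalate": 120},
    "SEV3": {"acknowledge": None, "resolve_or_escalate": None},
}


@dataclass
class Impact:
    fraction_users_affected: float  # 0.0 to 1.0, best current estimate
    has_workaround: bool


def classify_severity(impact: Impact) -> str:
    """Map an impact estimate to a severity level.

    The numeric cut-offs (95% and 50%) are assumptions for illustration;
    tune them to what "revenue impact" means for your business.
    """
    if impact.fraction_users_affected >= 0.95:
        return "SEV0"
    if impact.fraction_users_affected >= 0.50 and not impact.has_workaround:
        return "SEV1"
    if impact.fraction_users_affected > 0:
        return "SEV2"
    return "SEV3"
```

When the impact estimate is uncertain, pass the pessimistic number: that keeps the helper consistent with the over-declare principle.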

3. Incident Response Roles

Having clearly defined roles prevents the most common incident pitfall: everyone trying to do everything at once, drowning in Slack noise, and nobody communicating with stakeholders.

| Role | Responsibility | Who Usually Fills It |
|------|----------------|----------------------|
| Incident Commander (IC) | Runs the incident. Makes all decisions. Keeps the response moving. Only person who can declare "resolved." | Most senior on-call engineer, or designated IC from rotation |
| Operations Lead (OL) | Investigates and mitigates. Runs commands, checks dashboards, implements fixes. | On-call engineer for the affected service |
| Communications Lead (CL) | Manages all external communication: status page updates, Slack announcements, customer-facing messages. Shields IC from interruptions. | Engineering manager, TPM, or designated comms person |
| Scribe | Documents everything in real time: timeline, actions taken, hypotheses tested. Critical for postmortem. | Junior engineer, intern, or automated via incident tooling |

In a small team, one person may wear multiple hats — but never combine IC and CL. The IC needs uninterrupted focus on resolution; the CL absorbs all external noise. If your team is 4+ engineers on-call, rotate the IC role so nobody burns out.
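
If your incident tooling assigns roles programmatically, the "never combine IC and CL" rule is easy to enforce in code. Here is a small sketch under stated assumptions: `validate_roles` and the role keys are hypothetical names, not part of any particular tool.

```python
REQUIRED_ROLES = ("IC", "OL", "CL", "Scribe")


def validate_roles(assignments: dict[str, str]) -> list[str]:
    """Return a list of problems with a role assignment (empty if it is fine).

    One person may hold several roles, but the Incident Commander and the
    Communications Lead must be different people.
    """
    problems = []
    for role in REQUIRED_ROLES:
        if not assignments.get(role):
            problems.append(f"{role} is unassigned")
    ic, cl = assignments.get("IC"), assignments.get("CL")
    if ic and ic == cl:
        problems.append("IC and CL must not be the same person")
    return problems
```

For example, `validate_roles({"IC": "alice", "OL": "charlie", "CL": "diana", "Scribe": "charlie"})` returns an empty list, because doubling up as OL and Scribe is allowed.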

When an incident is declared, the first person on the scene automatically becomes Interim IC until a designated IC joins. They announce in the incident channel:

/incident declare
SEV: [0/1/2/3]
Title: [Brief description]
IC: @username (interim)
Channel: #incident-2026-0625-001

4. The Incident Response Lifecycle

Every incident follows five phases. Your runbook should have a clear procedure for each.

Phase 1: Detection

Incidents are detected through three channels:

Automated monitoring — Alerts from Prometheus Alertmanager, Datadog, Grafana, or New Relic that fire when SLO burn rates exceed thresholds. If you do not have SLO-based alerting yet, read our Error Budgets Guide first.

User reports — Customer support tickets, social media complaints, or internal bug reports. Route these to the on-call channel automatically via webhook.

Engineer observation — A team member notices something wrong during a deployment or code review.

Regardless of how it is detected, the first step is always the same: verify the signal is real. Check the affected dashboard, run a quick smoke test, and confirm you are not chasing a monitoring false positive.

Phase 2: Declaration

Once verified, the on-call engineer declares the incident. This triggers:

  1. Create an incident channel: #incident-YYYY-MMDD-NNN in Slack or Teams
  2. Page the on-call rotation via PagerDuty/Opsgenie
  3. Post initial status to the status page (or internal status channel if no public page)
  4. Assign roles — IC, OL, CL, Scribe

A declaration message looks like:

🚨 INCIDENT DECLARED — SEV2
Title: Checkout API returning 503 errors in us-east-1
IC: @alice (interim, @bob is primary IC joining in 2 min)
OL: @charlie
CL: @diana
Channel: #incident-2026-0625-003
Dashboard: https://grafana.example.com/d/checkout
Runbook: https://wiki.example.com/runbooks/checkout-api
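
The channel naming scheme is simple enough to generate automatically. As a rough sketch (the `next_incident_channel` helper is hypothetical, and it assumes you can list existing channel names from your chat tool), the next free name for a given day could be computed like this:

```python
from datetime import date


def next_incident_channel(existing: list[str], day: date) -> str:
    """Return the next free channel name in #incident-YYYY-MMDD-NNN format."""
    prefix = f"#incident-{day:%Y-%m%d}-"
    # Collect the sequence numbers already used on this day.
    used = [
        int(name[len(prefix):])
        for name in existing
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    ]
    next_number = max(used, default=0) + 1
    return f"{prefix}{next_number:03d}"
```

With `#incident-2026-0625-001` and `#incident-2026-0625-002` already present, calling it for 25 June 2026 yields `#incident-2026-0625-003`.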

Phase 3: Diagnosis & Mitigation

This is the core of incident response. The OL investigates while the IC coordinates. The process follows a structured loop:

  1. Triage — Isolate the blast radius. Which users? Which region? Which component?
  2. Hypothesize — Propose a likely cause. "Recent deploy changed the DB connection pool size."
  3. Test — Validate the hypothesis. Check deploy logs. Roll back if the hypothesis is strong enough.
  4. Mitigate — Stop the bleeding. Rollback, scale up, failover, feature flag off. Mitigation comes before root cause. A customer does not care why the site is down — they want it back up.
  5. Verify — Confirm the fix worked. Watch dashboards for 5-10 minutes.

Golden rule: If you have not found the cause in 15 minutes, escalate. Call in more engineers. Wake up the service owner. Do not hero-solo an incident.

Phase 4: Resolution

The IC declares the incident resolved only when:

  • Service is restored and verified for at least 10 minutes
  • All alerts have returned to normal
  • Customers are no longer impacted
  • A rollback or permanent fix is in place (not a fragile workaround)
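
The four resolution criteria can be expressed as a simple gate that the IC (or a bot) checks before posting "resolved". This is only a sketch: `ServiceState` and `blockers_to_resolution` are hypothetical names, and in practice the inputs would come from your monitoring and status tooling.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    minutes_healthy: int        # time since the service was verified restored
    firing_alerts: int          # alerts still firing for this incident
    customers_impacted: bool
    durable_fix_in_place: bool  # rollback or permanent fix, not a workaround


def blockers_to_resolution(state: ServiceState, min_healthy_minutes: int = 10) -> list[str]:
    """List every resolution criterion that is not yet met (empty means OK)."""
    blockers = []
    if state.minutes_healthy < min_healthy_minutes:
        blockers.append(f"service verified healthy for only {state.minutes_healthy} min")
    if state.firing_alerts > 0:
        blockers.append(f"{state.firing_alerts} alert(s) still firing")
    if state.customers_impacted:
        blockers.append("customers still impacted")
    if not state.durable_fix_in_place:
        blockers.append("no rollback or permanent fix in place")
    return blockers
```

Returning the list of unmet criteria, rather than a bare boolean, gives the IC something concrete to read out in the channel.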

The resolution message:

✅ INCIDENT RESOLVED — SEV2
Duration: 47 minutes (14:03 – 14:50 UTC)
Root cause: Upstream payment gateway outage; no circuit breaker on upstream calls
Mitigation: Retry circuit breaker activated (rollback to v2.4.0 had no effect)
Impact: ~12% of checkout requests failed (est. 1,200 affected users)
Postmortem: Scheduled for 2026-06-26 10:00 UTC
Action items: #INC-42, #INC-43

Phase 5: Postmortem (Blameless)

Within 24-48 hours of resolution, hold a blameless postmortem. The goal is not to assign blame — it is to prevent recurrence. We cover the full postmortem template in Section 8.

5. The Runbook Template

Here is the actual runbook template. Copy this into your wiki, Notion, Confluence, or runbook automation tool. Fill in the [PLACEHOLDERS] for each service you own.

# Runbook: [SERVICE NAME]

## Service Overview
- **Owner Team:** [Team Name]
- **On-Call Rotation:** [PagerDuty/Opsgenie escalation policy link]
- **Primary Dashboard:** [Grafana/Datadog link]
- **Logs:** [Kibana/Loki/Splunk link]
- **Source Code:** [GitHub/GitLab link]
- **CI/CD Pipeline:** [GitHub Actions/GitLab CI link]
- **Runbook Last Updated:** [YYYY-MM-DD]

## Dependencies
- **Upstream:** [List services this depends on]
- **Downstream:** [List services that depend on this]
- **External:** [Third-party APIs, databases, CDNs]

## Alert Triggers
| Alert Name | Severity | Threshold | Dashboard Link |
|-----------|----------|-----------|---------------|
| High Error Rate | SEV1 | >5% 5xx for 5 min | [link] |
| High Latency p99 | SEV2 | >2s for 10 min | [link] |
| Pod CrashLoopBackOff | SEV1 | Any pod restarting >3 times | [link] |
| Certificate Expiry | SEV3 | <30 days until expiry | [link] |

## Common Incidents

### 1. High 5xx Error Rate
**Symptoms:** Dashboard shows error rate spike, users report failures
**Likely Causes:**
- Recent deployment introduced bug → Check deploy history
- Upstream dependency failure → Check dependency dashboards
- Database connection pool exhausted → Check DB metrics
- Rate limiting triggered → Check API gateway metrics

**Immediate Actions:**
1. Check recent deployments:

kubectl rollout history deployment/[service-name] -n production

2. Rollback if deployment was within last 30 minutes:

kubectl rollout undo deployment/[service-name] -n production

3. Check upstream dependencies:

curl -s https://[dependency]/health

4. Scale up replicas if traffic spike:

kubectl scale deployment/[service-name] --replicas=10 -n production

5. If none of the above help → escalate to [TEAM NAME], page [PERSON]

### 2. Pods in CrashLoopBackOff
**Symptoms:** `kubectl get pods` shows restarts, deployment not progressing
**Likely Causes:**
- Misconfigured environment variables or secrets
- Missing PersistentVolume or storage issue
- OOMKilled (memory limit too low)
- Readiness/Liveness probe misconfigured

**Immediate Actions:**
1. Check pod logs:

kubectl logs [pod-name] -n production --tail=100

2. Check previous container logs (if crash + restart):

kubectl logs [pod-name] -n production --previous

3. Describe the pod for events:

kubectl describe pod [pod-name] -n production

4. Check resource usage:

kubectl top pod [pod-name] -n production

5. If OOMKilled → increase memory limits and restart

### 3. Certificate Expiry
**Preventative:** Run this check weekly via cron:
```bash
echo | openssl s_client -servername [domain] -connect [domain]:443 2>/dev/null | \
openssl x509 -noout -dates
```

## Escalation Path
| Level | Who | When | Contact |
|-------|-----|------|---------|
| L1 | On-call engineer | Immediate | PagerDuty rotation |
| L2 | Service owner / Tech Lead | If unresolved after 15 min | Slack @team-leads |
| L3 | Engineering Manager | If unresolved after 30 min | Phone call |
| L4 | Director / VP Engineering | If SEV0 after 45 min | Phone call |
| L5 | CTO | SEV0 lasting >1 hour | Phone call |

## Post-Incident
- Create postmortem doc within 24 hours
- Create action items in issue tracker
- Update this runbook if new causes or fixes were discovered
- Verify monitoring covers the failure mode detected

This template is your starting point. Every service in your organization should have one. Keep it updated — an outdated runbook is worse than no runbook because it wastes time with stale information.
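
The escalation path in the template is effectively a lookup from elapsed time and severity to the people who should already be involved. A minimal sketch of that lookup follows; `ESCALATION_LADDER` and `levels_to_engage` are hypothetical names, and the thresholds are the template's example values rather than recommendations.

```python
# (level, who, engage once unresolved for at least N minutes, SEV0 only?)
ESCALATION_LADDER = [
    ("L1", "On-call engineer", 0, False),
    ("L2", "Service owner / Tech Lead", 15, False),
    ("L3", "Engineering Manager", 30, False),
    ("L4", "Director / VP Engineering", 45, True),
    # "SEV0 lasting >1 hour" means strictly more than 60 minutes.
    ("L5", "CTO", 61, True),
]


def levels_to_engage(severity: str, minutes_unresolved: int) -> list[str]:
    """Return every escalation level that should be engaged by now."""
    return [
        level
        for level, _who, after_minutes, sev0_only in ESCALATION_LADDER
        if minutes_unresolved >= after_minutes
        and (severity == "SEV0" or not sev0_only)
    ]
```

A paging bot could run this every few minutes and page any level that is in the returned list but has not yet been notified.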

6. Automating the Runbook

A static runbook in a wiki is step one. The real SRE progression is toward automated runbooks, where the on-call engineer receives a pre-filled incident channel with relevant dashboards and diagnostic commands already executed.

PagerDuty + Rundeck Automation

Integrate PagerDuty with Rundeck (or Ansible Automation Platform) to trigger diagnostic jobs automatically when an alert fires:

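```yaml
# Rundeck job definition (simplified), triggered by a PagerDuty webhook
- name: checkout-api-auto-diagnose
  description: Collect first-pass diagnostics for checkout-api
  loglevel: INFO
  nodefilters:
    filter: 'name: kubernetes-prod'
  sequence:
    keepgoing: true   # keep collecting output even if one step fails
    strategy: node-first
    commands:
      - exec: kubectl get pods -n production -l app=checkout
      - exec: kubectl top pods -n production -l app=checkout
      - exec: kubectl logs -n production -l app=checkout --tail=50
      - exec: curl -s https://checkout-api/health
```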

The output is posted to the incident Slack channel before the on-call engineer even opens their laptop.

Slack Slash Commands

Build Slack slash commands for common incident actions:

/incident declare checkout-api "503 errors in us-east-1" --sev=SEV2

→ Creates #incident-2026-0625-004, posts dashboard links, pages on-call, assigns IC.

/incident diagnose checkout-api

→ Runs kubectl describe, checks recent deployments, posts logs.

/incident resolve

→ Prompts for root cause, duration, impact summary, posts resolution template.
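
On the receiving side, the slash command text has to be parsed before anything can be automated. The sketch below shows one way to do that with the standard library; `parse_incident_command` is a hypothetical helper, and the wiring to Slack itself (request verification, responding to the webhook) is deliberately left out.

```python
import argparse
import shlex


def parse_incident_command(text: str) -> argparse.Namespace:
    """Parse text such as: /incident declare checkout-api "503 errors" --sev=SEV2"""
    parser = argparse.ArgumentParser(prog="/incident")
    actions = parser.add_subparsers(dest="action", required=True)

    declare = actions.add_parser("declare")
    declare.add_argument("service")
    declare.add_argument("title")
    declare.add_argument("--sev", required=True,
                         choices=["SEV0", "SEV1", "SEV2", "SEV3"])

    diagnose = actions.add_parser("diagnose")
    diagnose.add_argument("service")

    actions.add_parser("resolve")

    # shlex keeps the quoted title together as a single argument.
    tokens = shlex.split(text)
    if tokens and tokens[0] == "/incident":
        tokens = tokens[1:]
    return parser.parse_args(tokens)
```

Making `--sev` mandatory is a design choice: it forces whoever declares the incident to commit to a severity up front instead of relying on a silent default.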

GitOps for Runbooks

Store runbooks as Markdown in the same Git repository as the service code. This enforces:

  • Version control — Every runbook change is reviewed via PR
  • Co-location — Developers update the runbook when they change the service
  • CI/CD integration — Runbook validity checks in CI (e.g., lint markdown, verify links)

my-service/
├── src/
├── Dockerfile
├── k8s/
└── RUNBOOK.md    # ← Living next to the code
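
One runbook validity check that fits naturally into CI is confirming that RUNBOOK.md still contains every section of the template. The following is a sketch only: `missing_sections` is a hypothetical helper, and the required list assumes you adopted the section names from the template in Section 5.

```python
import re

# Second-level headings expected in every runbook (from the Section 5 template).
REQUIRED_SECTIONS = [
    "Service Overview",
    "Dependencies",
    "Alert Triggers",
    "Common Incidents",
    "Escalation Path",
    "Post-Incident",
]


def missing_sections(markdown: str) -> list[str]:
    """Return the required '## ' headings that the runbook does not contain."""
    headings = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+)$", markdown, flags=re.MULTILINE)
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]
```

A CI step could read the file, call `missing_sections`, and fail the build when the returned list is not empty.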

7. Common Pitfalls (and How to Avoid Them)

Even teams with a runbook make these mistakes. Learn from them.

Pitfall 1: The Runbook Is Outdated

Symptom: On-call follows a runbook that references a decommissioned dashboard, a renamed Slack channel, or a service that was migrated six months ago.

Fix: Treat the runbook as code. Require a runbook update as part of every significant deployment or service change. Use a CI check that verifies all links in the runbook return HTTP 200. Set a calendar reminder to audit all runbooks quarterly.

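# CI check: verify all URLs in the runbook (bash; needs GNU grep for -P)
failed=0
while read -r url; do
  # -L follows redirects; --max-time stops a dead host from hanging the job
  status=$(curl -sIL -o /dev/null --max-time 10 -w "%{http_code}" "$url")
  if [ "$status" != "200" ]; then
    echo "BROKEN: $url → $status"
    failed=1   # keep going so every broken link is reported in one run
  fi
done < <(grep -oP 'https?://[^\s)\]]+' RUNBOOK.md | sort -u)
exit "$failed"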
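
The quarterly audit can be backed by an automated staleness check on the "Runbook Last Updated" field from the template. This is a rough sketch: `runbook_is_stale` is a hypothetical helper, and the 90-day limit is an assumption standing in for "quarterly".

```python
import re
from datetime import date, timedelta


def runbook_is_stale(markdown: str, today: date, max_age_days: int = 90) -> bool:
    """Return True if the runbook's last-updated date is missing or too old."""
    match = re.search(
        r"\*\*Runbook Last Updated:\*\*\s*(\d{4}-\d{2}-\d{2})", markdown
    )
    if match is None:
        # A missing date or an unfilled [YYYY-MM-DD] placeholder counts as stale.
        return True
    last_updated = date.fromisoformat(match.group(1))
    return today - last_updated > timedelta(days=max_age_days)
```

Treating a missing date as stale is deliberate: a runbook that nobody has dated is one that nobody has reviewed.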

Pitfall 2: Too Many Alerts, Wrong Severity

Symptom: The on-call phone buzzes 40 times per night. Engineers develop alert fatigue. A real SEV0 gets lost in the noise.

Fix: Every alert must be actionable and correctly prioritized. If an alert fires and the correct response is "acknowledge and ignore," delete the alert. Use error budgets as the gating mechanism — only page when the error budget is burning too fast.
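
To make "burning too fast" concrete: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below assumes a 99.9% SLO and a paging threshold of 14.4, a commonly cited value for a 30-day window that corresponds to spending about 2% of the budget in one hour. The function names and the two-window check are illustrative, not a prescribed setup.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(error_ratio_1h: float, error_ratio_5m: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window burn too fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so recovered incidents stop paging.
    """
    return (burn_rate(error_ratio_1h, slo_target) >= threshold
            and burn_rate(error_ratio_5m, slo_target) >= threshold)
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of roughly 20 and pages; a 0.5% error ratio is a burn rate of roughly 5 and does not.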

Pitfall 3: Hero Culture

Symptom: One senior engineer tries to solve everything alone. They do not escalate, do not communicate, and 90 minutes later, the SEV2 is now a SEV0.

Fix: Escalation is not weakness — it is process. The runbook's escalation path exists for a reason. The IC's job is to recognize when to pull in more people, not to solo the fix. Institute a hard rule: if the incident is not mitigated within the SLA window, escalation is mandatory, not optional.

Pitfall 4: No Communication During Incidents

Symptom: Stakeholders flood the IC with DMs. "Is it fixed yet?" "When will it be back?" "The CEO is asking." The IC cannot focus on actually fixing the problem.

Fix: The Communications Lead exists for exactly this reason. Their only job is to post status updates at regular intervals (every 15 minutes for SEV1, every 30 for SEV2) so nobody has to ask. Template:

📢 INCIDENT UPDATE — SEV2 — 14:20 UTC
Status: Still investigating. Checkout API returning 503s.
Mitigation attempted: Rollback to v2.4.0 — no improvement.
Current hypothesis: Upstream payment gateway timeout.
Next update: 14:35 UTC

Pitfall 5: Skipping the Postmortem

Symptom: Incident resolved. Everyone is tired. "We'll do the postmortem later." Later never comes.

Fix: Schedule the postmortem during the resolution call. Block 1 hour on everyone's calendar within 48 hours, while memory is fresh. A postmortem held a week later is far less valuable: details fade, and logs or metrics may already have aged out of retention. If your incident management tooling does not auto-schedule postmortems, add it as a manual step in your runbook.

8. Blameless Postmortem Template

A postmortem is a written record of what happened, why, and what will change. It is not about assigning fault. Use this template:

# Postmortem: [INCIDENT TITLE]

## Metadata
- **Incident ID:** INC-YYYY-MMDD-NNN
- **Date:** [YYYY-MM-DD]
- **Duration:** [HH:MM – HH:MM UTC] ([N] minutes)
- **Severity:** SEV[0/1/2/3]
- **Incident Commander:** @[name]
- **Postmortem Author:** @[name]
- **Status:** [Draft / Reviewed / Published]

## Summary
[One paragraph: what happened, impact, how it was fixed]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:03 | Prometheus alert fired: checkout-api 5xx > 5% |
| 14:05 | @alice acknowledged, began investigation |
| 14:08 | Incident declared SEV2 in #incident-2026-0625-003 |
| 14:12 | Rollback to v2.4.0 attempted — no improvement |
| 14:18 | Upstream payment gateway identified as root cause |
| 14:25 | Payment gateway team paged, confirmed outage on their side |
| 14:35 | Retry circuit breaker activated — error rate dropping |
| 14:47 | All metrics green, 10 min verification passed |
| 14:50 | Incident resolved |

## Root Cause
[Detailed technical explanation. What specific change, failure, or condition triggered the incident?]

## Impact
- **Users affected:** ~1,200 (12% of checkout traffic)
- **Revenue impact:** Estimated $3,400 in lost transactions
- **Data loss:** None
- **Security impact:** None

## What Went Well
- Alert fired within 2 minutes of error rate crossing threshold
- Incident Commander declared within 5 minutes of alert
- Communications Lead posted updates every 15 minutes
- Rollback was attempted quickly even though it didn't help

## What Went Poorly
- Payment gateway was not listed in service dependencies — added 13 min to diagnosis
- No circuit breaker was pre-configured for upstream failures
- Secondary on-call (backup IC) was unreachable for 10 min

## Action Items
| # | Action | Owner | Priority | Due |
|---|--------|-------|----------|-----|
| INC-42 | Add payment gateway to service dependency list and runbook | @charlie | P0 | 2026-06-27 |
| INC-43 | Implement circuit breaker with retry for all upstream calls | @alice | P1 | 2026-07-01 |
| INC-44 | Verify secondary on-call contact info in PagerDuty | @diana | P0 | 2026-06-26 |
| INC-45 | Add synthetic check for payment gateway health | @bob | P2 | 2026-07-15 |

## Lessons Learned
[1-3 sentences capturing the key takeaway for the broader org]

Store postmortems in a shared, searchable location. Over time, they become your organization's institutional memory — patterns emerge, recurring root causes become obvious, and you can justify infrastructure investments with real incident data.
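
Spotting patterns gets easier if every postmortem records the same response metrics. Here is a small sketch that derives them from timeline milestones. The milestone names and the `response_metrics` helper are assumptions, and the example values come from the sample timeline in the template, treating the 14:35 circuit breaker activation as the mitigation point.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def response_metrics(milestones: dict[str, str]) -> dict[str, int]:
    """Compute response metrics, all measured from the detection time."""
    detected = milestones["detected"]
    return {
        "time_to_acknowledge": minutes_between(detected, milestones["acknowledged"]),
        "time_to_declare": minutes_between(detected, milestones["declared"]),
        "time_to_mitigate": minutes_between(detected, milestones["mitigated"]),
        "time_to_resolve": minutes_between(detected, milestones["resolved"]),
    }


# Milestones taken from the sample timeline in the postmortem template.
EXAMPLE_MILESTONES = {
    "detected": "14:03",
    "acknowledged": "14:05",
    "declared": "14:08",
    "mitigated": "14:35",
    "resolved": "14:50",
}
```

Tracking these numbers per incident lets you see whether changes to alerting or escalation actually shorten detection and mitigation over time.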

9. Conclusion

An incident management runbook does not prevent incidents. What it does is far more valuable: it compresses the time between "something is wrong" and "it is fixed." It removes the cognitive load of deciding what to do under pressure and replaces it with a muscle-memory procedure.

Start today:

  1. Pick one service. Write its runbook using the template in Section 5.
  2. Define your severity levels. Get stakeholder alignment — nobody should argue about SEV during an incident.
  3. Practice. Run a fire drill. Fake an incident and walk through the runbook. Find the gaps before a real outage does.
  4. Automate one step. Even something small — an auto-created Slack channel or a diagnostic script — saves minutes during your next SEV2.

The best-run SRE teams do not have fewer incidents. They recover faster, communicate better, and learn more from each one. A runbook is how they do it.


Author: DevToCash

Senior DevOps/SRE Engineer · 10+ years · Professional Trader (IDX, Crypto, US Equities)

I write about real infrastructure patterns and trading strategies I use in production and in live markets. No courses, no affiliate hype — just documentation of what actually works.
