Introduction
Every minute of downtime costs your company money. For an e-commerce platform, that can mean thousands of dollars per minute. For a fintech startup, it can mean lost trust that takes months to rebuild. Yet when an incident strikes, many teams still scramble through Slack, DMs, and scattered Google Docs, often losing the first 15 minutes just figuring out who owns what.
An Incident Management Runbook fixes this. It is a single source of truth that tells everyone — engineer, manager, or new hire awakened at 3 AM — exactly what to do, who to call, and how to escalate. It eliminates guesswork. It compresses time-to-resolution. It saves your team from burnout.
This guide provides a complete, copy-paste-ready incident management runbook modeled on SRE practices popularized by companies like Google, Netflix, and PagerDuty. By the end, you will have:
- A severity-level framework your entire org can agree on
- A role assignment system (Incident Commander, Comms Lead, etc.)
- A step-by-step response lifecycle from detection to postmortem
- A runbook template you can deploy to your wiki or runbook tool today
- Automation patterns using PagerDuty, Opsgenie, and Slack integrations
Whether you are a two-person startup or a 200-engineer platform team, this runbook scales with you. Let's build it.
1. What Is an Incident Management Runbook?
A runbook is a documented, step-by-step procedure for responding to specific types of incidents. Unlike a general "incident response policy" (which says what to do at a high level), a runbook specifies exactly how to do it — which commands to run, which dashboards to check, which people to page.
A good runbook answers these questions before the incident starts:
- Who is on-call right now, and who is their backup?
- What constitutes a SEV1 vs SEV2 vs SEV3?
- How do we declare an incident and notify stakeholders?
- Where are the dashboards, logs, and runbooks for each service?
- When do we escalate to the next tier or wake up the CTO?
The runbook should be stored somewhere reachable during an outage, which rules out anything that sits only behind the company VPN (the VPN itself may be what is down). Git repositories mirrored to multiple locations, printed copies in the NOC, or tools like PagerDuty Runbook Automation and Rundeck are common solutions.
2. Incident Severity Levels
Before anyone can respond, everyone must agree on what "SEV1" means. Without a shared severity framework, you get arguments during incidents about whether something is "really that bad" — wasting time when seconds matter.
Here is a four-level framework similar to what many SRE organizations use, loosely adapted from the practices described in Google's SRE book:
| Severity | Definition | Example | Response SLA | Escalation |
|---|---|---|---|---|
| SEV0 | Complete service outage. All users affected. Revenue impact. | Website down, payment gateway offline, 100% 5xx errors | 5 min acknowledge, 15 min resolve or escalate | Page CTO + VP Eng immediately |
| SEV1 | Major feature broken. Majority of users affected. No workaround. | Login broken, checkout fails, API returning errors for 50%+ users | 10 min acknowledge, 30 min resolve or escalate | Page Engineering Director + on-call manager |
| SEV2 | Partial degradation. Subset of users affected. Workaround exists. | Slow page loads, search results stale, one region degraded | 30 min acknowledge, 2 hours resolve or escalate | Notify team lead, on-call engineer handles |
| SEV3 | Minor issue. Cosmetic or non-critical. | Typo on landing page, broken image in blog, non-critical cron job failure | Next business day | Create ticket, handle during business hours |
Customize the thresholds for your business. An e-commerce site during Black Friday treats a 2% error rate as a SEV0. A SaaS tool on a Sunday afternoon treats it as a SEV2. Define what "revenue impact" means for your specific context.
Key principle: Err on the side of over-declaring. It is always better to downgrade a SEV2 → SEV3 after investigation than to discover a SEV0 was misclassified as a SEV2 for 45 minutes.
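As a rough illustration of how the table can become a first-pass triage rule, the sketch below suggests a starting severity from two inputs. The function name `suggest_severity`, the 50% cutoff for "majority of users", and the yes/no workaround flag are assumptions made for this example, not part of any tool; adjust them to your own definitions.

```bash
# Hypothetical helper: suggest a starting severity.
# Thresholds are illustrative assumptions; tune them to your own table.
#   $1 = percentage of users affected (integer 0-100)
#   $2 = "yes" if a workaround exists, otherwise "no"
suggest_severity() {
  local pct="$1" workaround="$2"
  if [ "$pct" -ge 100 ]; then
    echo "SEV0"        # everyone is affected
  elif [ "$pct" -ge 50 ] && [ "$workaround" = "no" ]; then
    echo "SEV1"        # majority affected, no workaround
  elif [ "$pct" -gt 0 ]; then
    echo "SEV2"        # some users affected
  else
    echo "SEV3"        # cosmetic or non-critical
  fi
}

suggest_severity 60 no    # prints SEV1
```

In keeping with the over-declare principle, treat the result as a floor: a human can always raise it, and downgrade later once the blast radius is clear.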
3. Incident Response Roles
Having clearly defined roles prevents the most common incident pitfall: everyone trying to do everything at once, drowning in Slack noise, and nobody communicating with stakeholders.
| Role | Responsibility | Who Usually Fills It |
|---|---|---|
| Incident Commander (IC) | Runs the incident. Makes all decisions. Keeps the response moving. Only person who can declare "resolved." | Most senior on-call engineer, or designated IC from rotation |
| Operations Lead (OL) | Investigates and mitigates. Runs commands, checks dashboards, implements fixes. | On-call engineer for the affected service |
| Communications Lead (CL) | Manages all external communication — status page updates, Slack announcements, customer-facing messages. Shields IC from interruptions. | Engineering manager, TPM, or designated comms person |
| Scribe | Documents everything in real time — timeline, actions taken, hypotheses tested. Critical for postmortem. | Junior engineer, intern, or automated via incident tooling |
In a small team, one person may wear multiple hats, but never combine IC and CL. The IC needs uninterrupted focus on resolution; the CL absorbs all external noise. If your on-call rotation has four or more engineers, rotate the IC role so nobody burns out.
When an incident is declared, the first person on the scene automatically becomes Interim IC until a designated IC joins. They announce in the incident channel:
/incident declare
SEV: [0/1/2/3]
Title: [Brief description]
IC: @username (interim)
Channel: #incident-2026-0625-001
4. The Incident Response Lifecycle
Every incident follows five phases. Your runbook should have a clear procedure for each.
Phase 1: Detection
Incidents are detected through three channels:
- Automated monitoring: Alerts from Prometheus Alertmanager, Datadog, Grafana, or New Relic that fire when SLO burn rates exceed thresholds. If you do not have SLO-based alerting yet, read our Error Budgets Guide first.
- User reports: Customer support tickets, social media complaints, or internal bug reports. Route these to the on-call channel automatically via webhook.
- Engineer observation: A team member notices something wrong during a deployment or code review.
Regardless of how it is detected, the first step is always the same: verify the signal is real. Check the affected dashboard, run a quick smoke test, and confirm you are not chasing a monitoring false positive.
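One way to make "verify the signal is real" concrete is to probe the service several times and require a minimum number of failures before declaring. The sketch below is only an illustration: the helper name `signal_is_real`, the choice of five probes with three failures, and the example URL are all assumptions.

```bash
# Hypothetical helper: treat an alert as real only if at least MIN_FAILS of
# the sampled HTTP status codes are 5xx, or 000 (what curl reports when it
# gets no response at all).
#   $1 = minimum number of failing probes; remaining args = status codes
signal_is_real() {
  local min_fails="$1" fails=0 code
  shift
  for code in "$@"; do
    case "$code" in
      5??|000) fails=$((fails + 1)) ;;
    esac
  done
  [ "$fails" -ge "$min_fails" ]
}

# Usage sketch (placeholder URL): probe five times, one second apart.
# set --
# for _ in 1 2 3 4 5; do
#   set -- "$@" "$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' https://checkout.example.com/health)"
#   sleep 1
# done
# signal_is_real 3 "$@" && echo "Signal confirmed, declare the incident"
```

Sampling a few times filters out a single flaky probe without delaying a real declaration by more than a few seconds.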
Phase 2: Declaration
Once verified, the on-call engineer declares the incident. This triggers:
- Create an incident channel: #incident-YYYY-MMDD-NNN in Slack or Teams
- Page the on-call rotation via PagerDuty/Opsgenie
- Post initial status to the status page (or internal status channel if no public page)
- Assign roles — IC, OL, CL, Scribe
A declaration message looks like:
🚨 INCIDENT DECLARED — SEV2
Title: Checkout API returning 503 errors in us-east-1
IC: @alice (interim, @bob is primary IC joining in 2 min)
OL: @charlie
CL: @diana
Channel: #incident-2026-0625-003
Dashboard: https://grafana.example.com/d/checkout
Runbook: https://wiki.example.com/runbooks/checkout-api
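If you script the declaration step, the channel name and the first lines of the message are easy to generate consistently. The following is a minimal sketch; `incident_channel` and `declaration_header` are hypothetical helper names, and the sequence number is assumed to be a plain integer supplied by whatever tracks incidents for the day.

```bash
# Hypothetical helpers for the declaration step; names are illustrative.

# Build the channel name in the #incident-YYYY-MMDD-NNN format.
#   $1 = date as YYYY-MM-DD, $2 = incident sequence number (plain integer)
incident_channel() {
  local day="$1" seq="$2" year rest month dom
  year="${day%%-*}"      # e.g. 2026
  rest="${day#*-}"       # e.g. 06-25
  month="${rest%%-*}"    # e.g. 06
  dom="${rest#*-}"       # e.g. 25
  printf '#incident-%s-%s%s-%03d\n' "$year" "$month" "$dom" "$seq"
}

# Print the first lines of a declaration message.
#   $1 = severity, $2 = title, $3 = interim IC handle, $4 = channel
declaration_header() {
  printf 'INCIDENT DECLARED: %s\nTitle: %s\nIC: %s (interim)\nChannel: %s\n' \
    "$1" "$2" "$3" "$4"
}

channel=$(incident_channel 2026-06-25 3)
declaration_header SEV2 "Checkout API returning 503 errors in us-east-1" @alice "$channel"
```

Generating the name from the date and a counter keeps channels sortable and avoids two responders creating differently named channels for the same incident.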
Phase 3: Diagnosis & Mitigation
This is the core of incident response. The OL investigates while the IC coordinates. The process follows a structured loop:
- Triage — Isolate the blast radius. Which users? Which region? Which component?
- Hypothesize — Propose a likely cause. "Recent deploy changed the DB connection pool size."
- Test — Validate the hypothesis. Check deploy logs. Roll back if the hypothesis is strong enough.
- Mitigate — Stop the bleeding. Rollback, scale up, failover, feature flag off. Mitigation comes before root cause. A customer does not care why the site is down — they want it back up.
- Verify — Confirm the fix worked. Watch dashboards for 5-10 minutes.
Golden rule: If you have not found the cause in 15 minutes, escalate. Call in more engineers. Wake up the service owner. Do not hero-solo an incident.
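The 15-minute rule is the first rung of a timed escalation ladder. As a sketch, the function below maps severity and elapsed time to an escalation level, using the thresholds from the escalation table in the Section 5 template (15, 30, 45, and 60 minutes). The function name `escalation_level` and the exact boundary handling are assumptions for illustration.

```bash
# Hypothetical helper: which escalation level should be engaged by now?
#   $1 = severity (SEV0..SEV3)
#   $2 = minutes since declaration without mitigation (integer)
escalation_level() {
  local sev="$1" mins="$2"
  if [ "$sev" = "SEV0" ] && [ "$mins" -gt 60 ]; then
    echo "L5"   # CTO
  elif [ "$sev" = "SEV0" ] && [ "$mins" -ge 45 ]; then
    echo "L4"   # Director / VP Engineering
  elif [ "$mins" -ge 30 ]; then
    echo "L3"   # Engineering Manager
  elif [ "$mins" -ge 15 ]; then
    echo "L2"   # Service owner / Tech Lead
  else
    echo "L1"   # On-call engineer
  fi
}

escalation_level SEV2 20   # prints L2
```

A bot or the IC can call something like this on a timer, so escalation happens because the clock says so rather than because someone admits they are stuck.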
Phase 4: Resolution
The IC declares the incident resolved only when:
- Service is restored and verified for at least 10 minutes
- All alerts have returned to normal
- Customers are no longer impacted
- A rollback or permanent fix is in place (not a fragile workaround)
The resolution message:
✅ INCIDENT RESOLVED — SEV2
Duration: 47 minutes (14:03 – 14:50 UTC)
Root cause: Upstream payment gateway outage; checkout had no circuit breaker for upstream failures
Mitigation: Activated retry circuit breaker on payment gateway calls; error rate recovered
Impact: ~12% of checkout requests failed (est. 1,200 affected users)
Postmortem: Scheduled for 2026-06-26 10:00 UTC
Action items: #INC-42, #INC-43
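The duration line is simple arithmetic, but it is easy to get wrong by hand at the end of a long incident. A minimal sketch, assuming start and end times are given as zero-padded HH:MM in UTC and the incident lasts less than 24 hours (the helper names are hypothetical):

```bash
# Convert HH:MM to minutes since midnight. One leading zero is stripped so
# that "08" and "09" are not rejected as invalid octal numbers.
to_minutes() {
  local h="${1%%:*}" m="${1##*:}"
  h="${h#0}"; m="${m#0}"
  echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# Minutes between two HH:MM timestamps.
#   $1 = start, $2 = end
duration_minutes() {
  local s e
  s=$(to_minutes "$1")
  e=$(to_minutes "$2")
  # Add a day when the incident crosses midnight UTC.
  if [ "$e" -lt "$s" ]; then e=$(( e + 1440 )); fi
  echo $(( e - s ))
}

duration_minutes 14:03 14:50   # prints 47
```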
Phase 5: Postmortem (Blameless)
Within 24-48 hours of resolution, hold a blameless postmortem. The goal is not to assign blame — it is to prevent recurrence. We cover the full postmortem template in Section 8.
5. The Runbook Template
Here is the actual runbook template. Copy this into your wiki, Notion, Confluence, or runbook automation tool. Fill in the [PLACEHOLDERS] for each service you own.
# Runbook: [SERVICE NAME]
## Service Overview
- **Owner Team:** [Team Name]
- **On-Call Rotation:** [PagerDuty/Opsgenie escalation policy link]
- **Primary Dashboard:** [Grafana/Datadog link]
- **Logs:** [Kibana/Loki/Splunk link]
- **Source Code:** [GitHub/GitLab link]
- **CI/CD Pipeline:** [GitHub Actions/GitLab CI link]
- **Runbook Last Updated:** [YYYY-MM-DD]
## Dependencies
- **Upstream:** [List services this depends on]
- **Downstream:** [List services that depend on this]
- **External:** [Third-party APIs, databases, CDNs]
## Alert Triggers
| Alert Name | Severity | Threshold | Dashboard Link |
|-----------|----------|-----------|---------------|
| High Error Rate | SEV1 | >5% 5xx for 5 min | [link] |
| High Latency p99 | SEV2 | >2s for 10 min | [link] |
| Pod CrashLoopBackOff | SEV1 | Any pod restarting >3 times | [link] |
| Certificate Expiry | SEV3 | <30 days until expiry | [link] |
## Common Incidents
### 1. High 5xx Error Rate
**Symptoms:** Dashboard shows error rate spike, users report failures
**Likely Causes:**
- Recent deployment introduced bug → Check deploy history
- Upstream dependency failure → Check dependency dashboards
- Database connection pool exhausted → Check DB metrics
- Rate limiting triggered → Check API gateway metrics
**Immediate Actions:**
1. Check recent deployments:
kubectl rollout history deployment/[service-name] -n production
2. Roll back if deployment was within last 30 minutes:
kubectl rollout undo deployment/[service-name] -n production
3. Check upstream dependencies:
curl -s https://[dependency]/health
4. Scale up replicas if traffic spike:
kubectl scale deployment/[service-name] --replicas=10 -n production
5. If none of the above help → escalate to [TEAM NAME], page [PERSON]
### 2. Pods in CrashLoopBackOff
**Symptoms:** `kubectl get pods` shows restarts, deployment not progressing
**Likely Causes:**
- Misconfigured environment variables or secrets
- Missing PersistentVolume or storage issue
- OOMKilled (memory limit too low)
- Readiness/Liveness probe misconfigured
**Immediate Actions:**
1. Check pod logs:
kubectl logs [pod-name] -n production --tail=100
2. Check previous container logs (if crash + restart):
kubectl logs [pod-name] -n production --previous
3. Describe the pod for events:
kubectl describe pod [pod-name] -n production
4. Check resource usage:
kubectl top pod [pod-name] -n production
5. If OOMKilled → increase memory limits and restart
### 3. Certificate Expiry
**Preventative:** Run this check weekly via cron:
```bash
echo | openssl s_client -servername [domain] -connect [domain]:443 2>/dev/null | \
  openssl x509 -noout -dates
```
## Escalation Path
| Level | Who | When | Contact |
|---|---|---|---|
| L1 | On-call engineer | Immediate | PagerDuty rotation |
| L2 | Service owner / Tech Lead | If unresolved after 15 min | Slack @team-leads |
| L3 | Engineering Manager | If unresolved after 30 min | Phone call |
| L4 | Director / VP Engineering | If SEV0 after 45 min | Phone call |
| L5 | CTO | SEV0 lasting >1 hour | Phone call |
## Post-Incident
- Create postmortem doc within 24 hours
- Create action items in issue tracker
- Update this runbook if new causes or fixes were discovered
- Verify monitoring covers the failure mode detected
This template is your starting point. Every service in your organization should have one. Keep it updated — an outdated runbook is worse than no runbook because it wastes time with stale information.
6. Automating the Runbook
A static runbook in a wiki is step one. The real SRE progression is toward automated runbooks, where the on-call engineer receives a pre-filled incident channel with relevant dashboards and diagnostic commands already executed.
PagerDuty + Rundeck Automation
Integrate PagerDuty with Rundeck (or Ansible Automation Platform) to trigger diagnostic jobs automatically when an alert fires:
```yaml
# Sketch of a Rundeck job definition triggered by a PagerDuty webhook
- name: checkout-api-auto-diagnose
  nodefilters:
    filter: 'name: kubernetes-prod'
  sequence:
    keepgoing: true        # keep collecting diagnostics even if one step fails
    strategy: node-first
    commands:
      - exec: kubectl get pods -n production -l app=checkout
      - exec: kubectl top pods -n production -l app=checkout
      - exec: kubectl logs -n production -l app=checkout --tail=50
      - exec: curl -s https://checkout-api/health
```
The output is posted to the incident Slack channel before the on-call engineer even opens their laptop.
Slack Slash Commands
Build Slack slash commands for common incident actions:
/incident declare checkout-api "503 errors in us-east-1" --sev=SEV2
→ Creates #incident-2026-0625-004, posts dashboard links, pages on-call, assigns IC.
/incident diagnose checkout-api
→ Runs kubectl describe, checks recent deployments, posts logs.
/incident resolve
→ Prompts for root cause, duration, impact summary, posts resolution template.
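Behind a slash command there is usually a small handler that validates the arguments before doing anything irreversible such as paging people. The sketch below shows only that validation step for the declare command. It assumes the handler receives the arguments already split (service, quoted title, and a `--sev=` flag); `parse_declare` is a hypothetical name, and the Slack wiring itself is out of scope.

```bash
# Hypothetical argument parser for: /incident declare <service> "<title>" --sev=SEVn
# Prints the parsed fields on one line, or returns 2 on invalid input.
parse_declare() {
  local service="" title="" sev="" arg
  for arg in "$@"; do
    case "$arg" in
      --sev=*) sev="${arg#--sev=}" ;;
      *)
        # First positional argument is the service, the next one the title.
        if [ -z "$service" ]; then service="$arg"; else title="$arg"; fi
        ;;
    esac
  done
  case "$sev" in
    SEV[0-3]) ;;
    *) echo "error: --sev must be one of SEV0..SEV3" >&2; return 2 ;;
  esac
  if [ -z "$service" ] || [ -z "$title" ]; then
    echo 'usage: /incident declare <service> "<title>" --sev=SEVn' >&2
    return 2
  fi
  printf 'service=%s sev=%s title=%s\n' "$service" "$sev" "$title"
}

parse_declare checkout-api "503 errors in us-east-1" --sev=SEV2
```

Rejecting a malformed command up front matters more than it looks: a typo in the severity should produce a usage hint, not a page to the wrong escalation tier.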
GitOps for Runbooks
Store runbooks as Markdown in the same Git repository as the service code. This enforces:
- Version control — Every runbook change is reviewed via PR
- Co-location — Developers update the runbook when they change the service
- CI/CD integration — Runbook validity checks in CI (e.g., lint markdown, verify links)
my-service/
├── src/
├── Dockerfile
├── k8s/
└── RUNBOOK.md # ← Living next to the code
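A runbook validity check in CI can start very small. As one possible sketch, the function below fails the build when RUNBOOK.md is missing any of the section headings from the template in Section 5. The function name `check_runbook_sections` is hypothetical, and the check assumes the sections are written as Markdown headings.

```bash
# Hypothetical CI helper: verify that a runbook contains the expected sections.
#   $1 = path to the runbook file
# Returns 0 when every section heading is present, 1 otherwise.
check_runbook_sections() {
  local file="$1" missing=0 section
  for section in "Service Overview" "Dependencies" "Alert Triggers" \
                 "Common Incidents" "Escalation Path" "Post-Incident"; do
    # Accept any heading level (# through ######) followed by the exact title.
    if ! grep -Eq "^#{1,6} +${section}[[:space:]]*\$" "$file"; then
      echo "MISSING SECTION: $section"
      missing=1
    fi
  done
  return "$missing"
}

# Usage in a CI step:
# check_runbook_sections RUNBOOK.md || exit 1
```

The function reports every missing section before failing, so one CI run is enough to see everything that needs fixing.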
7. Common Pitfalls (and How to Avoid Them)
Even teams with a runbook make these mistakes. Learn from them.
Pitfall 1: The Runbook Is Outdated
Symptom: On-call follows a runbook that references a decommissioned dashboard, a renamed Slack channel, or a service that was migrated six months ago.
Fix: Treat the runbook as code. Require a runbook update as part of every significant deployment or service change. Use a CI check that verifies all links in the runbook return HTTP 200. Set a calendar reminder to audit all runbooks quarterly.
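# CI check: verify that every URL in the runbook returns HTTP 200
# (requires GNU grep for -P; -L follows redirects, --max-time keeps a dead host from hanging the job)
grep -oP 'https?://[^\s)\]]+' RUNBOOK.md | sort -u | \
while read -r url; do
  status=$(curl -sIL --max-time 10 -o /dev/null -w "%{http_code}" "$url")
  if [ "$status" != "200" ]; then
    echo "BROKEN: $url → $status"
    exit 1
  fi
done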
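The quarterly audit can also be enforced mechanically instead of relying on a calendar reminder. The sketch below fails when the "Runbook Last Updated" date from the template is older than a given number of days. It is an illustration only: `runbook_is_fresh` is a hypothetical name, the 90-day limit stands in for "quarterly", and the date arithmetic assumes GNU `date` (for the `-d` option).

```bash
# Hypothetical CI helper: fail when the "Runbook Last Updated" date is too old.
# Assumes GNU date and a line such as "- **Runbook Last Updated:** 2026-06-25".
#   $1 = runbook file, $2 = max age in days, $3 = today as YYYY-MM-DD (optional)
runbook_is_fresh() {
  local file="$1" max_days="$2" today="${3:-$(date -u +%F)}"
  local updated age_days
  updated=$(grep -m1 'Runbook Last Updated' "$file" \
    | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' || true)
  if [ -z "$updated" ]; then
    echo "No 'Runbook Last Updated' date found in $file"
    return 1
  fi
  # Difference between the two dates, in whole days.
  age_days=$(( ( $(date -u -d "$today" +%s) - $(date -u -d "$updated" +%s) ) / 86400 ))
  if [ "$age_days" -gt "$max_days" ]; then
    echo "STALE: $file last updated $age_days days ago (limit $max_days)"
    return 1
  fi
  echo "OK: $file last updated $age_days days ago"
}

# Usage in a scheduled CI job:
# runbook_is_fresh RUNBOOK.md 90 || exit 1
```

Passing "today" as an optional argument keeps the check deterministic in tests while defaulting to the current date in CI.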
Pitfall 2: Too Many Alerts, Wrong Severity
Symptom: The on-call phone buzzes 40 times per night. Engineers develop alert fatigue. A real SEV0 gets lost in the noise.
Fix: Every alert must be actionable and correctly prioritized. If an alert fires and the correct response is "acknowledge and ignore," delete the alert. Use error budgets as the gating mechanism — only page when the error budget is burning too fast.
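"Burning too fast" has a simple definition: the burn rate is the observed error rate divided by the error budget, where the budget is 100% minus the SLO target. A burn rate of 1 spends the budget exactly over the SLO window; higher values spend it faster. The sketch below computes it with `awk`. The function names are hypothetical, and the 14.4 threshold in the example is only one possible fast-burn value (it corresponds to spending 2% of a 30-day budget in one hour).

```bash
# Burn rate = observed error rate / error budget, with budget = 100 - SLO.
#   $1 = SLO target in percent (e.g. 99.9; must be below 100)
#   $2 = observed error rate in percent over the alert window
burn_rate() {
  awk -v slo="$1" -v err="$2" 'BEGIN { printf "%.1f\n", err / (100 - slo) }'
}

# Succeeds (exit 0) when the burn rate is at or above the paging threshold.
#   $3 = burn-rate threshold
should_page() {
  awk -v slo="$1" -v err="$2" -v limit="$3" \
    'BEGIN { exit !((err / (100 - slo)) >= limit) }'
}

burn_rate 99.9 1     # prints 10.0
should_page 99.9 2 14.4 && echo "page the on-call"
```

With a 99.9% SLO, a 0.5% error rate gives a burn rate of about 5, which might warrant a ticket, while 2% gives about 20, which should wake someone up.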
Pitfall 3: Hero Culture
Symptom: One senior engineer tries to solve everything alone. They do not escalate, do not communicate, and 90 minutes later, the SEV2 is now a SEV0.
Fix: Escalation is not weakness — it is process. The runbook's escalation path exists for a reason. The IC's job is to recognize when to pull in more people, not to solo the fix. Institute a hard rule: if the incident is not mitigated within the SLA window, escalation is mandatory, not optional.
Pitfall 4: No Communication During Incidents
Symptom: Stakeholders flood the IC with DMs. "Is it fixed yet?" "When will it be back?" "The CEO is asking." The IC cannot focus on actually fixing the problem.
Fix: The Communications Lead exists for exactly this reason. Their only job is to post status updates at regular intervals (at least every 15 minutes for SEV1, at least every 30 minutes for SEV2) so nobody has to ask. Template:
📢 INCIDENT UPDATE — SEV2 — 14:20 UTC
Status: Still investigating. Checkout API returning 503s.
Mitigation attempted: Rollback to v2.4.0 — no improvement.
Current hypothesis: Upstream payment gateway timeout.
Next update: 14:35 UTC
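The "Next update" line can be filled in automatically from the severity. A minimal sketch, assuming zero-padded HH:MM times in UTC and the maximum intervals stated above (15 minutes for SEV1, 30 for SEV2); giving SEV0 the same 15-minute cadence as SEV1 is an assumption of this example, and `next_update` is a hypothetical helper name.

```bash
# Hypothetical helper: latest acceptable time for the next status update.
#   $1 = current time as HH:MM (UTC), $2 = severity
next_update() {
  local now="$1" sev="$2" interval h m total
  case "$sev" in
    SEV0|SEV1) interval=15 ;;   # SEV0 cadence is an assumption
    SEV2)      interval=30 ;;
    *) echo "no update cadence defined for $sev" >&2; return 1 ;;
  esac
  h="${now%%:*}"; m="${now##*:}"
  # Strip one leading zero so "08" is not rejected as invalid octal.
  h="${h#0}"; m="${m#0}"
  # Wrap around midnight with modulo 1440 (minutes per day).
  total=$(( (${h:-0} * 60 + ${m:-0} + interval) % 1440 ))
  printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

next_update 14:20 SEV1   # prints 14:35
```

Posting earlier than the computed time is always fine; the value is the deadline after which stakeholders will start asking.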
Pitfall 5: Skipping the Postmortem
Symptom: Incident resolved. Everyone is tired. "We'll do the postmortem later." Later never comes.
Fix: Schedule the postmortem during the resolution call. Block 1 hour on everyone's calendar within 48 hours, while memory is fresh. A postmortem held a week later is far less valuable than one held while people still remember the details and the logs and timelines are still easy to pull up. If your incident management tooling does not auto-schedule postmortems, add it as a manual step in your runbook.
8. Blameless Postmortem Template
A postmortem is a written record of what happened, why, and what will change. It is not about assigning fault. Use this template:
# Postmortem: [INCIDENT TITLE]
## Metadata
- **Incident ID:** INC-YYYY-MMDD-NNN
- **Date:** [YYYY-MM-DD]
- **Duration:** [HH:MM – HH:MM UTC] ([N] minutes)
- **Severity:** SEV[0/1/2/3]
- **Incident Commander:** @[name]
- **Postmortem Author:** @[name]
- **Status:** [Draft / Reviewed / Published]
## Summary
[One paragraph: what happened, impact, how it was fixed]
## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:03 | Prometheus alert fired: checkout-api 5xx > 5% |
| 14:05 | @alice acknowledged, began investigation |
| 14:08 | Incident declared SEV2 in #incident-2026-0625-003 |
| 14:12 | Rollback to v2.4.0 attempted — no improvement |
| 14:18 | Upstream payment gateway identified as root cause |
| 14:25 | Payment gateway team paged, confirmed outage on their side |
| 14:35 | Retry circuit breaker activated — error rate dropping |
| 14:47 | All metrics green, 10 min verification passed |
| 14:50 | Incident resolved |
## Root Cause
[Detailed technical explanation. What specific change, failure, or condition triggered the incident?]
## Impact
- **Users affected:** ~1,200 (12% of checkout traffic)
- **Revenue impact:** Estimated $3,400 in lost transactions
- **Data loss:** None
- **Security impact:** None
## What Went Well
- Alert fired within 2 minutes of error rate crossing threshold
- Incident declared within 5 minutes of the alert firing
- Communications Lead posted updates every 15 minutes
- Rollback was attempted quickly even though it didn't help
## What Went Poorly
- Payment gateway was not listed in service dependencies — added 13 min to diagnosis
- No circuit breaker was pre-configured for upstream failures
- Secondary on-call (backup IC) was unreachable for 10 min
## Action Items
| # | Action | Owner | Priority | Due |
|---|--------|-------|----------|-----|
| INC-42 | Add payment gateway to service dependency list and runbook | @charlie | P0 | 2026-06-27 |
| INC-43 | Implement circuit breaker with retry for all upstream calls | @alice | P1 | 2026-07-01 |
| INC-44 | Verify secondary on-call contact info in PagerDuty | @diana | P0 | 2026-06-26 |
| INC-45 | Add synthetic check for payment gateway health | @bob | P2 | 2026-07-15 |
## Lessons Learned
[1-3 sentences capturing the key takeaway for the broader org]
Store postmortems in a shared, searchable location. Over time, they become your organization's institutional memory — patterns emerge, recurring root causes become obvious, and you can justify infrastructure investments with real incident data.
9. Conclusion
An incident management runbook does not prevent incidents. What it does is far more valuable: it compresses the time between "something is wrong" and "it is fixed." It removes the cognitive load of deciding what to do under pressure and replaces it with a muscle-memory procedure.
Start today:
- Pick one service. Write its runbook using the template in Section 5.
- Define your severity levels. Get stakeholder alignment — nobody should argue about SEV during an incident.
- Practice. Run a fire drill. Fake an incident and walk through the runbook. Find the gaps before a real outage does.
- Automate one step. Even something small — an auto-created Slack channel or a diagnostic script — saves minutes during your next SEV2.
The best-run SRE teams do not have fewer incidents. They recover faster, communicate better, and learn more from each one. A runbook is how they do it.
Further Reading
- Error Budgets: Stop Wasting Your SRE Team's Time — Use error budgets to determine when to page and when to let it ride
- SLI vs SLO vs SLA: The Real SRE Guide with Examples — The measurement framework that feeds your incident alerting
- Kubernetes Security Best Practices 2026 — Security incidents are incidents too — lock down your clusters
- Kubernetes Monitoring with Prometheus and Grafana — Build the monitoring stack your runbooks depend on