Introduction
Every company has incidents. The difference between a company that learns from them and one that repeats the same outage every quarter is how they respond and review. SRE formalizes this into two things: a structured incident management process, and a blameless postmortem culture.
Most engineers have been in a war room at 2 AM where nobody knows who's the incident commander, five people are SSH-ing into production simultaneously, and someone is running kill -9 on a random process out of desperation. That is not incident management — it's chaos.
Incident management in SRE gives you a repeatable framework: severity levels so everyone calibrates urgency the same way, a response playbook so nobody has to think about process during the crisis, and a postmortem template so you actually learn from what happened.
This guide covers all three. By the end, you'll have templates you can copy into your own incident response tooling and a clear understanding of what "blameless" actually means in practice — it's not about being nice, it's about being effective.
Incident Severity Levels
Before any incident starts, your team needs a shared definition of how bad it is. Without severity levels, one engineer declares SEV-0 while another shrugs and says "it's just a blip." Calibration matters because severity determines who pages, how fast they respond, and how the postmortem is handled.
Here is a four-level severity framework typical of many SRE organizations in 2026. Treat the thresholds as a starting point and adjust them to your own services:
SEV-0: Critical / All-Hands
Definition: User-facing service is completely down. Revenue loss is measurable per minute. Data loss or corruption is happening. Security breach affecting customer data is in progress.
Response: Page the entire on-call rotation simultaneously. Incident commander declares SEV-0 within 2 minutes. Every minute matters — you are losing money or customer trust irreversibly.
Examples:
- api.yourcompany.com returns 503 for all requests globally
- Primary database is unreachable, causing checkout to fail for 100% of users
- Credential leak detected with active exploitation
SLO context: With a 99.9% availability SLO, your error budget is 0.1% of the measurement window. A complete outage burns that budget in real time: a 30-day month allows about 43 minutes of full downtime, which works out to roughly 2.2 hours per quarter. A single hour-long SEV-0 therefore overspends the monthly budget on its own. Connecting incident severity directly to error budget consumption is why SRE teams track every SEV-0 minute. For a deeper dive into how error budgets drive these decisions, read our Error Budgets SRE Guide.
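The budget arithmetic is easy to check in a few lines of Python. This is a minimal sketch, assuming a 30-day month, a 90-day quarter, and that a SEV-0 is a complete outage (100% of requests failing). The function names are illustrative and do not come from any library.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Minutes of full outage a window can absorb before the SLO is violated."""
    error_budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return error_budget * window_days * MINUTES_PER_DAY


def budget_consumed(slo: float, window_days: int, outage_minutes: float) -> float:
    """Fraction of the window's error budget used by a complete outage."""
    return outage_minutes / allowed_downtime_minutes(slo, window_days)


if __name__ == "__main__":
    monthly = allowed_downtime_minutes(0.999, 30)
    quarterly = allowed_downtime_minutes(0.999, 90)
    print(f"99.9% SLO over 30 days: {monthly:.1f} min of full outage")
    print(f"99.9% SLO over 90 days: {quarterly:.1f} min ({quarterly / 60:.2f} h)")
    share = budget_consumed(0.999, 30, outage_minutes=10)
    print(f"A 10-minute SEV-0 uses {share:.0%} of the monthly budget")
```

Under those assumptions a 99.9% SLO tolerates about 43 minutes of complete outage per 30 days and about 130 minutes per 90 days. A partial outage burns proportionally slower, so scale `outage_minutes` by the fraction of failing requests.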
SEV-1: Major / High Priority
Definition: A significant feature is broken for most users. No workaround exists. Customer-facing impact is widespread but the core service is still online.
Response: Page primary on-call. Secondary on-call joins within 5 minutes. Incident commander assigned within 10 minutes.
Examples:
- Login returns errors for 30% of users
- Search functionality returns empty results for all queries
- Payment processing fails for a specific region (EU down, US up)
SEV-2: Minor / Degraded
Definition: Partial degradation. Workaround exists. Internal tools affected but customer-facing systems operational.
Response: Page primary on-call. Acknowledge within 30 minutes. No mandatory incident commander.
Examples:
- Admin dashboard is slow but not broken
- Non-critical microservice returning elevated 5xx rates (<5%)
- CI/CD pipeline is down, blocking deploys but not customers
SEV-3: Low / Cosmetic
Definition: No user impact. Minor bug, visual glitch, or technical debt that should be tracked.
Response: Create a ticket. Handle during business hours. No paging.
Examples:
- Typo on a documentation page
- Log line formatted incorrectly
- Non-critical cron job failed once
Why Severity Levels Break Without SLOs
Here is the most common failure mode: a team defines SEV-0 through SEV-3, but nobody agrees on what "critical" actually means because they never defined their Service Level Objectives. "Significant feature broken" means different things to the payments team and the blog team.
Tie every severity level to a measurable SLO. A SEV-0 should directly correspond to "we are burning error budget faster than the burn rate threshold allows." This is the bridge between SLI vs SLO vs SLA and your incident process. If you haven't defined your SLIs and SLOs yet, start there before writing your severity definitions.
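One way to make that link concrete is to compute the burn rate and map it to a severity. The sketch below is illustrative only. The 14.4 and 6 values are the burn-rate alerting thresholds commonly cited from Google's SRE Workbook for a 30-day window; mapping them onto SEV levels is an assumption of this example, as are the function names. It covers only the error budget dimension, so impact scope (which feature, which users) still needs human judgment.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 100%")
    return error_rate / budget


# Hypothetical mapping from burn rate to severity; tune it per service.
SEVERITY_THRESHOLDS = [
    (14.4, "SEV-0"),  # at this rate, 1 hour spends 2% of a 30-day budget
    (6.0, "SEV-1"),   # at this rate, 6 hours spend 5% of a 30-day budget
    (1.0, "SEV-2"),   # budget will run out before the window ends
]


def severity_for(error_rate: float, slo: float) -> str:
    """Suggest a severity from the observed error rate and the SLO target."""
    rate = burn_rate(error_rate, slo)
    for threshold, severity in SEVERITY_THRESHOLDS:
        if rate >= threshold:
            return severity
    return "SEV-3"


if __name__ == "__main__":
    # 40% of checkout requests failing against a 99.9% SLO.
    print(severity_for(error_rate=0.40, slo=0.999))
```

The payments team and the blog team can share this function and still get different answers, because each passes its own SLO.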
Incident Response Playbook Template
When the pager goes off, nobody should be thinking about process. They should be thinking about the problem. That's what a playbook gives you: a script so you can focus your brain on debugging, not role assignment.
Here is a runbook template you can adapt. Copy this into your incident management tool — PagerDuty, incident.io, FireHydrant, or even a shared Google Doc. The exact tool doesn't matter. The structure does.
Incident Roles
Every incident needs exactly these roles assigned within the first 5 minutes. Small teams can double up (on a SEV-2, one person can be both incident commander and operations lead), but each role must have a named owner.
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Owns the incident. Makes all decisions. Nobody acts without IC approval. IC is the single point of accountability. |
| Operations Lead (OL) | Runs the technical investigation. Proposes fixes. Executes approved mitigations. |
| Communications Lead (CL) | Manages stakeholder updates, status page posts, customer-facing comms. Keeps external messaging accurate. |
| Scribe | Documents everything in the incident channel or tool. Timestamps every action. |
The IC does not touch production. The IC's job is to coordinate, not to debug. If the IC starts running kubectl commands, nobody is coordinating, and the incident drifts.
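If your incident tooling is scriptable, the role rules can be checked at declaration time. The following is a rough sketch, not a feature of any particular tool: the `RoleAssignment` type and `validate` function are hypothetical, and restricting the combined IC and OL role to SEV-2 and SEV-3 is one reading of the guidance above.

```python
from dataclasses import dataclass, field

ROLES = ("IC", "OL", "CL", "Scribe")
# Assumption: doubling up IC and OL is acceptable only at these severities.
DOUBLE_UP_ALLOWED = {"SEV-2", "SEV-3"}


@dataclass
class RoleAssignment:
    severity: str
    owners: dict = field(default_factory=dict)  # role name -> chat handle


def validate(assignment: RoleAssignment) -> list[str]:
    """Return a list of problems; an empty list means the assignment is valid."""
    problems = []
    for role in ROLES:
        if not assignment.owners.get(role):
            problems.append(f"{role} has no named owner")

    ic = assignment.owners.get("IC")
    ol = assignment.owners.get("OL")
    if ic and ic == ol and assignment.severity not in DOUBLE_UP_ALLOWED:
        problems.append(
            f"IC and OL are both {ic}: the IC must not debug on a {assignment.severity}"
        )
    return problems


if __name__ == "__main__":
    roles = RoleAssignment(
        severity="SEV-1",
        owners={"IC": "@alice", "OL": "@bob", "CL": "@charlie", "Scribe": "@diana"},
    )
    print(validate(roles) or "all roles assigned")
```

Returning a list of problems instead of raising lets the IC see every gap at once and fix them in a single message.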
Step-by-Step Response Playbook
Phase 1: Triage (Minutes 0-5)
[ ] IC: Declare the incident in the #incidents channel
[ ] IC: Assign severity (SEV-0/1/2/3) based on impact
[ ] IC: Assign all roles — name each role owner explicitly
[ ] OL: Start the investigation — check dashboards, logs, recent deploys
[ ] CL: Draft initial status page update, even if it only says "We are investigating"
[ ] Scribe: Start the incident timeline in a shared document
Phase 2: Investigation (Minutes 5-30)
[ ] OL: State hypothesis aloud: "I think X is happening because Y"
[ ] OL: Check recent changes first — deploys, config changes, infra changes
[ ] OL: Correlate: did the alert fire at the same time as a deploy?
[ ] OL: Check dependencies: is the database up? Is the upstream API responding?
[ ] IC: Enforce 15-minute checkpoints — if no progress, escalate or swap OL
[ ] CL: Send first status update to stakeholders (if SEV-0 or SEV-1)
[ ] Scribe: Log every action with timestamp: "14:03 OL checked DB — healthy"
Phase 3: Mitigation (Minutes 30-90)
[ ] OL: Propose mitigation to IC — state what, how, and risk
[ ] IC: Approve or reject mitigation. IC owns the decision.
[ ] OL: Execute mitigation with IC aware
[ ] OL: Verify mitigation — did the error rate drop? Did latency recover?
[ ] IC: If mitigation fails, declare reset and go back to investigation
[ ] CL: Update status page: "Mitigation applied, monitoring results"
Phase 4: Resolution (Variable)
[ ] IC: Declare incident resolved when monitoring shows green for 5+ minutes
[ ] CL: Post resolution notice: "Incident resolved at HH:MM UTC. Postmortem to follow."
[ ] IC: Set postmortem date — within 48 hours, no exceptions
[ ] Scribe: Lock the timeline document — it is now immutable evidence
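The "green for 5+ minutes" rule in Phase 4 can be made mechanical. Below is a small sketch of that check, assuming you can pull `(timestamp, error_rate)` samples from your monitoring system, sorted oldest first and complete up to the current time. The 1% health threshold and the function names are placeholders, not recommendations.

```python
from datetime import datetime, timedelta


def green_since(samples, max_error_rate):
    """Return the start of the current unbroken healthy streak, or None."""
    streak_start = None
    for timestamp, error_rate in samples:
        if error_rate <= max_error_rate:
            if streak_start is None:
                streak_start = timestamp
        else:
            # Any unhealthy sample resets the streak.
            streak_start = None
    return streak_start


def can_declare_resolved(
    samples,
    now: datetime,
    max_error_rate: float = 0.01,
    green_window: timedelta = timedelta(minutes=5),
) -> bool:
    """True when monitoring has been healthy for at least green_window."""
    start = green_since(samples, max_error_rate)
    return start is not None and now - start >= green_window
```

A relapse resets the clock, which matches the playbook: if the mitigation does not hold, the IC goes back to investigation instead of declaring resolution.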
What the Playbook Looks Like in Practice
In your incident channel, it should read like this:
IC [14:01]: SEV-1 declared. checkout-api returning 500s, 40% error rate.
IC: @alice. OL: @bob. CL: @charlie. Scribe: @diana.
Phase 1 starting.
OL [14:03]: Hypothesis: deploy #847 3min ago changed the DB connection pool.
Checking DB metrics now.
OL [14:05]: Confirmed. Connection pool exhausted. New deploy halved the max
connections. No alerts on DB side because pool exhaustion is silent.
OL [14:07]: Proposed mitigation: rollback deploy #847. Risk: low, it restores
the previous known-good state.
IC [14:08]: Approved. Execute rollback.
OL [14:09]: Rollback triggered. Monitoring recovery...
OL [14:12]: Error rate down to 0.2%. Latency normalizing.
IC [14:17]: 5 minutes green. Declaring resolved at 14:17 UTC.
Postmortem scheduled for June 28, 10:00 UTC.
Notice what's absent: blame. Nobody asked "who deployed that?" or "why wasn't this caught?" Those are postmortem questions. During the incident, the only question is "what's broken and how do we fix it?"
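A channel log in that shape is regular enough for the Scribe to convert into a postmortem timeline mechanically. Here is a minimal sketch, assuming every entry starts with a role, a bracketed HH:MM timestamp, and a colon, and that lines without that prefix continue the previous entry. The function names are made up for this example.

```python
import re

# Matches lines such as "OL [14:03]: Hypothesis: deploy #847 ..."
ENTRY = re.compile(r"^(IC|OL|CL|Scribe) \[(\d{2}:\d{2})\]: (.*)$")


def parse_channel_log(text: str) -> list[dict]:
    """Split a channel log into timestamped entries."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match:
            role, time, message = match.groups()
            entries.append({"time": time, "role": role, "message": message})
        elif entries:
            # No timestamp prefix: treat the line as a continuation.
            entries[-1]["message"] += " " + line
    return entries


def to_markdown_table(entries: list[dict]) -> str:
    """Render entries as a markdown timeline table."""
    rows = ["| Time | Role | Event |", "|------|------|-------|"]
    for entry in entries:
        message = entry["message"].replace("|", "\\|")  # keep table cells intact
        rows.append(f"| {entry['time']} | {entry['role']} | {message} |")
    return "\n".join(rows)
```

Generating the table from the log keeps the postmortem timeline identical to what was recorded during the incident, so nobody has to reconstruct it from memory.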
For a deeper template covering the full runbook lifecycle including on-call rotations and escalation policies, see our Incident Management Runbook Template.
The Blameless Postmortem Template
The postmortem is where incidents become learning. Google's SRE book devotes a full chapter to it, "Postmortem Culture: Learning from Failure," and treats postmortems as an essential SRE tool. It is not a blame document. It is not a performance review. It is a forensic analysis of what the system did, why it did it, and how to prevent recurrence.
Postmortem Structure
Every postmortem should contain these sections, in this order. Write it as a narrative, not a checklist — but include every section.
# Postmortem: [Incident Title]
**Date:** YYYY-MM-DD
**Duration:** Start — End (UTC)
**Severity:** SEV-0 / SEV-1 / SEV-2
**Authors:** [Name], [Name]
**Status:** Draft / Review / Final
## Executive Summary
One paragraph. What happened, impact, root cause, and fix. Write this for a VP
who will only read the first 100 words.
## Timeline (All Times UTC)
| Time | Event |
|------|-------|
| 14:01 | Alert fired: checkout-api 5xx rate > 5%. IC declared SEV-1, roles assigned |
| 14:03 | OL hypothesized deploy #847 as trigger |
| 14:05 | OL confirmed connection pool exhaustion caused by deploy #847 |
| 14:07 | Rollback proposed |
| 14:08 | Rollback approved by IC |
| 14:09 | Rollback executed |
| 14:12 | Error rate down to 0.2% |
| 14:17 | Incident resolved |
## What Went Well
List at least 3 things. This section is mandatory — it trains the team to see
incidents as systems that mostly work, with specific gaps.
Example:
- Alert fired within 2 minutes of error rate exceeding threshold
- OL correctly identified the deploy as the trigger in under 5 minutes
- Rollback was fast and uneventful — CI/CD pipeline performed as designed
## What Went Wrong
List at least 3 things. Be specific. "Communication was bad" is useless. "The
IC did not announce the severity until minute 8, 6 minutes past the 2-minute
SEV-0 declaration target" is actionable.
Example:
- No pre-deploy canary caught the connection pool change
- DB connection pool metrics were not monitored — pool exhaustion was silent
- The deploy pipeline does not pause on anomalous error rates
## Root Cause Analysis (5 Whys)
Start with the symptom, ask "why" five times.
1. Why were users seeing 500 errors? → The checkout-api returned 500s.
2. Why was checkout-api returning 500s? → The database connection pool was exhausted.
3. Why was the pool exhausted? → Deploy #847 halved the max connection limit.
4. Why did the deploy change the connection limit? → The config default was overridden accidentally in a PR refactor, and no review caught it.
5. Why didn't review catch it? → Connection pool config is not part of the PR checklist; there is no automated diff for infrastructure config changes.
Root cause: **No automated validation of infrastructure config changes in CI.**
## Action Items
Each item must have an owner and a due date. No "we should look into this."
Every item is a closed-loop task.
| # | Action | Owner | Due Date | Priority |
|---|--------|-------|----------|----------|
| 1 | Add connection pool metrics to Prometheus + alert on >80% utilization | @bob | Jul 5 | P0 |
| 2 | Add pre-deploy canary that ramps traffic over 5 minutes | @alice | Jul 12 | P0 |
| 3 | Add infrastructure config diff to PR checklist + CI check | @charlie | Jul 5 | P1 |
| 4 | Update deploy pipeline to auto-rollback on error rate spike | @bob | Jul 19 | P1 |
## Appendix: Graphs, Logs, Dashboards
Include screenshots of the dashboard during the incident, relevant log excerpts,
and links to the monitoring dashboards. Future you (or the next on-call) needs
these for pattern matching.
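The action item rules in the template (an owner, a due date, no "we should look into this") can be linted before the read-out meeting. This is a sketch under stated assumptions: the `ActionItem` type, the list of vague phrases, and the function name are all hypothetical, so adapt them to whatever your tracker exports.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed examples of wording that signals an open-ended task.
VAGUE_PHRASES = ("look into", "investigate further", "be more careful")


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    priority: str = "P1"


def lint_action_item(item: ActionItem) -> list[str]:
    """Return the reasons an action item is not a closed-loop task."""
    problems = []
    if not item.owner:
        problems.append("missing owner")
    if item.due is None:
        problems.append("missing due date")
    lowered = item.action.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            problems.append(f"vague wording: '{phrase}'")
    return problems


if __name__ == "__main__":
    draft = ActionItem(action="We should look into connection pool sizing")
    for problem in lint_action_item(draft):
        print(f"- {problem}")
```

A check like this will not judge whether an action is the right fix, but it stops ownerless or open-ended items from reaching the "Final" document.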
Postmortem Timeline Rule
The postmortem must be written within 48 hours of incident resolution. After 48 hours, memory degrades. The timeline becomes fuzzy. Engineers rationalize their decisions differently. The Scribe's timeline document from the incident phase is your source of truth — use it verbatim.
Who Attends the Postmortem Review
- Required: IC, OL, CL, Scribe
- Required: Service owner (if not already in the room)
- Optional: Affected stakeholders, engineering manager
- Never: Anyone whose presence would make people defensive. If a VP's attendance shuts down honest analysis, do a pre-read with them separately.
The Postmortem Read-Out Meeting
The postmortem document is written asynchronously. The meeting is a 30-minute synchronous review where the authors present their findings, action items are discussed and prioritized, and the document is marked "Final."
One anti-pattern: a 90-minute meeting where the document gets written live while 12 people watch. Write it first. Meet to review. Move on.
Building a Blameless Culture
"Blameless" is the most misunderstood word in SRE. It does not mean "nobody is accountable." It means the system failed, not the human.
When a nurse administers the wrong medication in a hospital, a blameless investigation doesn't say "the nurse was careless." It asks: were the medication labels too similar? Was the pharmacy shelf organized confusingly? Was the nurse working a 14-hour shift? The nurse made the error, but the system created the conditions for the error to be possible.
Software is the same. An engineer pushed a bad deploy — but why didn't the CI pipeline catch it? Why wasn't there a canary? Why was it possible to push to production on a Friday at 5 PM? The engineer pulled the trigger, but the system loaded the gun and aimed it.
The Practical Rules of Blamelessness
These are not aspirational values. They are rules you enforce in every postmortem conversation.
Rule 1: No "who" questions. Ask "what happened" and "why did the system allow this," never "who did this." Replace "who deployed that?" with "what process allowed that deploy to reach production?"
Rule 2: Assume good intent. Every engineer who pushed code, approved a PR, or ran a command was trying to do their job correctly. Start from that assumption. If you find evidence otherwise (rare), handle it in a 1:1, not a postmortem.
Rule 3: Focus on systemic fixes, not individual remediation. "Bob will be more careful next time" is not an action item and it is not a fix. The entire point of a postmortem is to find changes to the system — code, config, process, tooling — that prevent recurrence regardless of who is on call.
Rule 4: The postmortem document belongs to the authors. It is their narrative. If a manager or VP wants to add something, they can comment, but the authors own the final text. This preserves psychological safety.
Rule 5: Share postmortems widely. Every postmortem should be readable by every engineer in the company. When the payments team has a postmortem, the auth team reads it. Cross-team learning is the compound interest of postmortems.
What Blameless Culture Is NOT
| What People Think | What It Actually Means |
|---|---|
| Nobody is held accountable | Everyone is accountable — to the system, not to a manager's judgment |
| We don't talk about mistakes | We talk about mistakes openly, in structured forums, with documented outcomes |
| Postmortems are optional | Postmortems are mandatory for SEV-0, SEV-1, and SEV-2 incidents — no exceptions |
| "It was my fault" is the right answer | "Here is what the system allowed" is the right answer |
How to Start When Your Culture Is Not Blameless
Most companies do not have a blameless culture. They have a culture where postmortems are called "root cause analysis" and someone gets a talking-to afterward. Changing this is hard. Here is what works:
- Start with one team. Don't try to change the entire org. Pick one team. Run blameless postmortems for 3 months. Show the data: fewer repeat incidents, faster recovery, better action items.
- Get leadership to model it. When a director says in a postmortem review "I would have made the same mistake — let's fix the tooling," it signals safety more than any policy document.
- Make the output visible. Share the postmortems. When other teams see "wait, they had an incident, wrote it up, nobody got fired, and they actually fixed things," adoption spreads.
- Never retroactively blame. If you run blameless postmortems for 6 months and then fire someone over an incident, you have destroyed all trust permanently. The commitment is binary — either you are blameless or you are not.
For a broader perspective on how SRE culture fits into the wider engineering landscape — from DevOps practices to platform engineering — see our SRE vs DevOps vs Platform Engineering comparison.
Tooling for Incident Management in 2026
The process matters more than the tool, but the right tool makes the process invisible. Here is the 2026 landscape:
| Tool | Best For | Notes |
|---|---|---|
| incident.io | Slack-native incident management | Declare incidents, assign roles, auto-generate postmortems — all in Slack. Best UX in 2026. |
| FireHydrant | Full lifecycle management | Incident declaration → runbooks → status pages → postmortems. Strong integrations. |
| PagerDuty | On-call + alerting | Still the standard for on-call rotations and alert routing. Incident management features are newer but solid. |
| Grafana Incident | Teams already on Grafana | Native integration with Grafana dashboards, alerts, and SLO tracking. Great if you're in that ecosystem. |
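| Opsgenie (Atlassian) | Jira-native teams | Deep Jira integration, but Atlassian has announced Opsgenie's end of support and is moving its capabilities into Jira Service Management. Treat it as a migration path, not a new adoption. |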
Observability During Incidents
During a SEV-0, you need observability that answers questions in seconds, not minutes. Three things matter:
- Dashboards that load fast. If your Grafana dashboard takes 30 seconds to render during an incident, you have a dashboard problem. Pre-aggregate. Cache. Use recording rules in Prometheus.
- Traces that show the full path. When checkout-api returns 500s, you need to know if it's the API itself, the database it calls, or the upstream payment processor. Distributed tracing gives you that in one view. We covered full OpenTelemetry tracing setup in our OpenTelemetry Tutorial.
- Logs you can query without learning a query language. If your on-call engineer has to read LogQL documentation during an incident, your logging tool is wrong. Structured logging with intuitive search (Loki with label-based filtering, or a managed alternative) matters.
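Searchable logs start with how the service emits them. As a rough illustration of structured logging, the sketch below uses only Python's standard `logging` and `json` modules to write one JSON object per line. The `JsonFormatter` class, the `fields` convention for extra data, and the sample values (deploy ID, pool sizes) are assumptions made for this example.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} is merged into the output.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout-api")
    log.error(
        "db connection pool exhausted",
        extra={"fields": {"deploy": "847", "pool_in_use": 50, "pool_max": 50}},
    )
```

With fields like `deploy` and `pool_max` emitted as keys, an on-call engineer can filter on them directly instead of writing pattern matches against free text.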
Security Incidents: A Special Case
Security incidents (breaches, credential leaks, unauthorized access) require a modified playbook. The postmortem is still blameless, but the investigation phase has additional constraints:
- Do not discuss in public Slack channels. Use a restricted channel or encrypted communication.
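- Preserve evidence. Do not delete logs or terminate instances until evidence is collected. If credentials must be rotated immediately to contain active exploitation, record exactly what was rotated and when so the forensic timeline stays intact.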
- Legal and compliance may be involved. The CL role expands to include legal liaison.
For security-specific hardening that prevents many incident triggers, review our Kubernetes Security Best Practices guide — covering Pod Security Standards, RBAC, network policies, and container scanning with Trivy.
Conclusion
The difference between a team that burns out and a team that gets better is what happens after the incident resolves. A structured incident management process means your engineers spend their mental energy on the problem, not on process. A blameless postmortem means every incident is an investment in system resilience.
The three things to take away:
- Define severity levels before you need them. Tie SEV-0 directly to error budget burn. Without SLOs, severity is arbitrary.
- Use the playbook. Incident commander does not touch production. Operations lead runs the investigation. Communications lead manages stakeholders. Scribe documents everything. Roles prevent chaos.
- Write postmortems within 48 hours. Use the 5 Whys. Produce action items with owners and due dates. Share them widely. Never use them to evaluate individuals.
Incidents are inevitable. Learning from them is optional. The SRE practices in this guide make the difference.