How to Write a Post-Mortem / Incident Report in English
Templates, phrases, and structure for writing blameless post-mortem reports in English: timeline, root cause analysis, impact statement, and action items. With real examples for DevOps and SRE engineers.
When production goes down at 3am, the incident itself is only half the story. The other half — the one that prevents the next outage — is the post-mortem report you write after the chaos settles.
For non-native English speakers on engineering teams, writing a post-mortem presents two challenges at once: you need to communicate technical detail precisely and use the right professional register. Get either one wrong and the document loses its value.
This guide covers the structure, vocabulary, and phrasing of blameless post-mortems — the standard format used at Google, Atlassian, PagerDuty, and most modern engineering organisations.
What is a post-mortem?
A post-mortem (also: incident report, incident retrospective, or PIR — Post-Incident Review) is a structured document that analyses a significant incident after it has been resolved. It answers:
- What happened and when (timeline)
- Why it happened (root cause analysis)
- How many users were affected and for how long (impact)
- What actions will prevent a recurrence (action items)
The goal is learning, not blame. The blameless culture pioneered by Google’s SRE team rests on the principle that incidents are caused by system failures and process gaps — not individual incompetence. The language of your post-mortem must reflect this.
The anatomy of a post-mortem
1. Title and metadata
Title: Checkout Service Outage — 2026-03-15
Severity: SEV-1
Duration: 47 minutes (14:22–15:09 UTC)
Author: Jane Smith, Sr. SRE
Reviewers: Backend Lead, CTO
Status: Draft → In Review → Closed
Language tips:
- Use the format `[Service Name] Outage / Degradation / Incident — [Date]` for the title
- Severity levels vary by organisation: SEV-1/SEV-2, P0/P1, Critical/Major
- Always use UTC timestamps in incident reports to avoid timezone confusion
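If any of your incident tooling is scripted, it is worth generating timestamps in UTC at the source rather than converting later. A minimal Python sketch (the format string is just one common convention, not a standard):

```python
from datetime import datetime, timezone

# Capture the current time directly in UTC, never in local time.
now = datetime.now(timezone.utc)
print(now.strftime("%Y-%m-%d %H:%M UTC"))  # e.g. "2026-03-15 14:22 UTC"
```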
2. Summary (Executive Summary)
A 3–5 sentence overview written for a non-technical reader. Covers what failed, when, the impact, and the key cause.
Template:
On [date], the [service/system] experienced [type of degradation] beginning at [time UTC]. The incident lasted [duration] and affected approximately [N% of users / N users / specific functionality]. The root cause was [brief description]. The incident was resolved by [mitigation action].
Real example:
On 2026-03-15, the Checkout Service experienced a complete outage beginning at 14:22 UTC. The incident lasted 47 minutes and prevented all users from completing purchases. The root cause was a misconfigured database connection pool introduced in the 14:18 deployment. The service was restored by rolling back to the previous release version.
Key vocabulary:
- experienced an outage — the service was completely unavailable
- experienced degradation — the service was partially working (slower, errors for some users)
- affected — which users or functionality were impacted
- root cause was — the underlying reason (not just the proximate trigger)
- resolved by — the action that restored service
3. Timeline
The timeline is the chronological record of what happened, when, and who acted. It is the most factual section — avoid interpretation here.
Format:
14:18 UTC Deployment of version 2.4.1 started (engineer: J. Smith)
14:22 UTC Error rate on /checkout endpoint rose above 5%
14:23 UTC PagerDuty alert fired: "Checkout error rate > 1%"
14:27 UTC On-call engineer acknowledged the page
14:31 UTC Database connection pool exhaustion confirmed as likely cause
14:44 UTC Decision taken to roll back to version 2.4.0
14:47 UTC Rollback deployment started
15:02 UTC Rollback complete; error rate returned to baseline
15:09 UTC Incident declared resolved; monitoring verified stable
Language tips for the timeline:
Use simple past tense and passive voice for events — it keeps the focus on what happened, not who did it:
| Blameful ❌ | Blameless ✅ |
|---|---|
| "John deployed broken code" | "Version 2.4.1 was deployed" |
| "The on-call engineer ignored the alert" | "The alert went unacknowledged for 4 minutes" |
| "Someone misconfigured the pool" | "The connection pool was misconfigured with a value of 5 (expected: 50)" |
Common timeline verbs:
- started / began / was initiated
- fired / triggered / was raised (for alerts)
- was acknowledged / was escalated
- was identified / was confirmed / was reproduced
- was mitigated / was resolved / was restored
- was declared (incident declared resolved)
4. Root Cause Analysis (RCA)
This is the most technically demanding section — and the most important. It explains why the incident occurred, not just what happened.
The 5 Whys technique:
The goal is to trace the chain of causality from the symptom back to the systemic root cause.
Why did the checkout service fail? → Because the database connection pool was exhausted.
Why was the connection pool exhausted? → Because it was configured with `max_connections: 5` instead of `50`.
Why was it misconfigured? → Because the default value in the new config template was changed without documentation.
Why was it undocumented? → Because there is no required review step for default config changes.
Root cause: Configuration templates can be modified without a mandatory review step that checks critical parameters against expected ranges.
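A root cause phrased this way points directly at its fix: an automated check. As a hedged illustration, here is a minimal Python sketch of such a validation step; `EXPECTED_RANGES`, `validate_config`, and the range values are hypothetical names invented for this example, not part of any real tool:

```python
# Hypothetical CI validation step: check critical config values against
# expected ranges before a deployment can proceed. Parameter names and
# ranges are illustrative, not a real tool's schema.
EXPECTED_RANGES = {
    "max_connections": (20, 200),
}

def validate_config(config: dict) -> list[str]:
    """Return a list of violations; an empty list means the config passes."""
    violations = []
    for key, (low, high) in EXPECTED_RANGES.items():
        value = config.get(key)
        if not isinstance(value, int) or not low <= value <= high:
            violations.append(f"{key}={value!r} outside expected range [{low}, {high}]")
    return violations

# The value that caused the incident would have been caught:
print(validate_config({"max_connections": 5}))
# -> ["max_connections=5 outside expected range [20, 200]"]
```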
Useful phrases for the RCA section:
- “The root cause was…”
- “The proximate cause was [X]; the underlying cause was [Y].”
- “This was triggered by… and compounded by…”
- “The failure mode was…”
- “The system lacked a safeguard to prevent…”
- “No alerting existed for…”
- “The configuration change went undetected because…”
Contributing factors (also: contributing causes):
“Several contributing factors amplified the impact: (1) the staging environment uses an in-memory database and did not reproduce the connection pool behaviour; (2) the deployment occurred during peak traffic hours.”
5. Impact
Quantify the blast radius. Be honest — underreporting damages trust.
Dimensions to cover:
- Duration — exactly how long the issue lasted
- User impact — percentage or number of affected users; specific features affected
- Business impact — revenue, SLA breach, customer complaints, data loss (if any)
- Geographic / segment impact — all users, or only a subset?
Template:
The outage lasted 47 minutes (14:22–15:09 UTC). During this window, 100% of checkout attempts failed across all regions. Based on average transaction volume, approximately 1,900 transactions were blocked. No data loss occurred. The incident triggered an SLA breach notification to three enterprise customers.
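A figure like "approximately 1,900 transactions" should be reproducible, not guessed. A minimal sketch of the estimate behind the example above, where the baseline rate is an assumed number chosen to match:

```python
# Illustrative estimate only: the baseline rate is an assumption,
# chosen to match the example figures above.
avg_checkouts_per_minute = 40
outage_minutes = 47

blocked = avg_checkouts_per_minute * outage_minutes
print(blocked)  # 1880 -> reported as "approximately 1,900"
```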
Key vocabulary:
- blast radius — the scope of impact; how widely an issue spread
- affected N% of users / impacted N users
- all regions / all users / users in [region]
- 100% error rate / partial degradation
- no data loss occurred / data integrity was maintained
- SLA breach — the incident exceeded defined service level agreement thresholds
- customer-facing — visible to end users (as opposed to internal)
6. Action Items
The most actionable section. Each item must be specific, assigned to a person, and time-bounded.
Format:
| Action | Owner | Priority | Due |
|---|---|---|---|
| Update config template defaults and add inline comments for all critical parameters | J. Smith | High | 2026-03-22 |
| Add automated validation step in CI to check connection pool config ranges | DevOps team | High | 2026-03-29 |
| Create staging database that mirrors production connection limits | Platform team | Medium | 2026-04-05 |
| Add metric and alert for connection pool utilisation > 80% | SRE | Medium | 2026-03-25 |
| Schedule incident review presentation for the full engineering team | EM | Low | 2026-03-20 |
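To make the "pool utilisation > 80%" item concrete, here is a minimal Python sketch of the check behind such an alert. In practice this logic would live in your monitoring stack; the function name, threshold constant, and sample numbers are assumptions for illustration:

```python
# Hypothetical check behind a "pool utilisation > 80%" alert.
ALERT_THRESHOLD = 0.80

def pool_utilisation(active: int, max_connections: int) -> float:
    """Fraction of the connection pool currently in use."""
    return active / max_connections

if pool_utilisation(active=45, max_connections=50) > ALERT_THRESHOLD:
    print("ALERT: connection pool utilisation above 80%")
```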
Language for action items:
Always use the imperative (command) form for action item descriptions — they are instructions, not observations:
| Observation (wrong register) | Action item (correct register) |
|---|---|
| "The config template was confusing." | "Rewrite the connection pool config template with inline documentation." |
| "There was no alert." | "Add a PagerDuty alert for pool utilisation > 80%." |
| "Staging didn't match production." | "Configure staging to use connection pool limits that match production." |
Blameless language: a guide
The hardest part of post-mortem writing for non-native speakers is often the cultural expectation of blamelessness. In many engineering cultures, direct attribution of failure to a person (“X broke Y”) is normal. In Western tech post-mortem culture, this is seen as counterproductive.
The principle: people do not cause incidents — systems, processes, and insufficient safeguards do.
Rewrite these phrases:
| Blameful ❌ | Blameless ✅ |
|---|---|
| "The developer pushed untested code." | "The code reached production without adequate test coverage." |
| "The DBA dropped the wrong table." | "A `DROP TABLE` command was executed against the production database without a dry-run check." |
| "The engineer forgot to update the runbook." | "The runbook had not been updated to reflect the new deployment process." |
| "Nobody noticed the alert." | "The alert was not acknowledged within the escalation window." |
| "She made a mistake." | "The configuration contained an error that was not caught by pre-deployment validation." |
The subject shifts from a person to a process, system, configuration, or practice.
Vocabulary quick reference
| Term | Definition |
|---|---|
| post-mortem | Document that analyses an incident after resolution; can also be a meeting |
| incident | An unplanned event that disrupts normal service operation |
| outage | Complete unavailability (100% failure rate) |
| degradation | Partial reduction in service quality (elevated error rate, high latency) |
| root cause | The underlying systemic reason for the incident |
| proximate cause | The immediate trigger (the thing that actually broke) |
| contributing factor | A condition that made the incident worse or more likely |
| mitigation | A temporary action that reduces impact before the root cause is fixed |
| remediation | The permanent fix that addresses the root cause |
| rollback | Reverting to a previous version of code or configuration |
| blast radius | The scope of who or what was affected |
| on-call | The engineer responsible for responding to alerts during a given rotation |
| SLA / SLO / SLI | Service Level Agreement / Objective / Indicator — contractual and operational availability targets |
| MTTR | Mean Time to Restore — the average time to resolve an incident |
| MTTD | Mean Time to Detect — the average time to discover an incident |
| blameless | Without attributing fault to individuals; focused on systems and processes |
| action item | A specific, assigned, time-bounded task resulting from the post-mortem |
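MTTD and MTTR are plain averages over incident timestamps, which makes them easy to compute from your timelines. A minimal sketch; the first row matches the example incident above, the second is invented for illustration:

```python
from datetime import datetime

# Hypothetical incident log: (started, detected, resolved), all UTC.
incidents = [
    ("14:22", "14:23", "15:09"),
    ("09:10", "09:13", "09:51"),
]

def minutes_between(earlier: str, later: str) -> float:
    fmt = "%H:%M"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds() / 60

mttd = sum(minutes_between(s, d) for s, d, _ in incidents) / len(incidents)
mttr = sum(minutes_between(s, r) for s, _, r in incidents) / len(incidents)
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")  # MTTD: 2 min, MTTR: 44 min
```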
Template
Copy and adapt this template for your team:
# [Service Name] [Outage / Degradation / Incident] — [YYYY-MM-DD]
**Severity:** SEV-1 / SEV-2
**Duration:** [N] minutes ([HH:MM]–[HH:MM] UTC)
**Author:** [Name]
**Status:** Draft
---
## Summary
[3–5 sentence overview of what failed, when, the impact, and the key cause.]
---
## Timeline
| Time (UTC) | Event |
|---|---|
| HH:MM | [Event description in past tense] |
| HH:MM | [Event description] |
---
## Root Cause Analysis
The root cause was [description].
**5 Whys:**
1. Why did [symptom] occur? → [Reason 1]
2. Why did [Reason 1] occur? → [Reason 2]
...
Root cause: [Systemic gap]
**Contributing factors:**
- [Factor 1]
- [Factor 2]
---
## Impact
Duration: [N] minutes
Users affected: [N% / all users / users in region X]
Features affected: [specific features or API endpoints]
Data loss: None / [description]
SLA breach: Yes / No
---
## Action Items
| Action | Owner | Priority | Due |
|---|---|---|---|
| [Specific action in imperative form] | [Name] | High/Medium/Low | [Date] |
---
## Lessons Learned
- [What we learned about our system]
- [What we learned about our process]
- [What went well during response]
What good post-mortems have in common
✓ Specific, not vague — “connection pool exhausted at 14:31 UTC with 0/5 available connections” is more useful than “database issues”
✓ Quantified — “47 minutes”, “1,900 blocked transactions”, “3 SLA notifications” beats “significant downtime”
✓ Systemic root cause — stops at “no validation step existed”, not at “the config was wrong”
✓ Actionable items — “Add CI check for pool config ranges by 2026-03-29 (Owner: DevOps)” not “improve the deployment process”
✓ Blameless tone — systems and processes fail, not people
✓ Written while memory is fresh — start within 24 hours, complete within 48–72 hours
Writing a thorough post-mortem is one of the highest-value activities a senior engineer can do. It turns a painful incident into institutional knowledge. The clearer your English, the more that knowledge transfers to the whole team.