How to Write a Post-Mortem / Incident Report in English
Templates, phrases, and structure for writing blameless post-mortem reports in English: timeline, root cause analysis, impact statement, and action items. With real examples for DevOps and SRE engineers.
When production goes down at 3am, the incident itself is only half the story. The other half — the one that prevents the next outage — is the post-mortem report you write after the chaos settles.
For non-native English speakers on engineering teams, writing a post-mortem presents two challenges at once: you need to communicate technical detail precisely and use the right professional register. Get either one wrong and the document loses its value.
This guide covers the structure, vocabulary, and phrasing of blameless post-mortems — the standard format used at Google, Atlassian, PagerDuty, and most modern engineering organisations.
What is a post-mortem?
A post-mortem (also: incident report, incident retrospective, or PIR — Post-Incident Review) is a structured document that analyses a significant incident after it has been resolved. It answers:
- What happened and when (timeline)
- Why it happened (root cause analysis)
- How many users were affected and for how long (impact)
- What actions will prevent a recurrence (action items)
The goal is learning, not blame. The blameless culture pioneered by Google’s SRE team rests on the principle that incidents are caused by system failures and process gaps — not individual incompetence. The language of your post-mortem must reflect this.
The anatomy of a post-mortem
1. Title and metadata
Title: Checkout Service Outage — 2026-03-15
Severity: SEV-1
Duration: 47 minutes (14:22–15:09 UTC)
Author: Jane Smith, Sr. SRE
Reviewers: Backend Lead, CTO
Status: Draft → In Review → Closed
Language tips:
- Use the format `[Service Name] Outage / Degradation / Incident — [Date]` for the title
- Severity levels vary by organisation: SEV-1/SEV-2, P0/P1, Critical/Major
- Always use UTC timestamps in incident reports to avoid timezone confusion
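If any of your incident tooling is scripted, it is worth generating timestamps in UTC at the source rather than converting later. A minimal Python sketch (the format string is just one common convention, not a standard):

```python
from datetime import datetime, timezone

# Capture the current time directly in UTC, never in local time.
now = datetime.now(timezone.utc)
print(now.strftime("%Y-%m-%d %H:%M UTC"))  # e.g. "2026-03-15 14:22 UTC"
```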
2. Summary (Executive Summary)
A 3–5 sentence overview written for a non-technical reader. Covers what failed, when, the impact, and the key cause.
Template:
On [date], the [service/system] experienced [type of degradation] beginning at [time UTC]. The incident lasted [duration] and affected approximately [N% of users / N users / specific functionality]. The root cause was [brief description]. The incident was resolved by [mitigation action].
Real example:
On 2026-03-15, the Checkout Service experienced a complete outage beginning at 14:22 UTC. The incident lasted 47 minutes and prevented all users from completing purchases. The root cause was a misconfigured database connection pool introduced in the 14:18 deployment. The service was restored by rolling back to the previous release version.
Key vocabulary:
- experienced an outage — the service was completely unavailable
- experienced degradation — the service was partially working (slower, errors for some users)
- affected — which users or functionality were impacted
- root cause was — the underlying reason (not just the proximate trigger)
- resolved by — the action that restored service
3. Timeline
The timeline is the chronological record of what happened, when, and who acted. It is the most factual section — avoid interpretation here.
Format:
14:18 UTC Deployment of version 2.4.1 started (engineer: J. Smith)
14:22 UTC Error rate on /checkout endpoint rose above 5%
14:23 UTC PagerDuty alert fired: "Checkout error rate > 1%"
14:27 UTC On-call engineer acknowledged the page
14:31 UTC Database connection pool exhaustion confirmed as likely cause
14:44 UTC Decision taken to roll back to version 2.4.0
14:47 UTC Rollback deployment started
15:02 UTC Rollback complete; error rate returned to baseline
15:09 UTC Incident declared resolved; monitoring verified stable
Language tips for the timeline:
Use simple past tense and passive voice for events — it keeps the focus on what happened, not who did it:
| Blameful ❌ | Blameless ✅ |
|---|---|
| "John deployed broken code" | "Version 2.4.1 was deployed" |
| "The on-call engineer ignored the alert" | "The alert went unacknowledged for 4 minutes" |
| "Someone misconfigured the pool" | "The connection pool was misconfigured with a value of 5 (expected: 50)" |
Common timeline verbs:
- started / began / was initiated
- fired / triggered / was raised (for alerts)
- was acknowledged / was escalated
- was identified / was confirmed / was reproduced
- was mitigated / was resolved / was restored
- was declared (incident declared resolved)
4. Root Cause Analysis (RCA)
This is the most technically demanding section — and the most important. It explains why the incident occurred, not just what happened.
The 5 Whys technique:
The goal is to trace the chain of causality from the symptom back to the systemic root cause.
Why did the checkout service fail? → Because the database connection pool was exhausted.
Why was the connection pool exhausted? → Because it was configured with `max_connections: 5` instead of `50`.
Why was it misconfigured? → Because the default value in the new config template was changed without documentation.
Why was it undocumented? → Because there is no required review step for default config changes.
Root cause: Configuration templates can be modified without a mandatory review step that checks critical parameters against expected ranges.
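A root cause phrased this way points directly at its fix: an automated check. As a hedged illustration, here is a minimal Python sketch of such a validation step; `EXPECTED_RANGES`, `validate_config`, and the range values are hypothetical names invented for this example, not part of any real tool:

```python
# Hypothetical CI validation step: check critical config values against
# expected ranges before a deployment can proceed. Parameter names and
# ranges are illustrative, not a real tool's schema.
EXPECTED_RANGES = {
    "max_connections": (20, 200),
}

def validate_config(config: dict) -> list[str]:
    """Return a list of violations; an empty list means the config passes."""
    violations = []
    for key, (low, high) in EXPECTED_RANGES.items():
        value = config.get(key)
        if not isinstance(value, int) or not low <= value <= high:
            violations.append(f"{key}={value!r} outside expected range [{low}, {high}]")
    return violations

# The value that caused the incident would have been caught:
print(validate_config({"max_connections": 5}))
# -> ["max_connections=5 outside expected range [20, 200]"]
```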
Useful phrases for the RCA section:
- “The root cause was…”
- “The proximate cause was [X]; the underlying cause was [Y].”
- “This was triggered by… and compounded by…”
- “The failure mode was…”
- “The system lacked a safeguard to prevent…”
- “No alerting existed for…”
- “The configuration change went undetected because…”
Contributing factors (also: contributing causes):
“Several contributing factors amplified the impact: (1) the staging environment uses an in-memory database and did not reproduce the connection pool behaviour; (2) the deployment occurred during peak traffic hours.”
5. Impact
Quantify the blast radius. Be honest — underreporting damages trust.
Dimensions to cover:
- Duration — exactly how long the issue lasted
- User impact — percentage or number of affected users; specific features affected
- Business impact — revenue, SLA breach, customer complaints, data loss (if any)
- Geographic / segment impact — all users, or only a subset?
Template:
The outage lasted 47 minutes (14:22–15:09 UTC). During this window, 100% of checkout attempts failed across all regions. Based on average transaction volume, approximately 1,900 transactions were blocked. No data loss occurred. The incident triggered an SLA breach notification to three enterprise customers.
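A figure like "approximately 1,900 transactions" should be reproducible, not guessed. A minimal sketch of the estimate behind the example above, where the baseline rate is an assumed number chosen to match:

```python
# Illustrative estimate only: the baseline rate is an assumption,
# chosen to match the example figures above.
avg_checkouts_per_minute = 40
outage_minutes = 47

blocked = avg_checkouts_per_minute * outage_minutes
print(blocked)  # 1880 -> reported as "approximately 1,900"
```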
Key vocabulary:
- blast radius — the scope of impact; how widely an issue spread
- affected N% of users / impacted N users
- all regions / all users / users in [region]
- 100% error rate / partial degradation
- no data loss occurred / data integrity was maintained
- SLA breach — the incident exceeded defined service level agreement thresholds
- customer-facing — visible to end users (as opposed to internal)
6. Action Items
The most actionable section. Each item must be specific, assigned to a person, and time-bounded.
Format:
| Action | Owner | Priority | Due |
|---|---|---|---|
| Update config template defaults and add inline comments for all critical parameters | J. Smith | High | 2026-03-22 |
| Add automated validation step in CI to check connection pool config ranges | DevOps team | High | 2026-03-29 |
| Create staging database that mirrors production connection limits | Platform team | Medium | 2026-04-05 |
| Add metric and alert for connection pool utilisation > 80% | SRE | Medium | 2026-03-25 |
| Schedule incident review presentation for the full engineering team | EM | Low | 2026-03-20 |
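To make the "pool utilisation > 80%" item concrete, here is a minimal Python sketch of the check behind such an alert. In practice this logic would live in your monitoring stack; the function name, threshold constant, and sample numbers are assumptions for illustration:

```python
# Hypothetical check behind a "pool utilisation > 80%" alert.
ALERT_THRESHOLD = 0.80

def pool_utilisation(active: int, max_connections: int) -> float:
    """Fraction of the connection pool currently in use."""
    return active / max_connections

if pool_utilisation(active=45, max_connections=50) > ALERT_THRESHOLD:
    print("ALERT: connection pool utilisation above 80%")
```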
Language for action items:
Always use the imperative (command) form for action item descriptions — they are instructions, not observations:
| Observation (wrong register) | Action item (correct register) |
|---|---|
| "The config template was confusing." | "Rewrite the connection pool config template with inline documentation." |
| "There was no alert." | "Add a PagerDuty alert for pool utilisation > 80%." |
| "Staging didn't match production." | "Configure staging to use connection pool limits that match production." |
Blameless language: a guide
The hardest part of post-mortem writing for non-native speakers is often the cultural expectation of blamelessness. In many engineering cultures, direct attribution of failure to a person (“X broke Y”) is normal. In Western tech post-mortem culture, this is seen as counterproductive.
The principle: people do not cause incidents — systems, processes, and insufficient safeguards do.
Rewrite these phrases:
| Blameful ❌ | Blameless ✅ |
|---|---|
| "The developer pushed untested code." | "The code reached production without adequate test coverage." |
| "The DBA dropped the wrong table." | "A `DROP TABLE` command was executed against the production database without a dry-run check." |
| "The engineer forgot to update the runbook." | "The runbook had not been updated to reflect the new deployment process." |
| "Nobody noticed the alert." | "The alert was not acknowledged within the escalation window." |
| "She made a mistake." | "The configuration contained an error that was not caught by pre-deployment validation." |
The subject shifts from a person to a process, system, configuration, or practice.
Vocabulary quick reference
| Term | Definition |
|---|---|
| post-mortem | Document that analyses an incident after resolution; can also be a meeting |
| incident | An unplanned event that disrupts normal service operation |
| outage | Complete unavailability (100% failure rate) |
| degradation | Partial reduction in service quality (elevated error rate, high latency) |
| root cause | The underlying systemic reason for the incident |
| proximate cause | The immediate trigger (the thing that actually broke) |
| contributing factor | A condition that made the incident worse or more likely |
| mitigation | A temporary action that reduces impact before the root cause is fixed |
| remediation | The permanent fix that addresses the root cause |
| rollback | Reverting to a previous version of code or configuration |
| blast radius | The scope of who or what was affected |
| on-call | The engineer responsible for responding to alerts during a given rotation |
| SLA / SLO / SLI | Service Level Agreement / Objective / Indicator — contractual and operational availability targets |
| MTTR | Mean Time to Restore — the average time to resolve an incident |
| MTTD | Mean Time to Detect — the average time to discover an incident |
| blameless | Without attributing fault to individuals; focused on systems and processes |
| action item | A specific, assigned, time-bounded task resulting from the post-mortem |
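MTTD and MTTR are plain averages over incident timestamps, which makes them easy to compute from your timelines. A minimal sketch; the first row matches the example incident above, the second is invented for illustration:

```python
from datetime import datetime

# Hypothetical incident log: (started, detected, resolved), all UTC.
incidents = [
    ("14:22", "14:23", "15:09"),
    ("09:10", "09:13", "09:51"),
]

def minutes_between(earlier: str, later: str) -> float:
    fmt = "%H:%M"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds() / 60

mttd = sum(minutes_between(s, d) for s, d, _ in incidents) / len(incidents)
mttr = sum(minutes_between(s, r) for s, _, r in incidents) / len(incidents)
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")  # MTTD: 2 min, MTTR: 44 min
```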
Template
Copy and adapt this template for your team:
# [Service Name] [Outage / Degradation / Incident] — [YYYY-MM-DD]
**Severity:** SEV-1 / SEV-2
**Duration:** [N] minutes ([HH:MM]–[HH:MM] UTC)
**Author:** [Name]
**Status:** Draft
---
## Summary
[3–5 sentence overview of what failed, when, the impact, and the key cause.]
---
## Timeline
| Time (UTC) | Event |
|---|---|
| HH:MM | [Event description in past tense] |
| HH:MM | [Event description] |
---
## Root Cause Analysis
The root cause was [description].
**5 Whys:**
1. Why did [symptom] occur? → [Reason 1]
2. Why did [Reason 1] occur? → [Reason 2]
...
Root cause: [Systemic gap]
**Contributing factors:**
- [Factor 1]
- [Factor 2]
---
## Impact
Duration: [N] minutes
Users affected: [N% / all users / users in region X]
Features affected: [specific features or API endpoints]
Data loss: None / [description]
SLA breach: Yes / No
---
## Action Items
| Action | Owner | Priority | Due |
|---|---|---|---|
| [Specific action in imperative form] | [Name] | High/Medium/Low | [Date] |
---
## Lessons Learned
- [What we learned about our system]
- [What we learned about our process]
- [What went well during response]
What good post-mortems have in common
✓ Specific, not vague — “connection pool exhausted at 14:31 UTC with 0/5 available connections” is more useful than “database issues”
✓ Quantified — “47 minutes”, “1,900 blocked transactions”, “3 SLA notifications” beats “significant downtime”
✓ Systemic root cause — stops at “no validation step existed”, not at “the config was wrong”
✓ Actionable items — “Add CI check for pool config ranges by 2026-03-29 (Owner: DevOps)” not “improve the deployment process”
✓ Blameless tone — systems and processes fail, not people
✓ Written while memory is fresh — start within 24 hours, complete within 48–72 hours
Writing a thorough post-mortem is one of the highest-value activities a senior engineer can do. It turns a painful incident into institutional knowledge. The clearer your English, the more that knowledge transfers to the whole team.