How to Write a Database Incident Report in English
A complete guide for DBAs and data engineers: how to write a post-incident report (PIR) for database outages — structure, vocabulary, templates, and professional phrases.
A database incident report (also called a post-incident review or postmortem) is a written account of what went wrong, why it happened, and how to prevent it from recurring. Writing a clear, professional incident report is one of the most important communication skills for a DBA or data engineer. This guide walks through the full structure with vocabulary, templates, and ready-to-use phrases.
What Is a Database Incident Report?
A database incident report (also: post-incident review (PIR), postmortem, or root cause analysis (RCA)) is a document written after a database outage, performance degradation, or data integrity event.
Its goals are:
- Transparency — inform stakeholders about what happened
- Accountability — document the timeline and who did what
- Learning — identify the root cause
- Prevention — define action items to prevent recurrence
“We owe the stakeholders a PIR within 48 hours of this outage. The report should be factual, blameless, and include specific action items.”
Key principle: blameless culture. Good incident reports focus on systems, processes, and conditions — not on blaming individuals.
Incident Report Structure
Section 1: Incident Summary
Start with a high-level summary that anyone can read in 30 seconds.
Template:
## Incident Summary
**Incident ID**: INC-2026-047
**Date**: 2026-04-14
**Duration**: 2h 15m (09:43 UTC – 11:58 UTC)
**Severity**: P1 (Critical)
**Impact**: Orders database replica lag exceeded 4 hours; reporting dashboards
displayed stale data for ~3.5 hours; 127 business users affected
**Status**: Resolved
Vocabulary:
Severity — the priority/impact classification of an incident. Common levels:
- P1 / Critical — complete outage, major data loss, revenue impact
- P2 / High — significant degradation or partial outage
- P3 / Medium — degraded performance, minor feature unavailable
- P4 / Low — cosmetic or minimal impact
Duration — total elapsed time from detection to resolution.
Impact — who was affected and how. Always quantify where possible: number of users, transactions, error rate, revenue.
Section 2: Timeline
A precise chronological account of the incident. Use UTC timestamps.
Template:
## Timeline (all times UTC)
| Time | Event |
|-------|-------|
| 09:43 | Monitoring alert fires: replica lag exceeds 30s threshold (measured lag: 60s) |
| 09:47 | On-call DBA acknowledges alert |
| 09:52 | Investigation begins; spike in replica I/O write latency identified |
| 10:03 | First stakeholder update sent (T+20min) |
| 10:05 | Root cause identified: a long-running analytical query blocked replication |
| 10:12 | Long-running query terminated |
| 10:30 | Replica lag begins decreasing |
| 10:43 | Second stakeholder update sent (T+60min) |
| 11:43 | Third stakeholder update sent (T+2h) |
| 11:58 | Replica lag returns to < 5 seconds; incident resolved |
| 13:00 | Post-resolution stakeholder update sent |
| 14:30 | Incident report draft completed |
Language tips for the timeline:
- Use passive or active voice consistently: “Alert triggered” (passive) or “Monitoring fired the alert” (active)
- Be specific about what was detected, identified, implemented, and resolved — these are different events
- Include communication events: “Stakeholders notified”, “Incident bridge opened”
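The diagnostic and remediation steps referenced in this timeline (identifying the long-running query at 10:05 and terminating it at 10:12) map onto a handful of standard queries. As a minimal sketch, assuming a PostgreSQL primary with a streaming replica (other engines have equivalents, and the PID shown is a placeholder), the on-call DBA might run something like:
-- On the replica: approximate replication lag
-- (note: this overstates lag when the primary is idle)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- On the primary: queries that have been running for more than 15 minutes
SELECT pid, usename, query_start,
       now() - query_start AS runtime,
       left(query, 80)     AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '15 minutes'
ORDER BY query_start;
-- Once the offending session is confirmed, cancel or terminate it
-- (12345 is a placeholder PID)
SELECT pg_cancel_backend(12345);     -- gentler: cancels the current query only
SELECT pg_terminate_backend(12345);  -- harder: ends the whole session
Recording the exact commands run during remediation, for example in an appendix, makes the “terminated” and “resolved” entries in the timeline verifiable later.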
Section 3: Root Cause Analysis
This is the analytical core of the report. Explain what caused the incident and why.
Root Cause Analysis methods:
5 Whys — ask “Why?” repeatedly until you reach the systemic root cause:
The orders database replica was delayed →
Why? A long-running analytical query held a table lock →
Why? The query ran on the primary instead of the read replica →
Why? The ETL job was misconfigured to use the primary connection string →
Why? The ETL job configuration wasn't reviewed during the recent
infrastructure migration →
Why? There was no checklist requiring configuration review for
infrastructure migrations
Root cause: Missing configuration review checklist for infrastructure migrations
Template:
## Root Cause Analysis
**Immediate cause**: A long-running analytical query acquired a table lock on the
primary database, blocking replication for 2+ hours.
**Contributing factors**:
1. The ETL pipeline was misconfigured to connect to the primary instead of the
read replica after last month's infrastructure migration.
2. There was no alerting on ETL connection endpoints — the misconfiguration
was not detected for 3 weeks.
3. The replica lag alert threshold (30s) was too high to catch the problem early.
**Root cause**: The infrastructure migration runbook did not include a step to
validate ETL connection configurations after a primary/replica endpoint change.
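The root cause here is a validation step that never ran, so it helps to state in the report what that validation would look like. As a hedged illustration (PostgreSQL-specific, and the check itself is an assumption rather than part of the original runbook), confirming that an ETL connection really points at the read replica can be as simple as:
-- Run through the ETL job's own connection string:
-- returns true on a standby (replica), false on the primary
SELECT pg_is_in_recovery();
-- Optionally record which server answered, for the migration checklist
SELECT inet_server_addr() AS server_address,
       inet_server_port() AS server_port;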
Section 4: Impact Assessment
Quantify the business impact as precisely as possible.
Template:
## Impact Assessment
**Data integrity**: No data was lost or corrupted. The read-only replica served
stale data for 3.5 hours; writes continued safely on the primary throughout.
**User impact**: 127 users of the reporting dashboard received data that was
up to 4 hours out of date.
**Business impact**: Three scheduled order fulfilment reports were generated
with stale data. Manual re-generation was required post-recovery.
Estimated additional engineering time: 3 hours.
**Customer impact**: No external customer impact detected. Internal operations
teams were affected.
**Revenue impact**: None identified directly; estimated indirect operational cost:
€2,400 in engineer time.
Section 5: What Went Well
Include what worked — the systems, processes, and people that limited the impact. This builds a positive learning culture.
Template:
## What Went Well
- Monitoring detected the replica lag within 4 minutes of onset
- On-call DBA responded within 4 minutes of the alert
- Incident communication was clear — stakeholders received updates at
T+20min, T+60min, and T+2h
- The read replica correctly isolated the impact — the primary was unaffected
and write operations continued normally throughout
Section 6: What Could Be Improved
Honest assessment of gaps in process, tooling, or knowledge.
Template:
## What Could Be Improved
- ETL connection configuration was not validated after migration
- No alert existed for ETL job connection endpoint changes
- The replica lag alert threshold (30s) was too high to provide early warning
  before business impact — we were alerted after impact, not before
- The on-call runbook did not include steps for diagnosing replica lag due to
lock contention specifically
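The last gap above, missing runbook steps for lock-related replica lag, is the kind of improvement the report can make concrete. A minimal sketch of the query such a runbook section might start from, again assuming PostgreSQL (pg_blocking_pids is available from version 9.6 onward):
-- Sessions currently waiting on a lock, and the PIDs blocking them
SELECT blocked.pid                   AS waiting_pid,
       pg_blocking_pids(blocked.pid) AS blocked_by_pids,
       blocked.wait_event_type,
       blocked.wait_event,
       left(blocked.query, 80)       AS waiting_query
FROM pg_stat_activity AS blocked
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
Capturing this output at the time of the incident also gives the root cause section direct evidence to cite.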
Section 7: Action Items
Every incident report must end with specific, owned, time-bound action items.
Template:
## Action Items
| ID | Action | Owner | Priority | Due Date |
|----|--------|-------|----------|----------|
| AI-1 | Add ETL connection endpoint validation to migration runbook | Alex Chen | High | 2026-04-21 |
| AI-2 | Create alert for ETL job connection endpoint misconfigurations | Maria Lopez | High | 2026-04-21 |
| AI-3 | Reduce replica lag alert threshold from 30s to 10s for P1 escalation | Alex Chen | Medium | 2026-04-28 |
| AI-4 | Add lock contention section to on-call replica lag runbook | Alex Chen | Medium | 2026-05-05 |
| AI-5 | Review all ETL jobs for connection string correctness post-migration | Maria Lopez | High | 2026-04-18 |
Language for action items:
- Use imperative form: “Add…”, “Create…”, “Reduce…”, “Review…”
- Assign one owner per item — not a team
- Set a specific due date — not “soon” or “next quarter”
Useful Phrases
For the summary:
- “This incident resulted in [X] minutes of degraded service for [Y] users.”
- “No data loss or corruption occurred.”
For the root cause section:
- “The immediate cause was… however, the root cause was…”
- “This condition went undetected because…”
- “A contributing factor was the absence of…”
For the lessons section:
- “This incident revealed a gap in our…”
- “We had not anticipated that…”
- “The alert fired after the impact had already begun — we need earlier detection.”
Practice
Deepen your DBA communication vocabulary with the Database Administration exercise set and the DBA learning path.