How to Write a Database Incident Report in English
A complete guide for DBAs and data engineers: how to write a post-incident report (PIR) for database outages — structure, vocabulary, templates, and professional phrases.
A database incident report (also called a post-incident review or postmortem) is a written account of what went wrong, why it happened, and how to prevent it from recurring. Writing a clear, professional incident report is one of the most important communication skills for a DBA or data engineer. This guide walks through the full structure with vocabulary, templates, and ready-to-use phrases.
What Is a Database Incident Report?
A database incident report (also: post-incident review (PIR), postmortem, or root cause analysis (RCA)) is a document written after a database outage, performance degradation, or data integrity event.
Its goals are:
- Transparency — inform stakeholders about what happened
- Accountability — document the timeline and who did what
- Learning — identify the root cause
- Prevention — define action items to prevent recurrence
“We owe the stakeholders a PIR within 48 hours of this outage. The report should be factual, blameless, and include specific action items.”
Key principle: blameless culture. Good incident reports focus on systems, processes, and conditions — not on blaming individuals.
Incident Report Structure
Section 1: Incident Summary
Start with a high-level summary that anyone can read in 30 seconds.
Template:
## Incident Summary
**Incident ID**: INC-2026-047
**Date**: 2026-04-14
**Duration**: 2h 15m (09:43 UTC – 11:58 UTC)
**Severity**: P1 (Critical)
**Impact**: Orders database replica lag exceeded 4 hours; reporting dashboards
displayed stale data for ~3.5 hours; 127 business users affected
**Status**: Resolved
Vocabulary:
Severity — the priority/impact classification of an incident. Common levels:
- P1 / Critical — complete outage, major data loss, revenue impact
- P2 / High — significant degradation or partial outage
- P3 / Medium — degraded performance, minor feature unavailable
- P4 / Low — cosmetic or minimal impact
Duration — total elapsed time from detection to resolution.
Impact — who was affected and how. Always quantify where possible: number of users, transactions, error rate, revenue.
Section 2: Timeline
A precise chronological account of the incident. Use UTC timestamps.
Template:
## Timeline (all times UTC)
| Time | Event |
|-------|-------|
| 09:43 | Monitoring alert fires: replica lag exceeds 30s threshold (measured lag: 60s) |
| 09:47 | On-call DBA acknowledges alert |
| 09:52 | Investigation begins; spike in replica I/O write latency identified |
| 10:03 | First stakeholder update sent (T+20min) |
| 10:05 | Root cause identified: a long-running analytical query blocked replication |
| 10:12 | Long-running query terminated |
| 10:30 | Replica lag begins decreasing |
| 10:43 | Second stakeholder update sent (T+60min) |
| 11:43 | Third stakeholder update sent (T+2h) |
| 11:58 | Replica lag returns to < 5 seconds; incident resolved |
| 13:00 | Post-resolution stakeholder update sent |
| 14:30 | Incident report draft completed |
Language tips for the timeline:
- Use passive or active voice consistently: “Alert triggered” (passive) or “Monitoring fired the alert” (active)
- Be specific about what was detected, identified, implemented, and resolved — these are different events
- Include communication events: “Stakeholders notified”, “Incident bridge opened”
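The diagnostic and remediation steps referenced in this timeline (identifying the long-running query at 10:05 and terminating it at 10:12) map onto a handful of standard queries. As a minimal sketch, assuming a PostgreSQL primary with a streaming replica (other engines have equivalents, and the PID shown is a placeholder), the on-call DBA might run something like:
-- On the replica: approximate replication lag
-- (note: this overstates lag when the primary is idle)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- On the primary: queries that have been running for more than 15 minutes
SELECT pid, usename, query_start,
       now() - query_start AS runtime,
       left(query, 80)     AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '15 minutes'
ORDER BY query_start;
-- Once the offending session is confirmed, cancel or terminate it
-- (12345 is a placeholder PID)
SELECT pg_cancel_backend(12345);     -- gentler: cancels the current query only
SELECT pg_terminate_backend(12345);  -- harder: ends the whole session
Recording the exact commands run during remediation, for example in an appendix, makes the “terminated” and “resolved” entries in the timeline verifiable later.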
Section 3: Root Cause Analysis
This is the analytical core of the report. Explain what caused the incident and why.
Root Cause Analysis methods:
5 Whys — ask “Why?” repeatedly until you reach the systemic root cause:
The orders database replica was delayed →
Why? A long-running analytical query held a table lock →
Why? The query ran on the primary instead of the read replica →
Why? The ETL job was misconfigured to use the primary connection string →
Why? The ETL job configuration wasn't reviewed during the recent
infrastructure migration →
Why? There was no checklist requiring configuration review for
infrastructure migrations
Root cause: Missing configuration review checklist for infrastructure migrations
Template:
## Root Cause Analysis
**Immediate cause**: A long-running analytical query acquired a table lock on the
primary database, blocking replication for 2+ hours.
**Contributing factors**:
1. The ETL pipeline was misconfigured to connect to the primary instead of the
read replica after last month's infrastructure migration.
2. There was no alerting on ETL connection endpoints — the misconfiguration
was not detected for 3 weeks.
3. The replica lag alert threshold (30s) was too high to catch the problem early.
**Root cause**: The infrastructure migration runbook did not include a step to
validate ETL connection configurations after a primary/replica endpoint change.
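The root cause here is a validation step that never ran, so it helps to state in the report what that validation would look like. As a hedged illustration (PostgreSQL-specific, and the check itself is an assumption rather than part of the original runbook), confirming that an ETL connection really points at the read replica can be as simple as:
-- Run through the ETL job's own connection string:
-- returns true on a standby (replica), false on the primary
SELECT pg_is_in_recovery();
-- Optionally record which server answered, for the migration checklist
SELECT inet_server_addr() AS server_address,
       inet_server_port() AS server_port;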
Section 4: Impact Assessment
Quantify the business impact as precisely as possible.
Template:
## Impact Assessment
**Data integrity**: No data was lost or corrupted. The read-only replica served
stale data for 3.5 hours; writes continued safely on the primary throughout.
**User impact**: 127 users of the reporting dashboard received data that was
up to 4 hours out of date.
**Business impact**: Three scheduled order fulfilment reports were generated
with stale data. Manual re-generation was required post-recovery.
Estimated additional engineering time: 3 hours.
**Customer impact**: No external customer impact detected. Internal operations
teams were affected.
**Revenue impact**: None identified directly; estimated indirect operational cost:
€2,400 in engineer time.
Section 5: What Went Well
Include what worked — the systems, processes, and people that limited the impact. This builds a positive learning culture.
Template:
## What Went Well
- Monitoring detected the replica lag within 4 minutes of onset
- On-call DBA responded within 4 minutes of the alert
- Incident communication was clear — stakeholders received updates at
T+20min, T+60min, and T+2h
- The read replica correctly isolated the impact — the primary was unaffected
and write operations continued normally throughout
Section 6: What Could Be Improved
Honest assessment of gaps in process, tooling, or knowledge.
Template:
## What Could Be Improved
- ETL connection configuration was not validated after migration
- No alert existed for ETL job connection endpoint changes
- The replica lag alert threshold (30s) was too high to provide early warning
  before business impact — we were alerted after impact, not before
- The on-call runbook did not include steps for diagnosing replica lag due to
lock contention specifically
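The last gap above, missing runbook steps for lock-related replica lag, is the kind of improvement the report can make concrete. A minimal sketch of the query such a runbook section might start from, again assuming PostgreSQL (pg_blocking_pids is available from version 9.6 onward):
-- Sessions currently waiting on a lock, and the PIDs blocking them
SELECT blocked.pid                   AS waiting_pid,
       pg_blocking_pids(blocked.pid) AS blocked_by_pids,
       blocked.wait_event_type,
       blocked.wait_event,
       left(blocked.query, 80)       AS waiting_query
FROM pg_stat_activity AS blocked
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
Capturing this output at the time of the incident also gives the root cause section direct evidence to cite.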
Section 7: Action Items
Every incident report must end with specific, owned, time-bound action items.
Template:
## Action Items
| ID | Action | Owner | Priority | Due Date |
|----|--------|-------|----------|----------|
| AI-1 | Add ETL connection endpoint validation to migration runbook | Alex Chen | High | 2026-04-21 |
| AI-2 | Create alert for ETL job connection endpoint misconfigurations | Maria Lopez | High | 2026-04-21 |
| AI-3 | Reduce replica lag alert threshold from 30s to 10s for P1 escalation | Alex Chen | Medium | 2026-04-28 |
| AI-4 | Add lock contention section to on-call replica lag runbook | Alex Chen | Medium | 2026-05-05 |
| AI-5 | Review all ETL jobs for connection string correctness post-migration | Maria Lopez | High | 2026-04-18 |
Language for action items:
- Use imperative form: “Add…”, “Create…”, “Reduce…”, “Review…”
- Assign one owner per item — not a team
- Set a specific due date — not “soon” or “next quarter”
Useful Phrases
For the summary:
- “This incident resulted in [X] minutes of degraded service for [Y] users.”
- “No data loss or corruption occurred.”
For the root cause section:
- “The immediate cause was… however, the root cause was…”
- “This condition went undetected because…”
- “A contributing factor was the absence of…”
For the lessons section:
- “This incident revealed a gap in our…”
- “We had not anticipated that…”
- “The alert fired after the impact had already begun — we need earlier detection.”
Practice
Deepen your DBA communication vocabulary with the Database Administration exercise set and the DBA learning path.