5 exercises — trigger conditions, exact commands, verification steps, rollback procedures, and runbook maintenance. Write runbooks people can follow at 3am.
0 / 5 completed
Runbook writing checklist
Trigger: Exactly when to use this runbook. Symptoms, alert names, conditions.
Prerequisites: Access, credentials, tools needed before starting
Steps: Exact commands, expected outcomes, verification for each step
Escalation: Who to call if a step fails, and how to reach them
Rollback: How to undo each action safely
Last tested: Date and name — outdated runbooks cause incidents
1 / 5
An SRE is writing a runbook step for restarting the payments service. Which version is best for use at 3am during an incident?
The best runbook step has five elements that allow a stressed engineer to follow it at 3am:
1. Exact command: No ambiguity about what to run. Copy-pasteable. 2. Expected outcome: "Pod restarts within 60 seconds" — the engineer knows what success looks like. 3. Verification step: A command to confirm the action worked. 4. Time expectation: "after 2 minutes" — so the engineer knows when to stop waiting. 5. Escalation path: Who to call if this step fails, and how to reach them.
Runbook writing principles: • Write for someone who has never seen this service before • Use active, imperative verbs: Run, Verify, Check, Escalate — not "you might want to" • Never say "if appropriate" or "as needed" — these are decisions the runbook should make for the reader • Include rollback steps • Include links to dashboards and relevant logs
Anti-patterns to avoid: ❌ "Fix the memory issue" (not actionable) ❌ "Check if there are problems" (too vague) ❌ "As per standard procedure" (circular reference to nothing)
2 / 5
A runbook begins: "This runbook covers restarting the auth service." A senior SRE says it's incomplete. The critical missing element in the runbook's opening is _____.
A runbook without a clear trigger condition is dangerous: engineers either use it when they shouldn't, or don't use it when they should.
A complete runbook opening includes:
Purpose: "This runbook describes how to restart the auth service." When to use: "Use when: (1) PagerDuty alert 'auth-service-high-latency' fires AND p99 latency > 500ms for 5+ minutes, OR (2) auth service Pods are in CrashLoopBackOff." Prerequisites: "Requires access to: production namespace, Datadog, #incidents-channel." Estimated duration: "~5 minutes if Pods restart normally; escalate if >15 minutes." Last tested: "2026-03-15 by Alice Chen"
Why the trigger condition matters most: Engineers arriving at an incident are stressed. They need to know immediately: "Is this the right runbook for my situation?" Without explicit trigger conditions, they either apply the wrong runbook (causing additional damage) or waste time reading through to find out if it applies.
Runbook metadata vocabulary: "This runbook applies when…" "Trigger conditions: …" "Do NOT use this runbook if…" "Prerequisites: the following access / tools are required: …"
3 / 5
A runbook step says: "Scale up the consumer service." A new engineer is confused. Which rewritten step is clearest?
Good runbook steps always answer: exactly what action, with what value, how to verify success.
Breaking down the correct answer: • Exact command: `kubectl scale deployment/consumer-service --replicas=10 -n production` — no guessing what "scale up" means • Context: "increases from the current replica count to 10" — the engineer knows what's changing • Verification: specific metric (consumer lag), specific threshold (50,000), specific tool (Datadog) with a link, specific timeframe (3 minutes)
Why vague runbook steps cause incidents: "Scale up" → how many? 5? 20? 100? "If needed" → who decides when it's needed? During an incident, this becomes a debate "Handle the load" → not measurable — how do you know it's working?
Runbook action verb vocabulary: • "Run:" — for commands • "Navigate to:" — for UI steps • "Verify that:" — for checks • "If [condition], then:" — for conditional steps • "Wait until:" — for waiting with a success criterion • "Escalate to [team] via [channel] if:" — for escalation triggers
4 / 5
A runbook includes a section titled "Rollback Procedure." When should this section be used — and why is it essential?
A rollback procedure answers: "If following this runbook makes things worse, how do we undo what we just did?"
Why rollback sections are essential: • Runbook actions sometimes have unexpected side effects • A configuration change applied under pressure might have been the wrong one • An engineer needs to know they can safely undo actions without escalating
What a good rollback section looks like: "If after scaling to 10 replicas the consumer lag increases or error rate rises above 5%, roll back: kubectl scale deployment/consumer-service --replicas=3 -n production and notify #incidents with the current metrics."
Key rollback vocabulary: "Revert this change by:" "To restore to the previous state:" "Undo [step N] by running:" "Check the previous replica count in Datadog before scaling down" "Tag this incident with [label] so the post-mortem covers the rollback"
Rollback anti-patterns: ❌ "Undo the changes" (not actionable) ❌ No rollback section at all ❌ "Ask someone" (by 3am, "someone" may not be available) ❌ Rollback that requires the same expertise as the original fix
5 / 5
After an engineer successfully follows a runbook to resolve an incident, they notice one of the steps was outdated and no longer matched the current Kubernetes namespace. What is the correct action regarding the runbook?
A runbook that isn't maintained becomes more dangerous over time than having no runbook at all — because it creates false confidence.
Runbook maintenance vocabulary: "Update the runbook to reflect the current namespace/configuration." "Add a note: 'Updated 2026-04-07 — namespace changed from `legacy` to `production` (Alex T.).'" "Notify the team: 'Updated the auth restart runbook — step 3 namespace was wrong. Fixed.'"
When to update runbooks: • After any incident where a step was wrong, unclear, or missing • After infrastructure changes (new namespace, new tool, changed service name) • After any "first use" — the person using a runbook for the first time always finds inaccuracies • On a regular review schedule (quarterly at minimum for critical runbooks)
Runbook update best practices: • Add "Last tested by" and "Last updated" fields • Use version history or Git blame so teams can see what changed • Include who to contact if a step is unclear • Run fire drills: have a new team member follow the runbook and identify gaps
The "blameless maintainer" principle: updating a runbook after finding an error is a contribution, not an admission of fault.