Advanced Listening #incidents #on-call #SRE

Live Incident Bridge Calls

Read three incident call transcripts (a payment outage, a database replication incident, and a post-incident handoff), then answer comprehension questions about the reasoning and decisions made under pressure.

How to follow an incident call in English
  • Identify the IC: the Incident Commander controls the call, so note their directives and decisions
  • Theory vs. fact: engineers distinguish confirmed facts from hypotheses — listen for "we've confirmed", "theory is", "likely"
  • Time pressure: incident calls are time-boxed — "five-minute clock", "in the next two minutes" signal urgency
  • Action items: every call ends with specific owners and deadlines — extract who does what by when
Transcript 1 of 3
[Live incident bridge call — SEV-1. Payment service down. 11 minutes into the call.]
IC (Incident Commander): "Okay, let's get a status update from each team. Engineering lead — where are we?"
Lead: "We've confirmed the payment service is returning 503s across all regions. Started at 14:32 UTC. The service processes roughly 1,200 transactions per minute normally — all of that is failing. We've isolated it to the payment service; the API gateway and auth service are healthy."
IC: "Do we have a theory on root cause yet?"
Lead: "Two theories. Theory one: a configuration change was deployed at 14:28 — four minutes before the incident. It's a likely candidate. Theory two: the upstream payment provider had a reported incident starting at 14:25, which predates ours — so there could be a dependency failure. We're running diagnostics on both."
IC: "Can you isolate which one without waiting for the full diagnostic? We need to decide whether to roll back or wait."
Lead: "We can check the upstream provider's status page and run a direct health check against their endpoint. If their endpoint is returning errors, it's theory two and a rollback won't help. If it's healthy, we prioritise theory one and roll back the config. Can do in five minutes."
IC: "Do it. Comms lead — what's the customer-facing status?"
Comms: "Status page updated at 14:38 with 'investigating payment issues'. I need the engineering team's word on whether to escalate this to 'major outage' language — that triggers our enterprise customer email notification flow."
IC: "Hold on that escalation until we have a root cause theory confirmed. Engineering — five-minute clock starts now. Everyone else: silence on the channel unless you have new information."
What is the IC's decision-making priority at this point in the incident, and what specific diagnostic shortcut does the engineering lead propose?
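For readers checking their answer, the timeline and the lead's proposed shortcut can be written out as a small Python sketch. The times come from the transcript. The function names and the `upstream_healthy` flag are illustrative assumptions: the flag stands in for the result of the direct health check the lead describes, which is not actually performed here.

```python
from datetime import datetime


def minutes_before(event: str, incident: str) -> int:
    """Minutes by which `event` precedes `incident` (both 'HH:MM', same day, UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(incident, fmt) - datetime.strptime(event, fmt)
    return int(delta.total_seconds() // 60)


def next_action(upstream_healthy: bool) -> str:
    """Decision rule stated by the engineering lead.

    `upstream_healthy` is a hypothetical input: the outcome of checking the
    provider's status page and health-checking their endpoint.
    """
    if upstream_healthy:
        # Provider is fine, so the 14:28 config change is the prime suspect.
        return "prioritise theory one: roll back the config change"
    # Provider is erroring, so the failure is a dependency problem.
    return "theory two: upstream dependency failure, a rollback won't help"


# Timeline from the transcript: the incident started at 14:32 UTC.
print(minutes_before("14:28", "14:32"))  # config change: 4 minutes before
print(minutes_before("14:25", "14:32"))  # upstream incident: 7 minutes before
print(next_action(upstream_healthy=False))
```

Both candidate causes precede the outage, which is why timing alone cannot separate them and the lead proposes a single quick check whose result settles the rollback decision.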