Upper-intermediate · 11 terms

Incident Command & Response

ICS terminology for software incidents: roles (Incident Commander, Operations Lead, Comms Lead, Scribe), SEV declaration, assuming command, the war room, the incident timeline, standing down, the all-clear, and the post-mortem.

  • Incident Commander (IC) /ˈɪnsɪdənt kəˈmɑːndər/

    The role responsible for overall coordination of an incident response — declaring severity, assigning roles, making decisions about escalation and communication, and calling the all-clear. The IC manages the incident, not the technical fix.

    "As Incident Commander, I'm declaring this a SEV-1. Sarah, you're Operations Lead — own the technical investigation. Marcus, you're Comms Lead — draft stakeholder updates every 15 minutes. I'll coordinate with Engineering leadership. Everyone: updates in the incident channel, not Slack DMs."
  • Operations Lead /ˌɒpəˈreɪʃənz liːd/

    The incident role responsible for technical investigation and remediation — directing the engineering response, coordinating the on-call engineers working the fix, and providing technical status updates to the Incident Commander.

    "Operations Lead update to IC: root cause isolated to the Redis connection pool exhaustion on the payment service. Fix options: (1) restart the service — estimated 3 minutes, risk of continued exhaustion; (2) roll back to v2.4.1 — estimated 8 minutes, resolves the root cause. Recommending option 2."
  • Comms Lead /kɒmz liːd/

    The incident role responsible for all external and internal written communications during an incident — drafting status page updates, stakeholder notifications, and executive summaries, and managing the communication timeline.

    "Comms Lead posting status update 3: 'We have identified the root cause of the elevated error rates. A fix is being deployed. ETA to resolution: 8 minutes. Affected functionality: payment processing. Orders placed during this window will be retried automatically. Next update in 10 minutes.'"
  • Scribe /skraɪb/

    The incident role responsible for maintaining the real-time incident timeline — logging actions taken, decisions made, status changes, and timestamps in the incident document as the event unfolds.

    "Scribe log: 14:22 IC declared SEV-1. 14:24 Ops Lead identified Redis connection pool exhaustion. 14:27 Rollback to v2.4.1 initiated. 14:35 Rollback complete, error rate declining. 14:42 Error rate below 0.1%, monitoring. 14:55 IC declared all-clear, incident stood down."
  • SEV (Severity) Declaration /ˈsevərɪti/

    The formal classification of an incident's impact and urgency. Severity 1: complete service outage or critical data loss. Severity 2: major functionality affected, significant user impact. Severity 3: minor issue with workaround. Drives the response team size and escalation path.

    "I'm declaring SEV-2: checkout is working but the order confirmation emails are failing for 40% of orders. No revenue impact, but customer trust impact at scale. SEV-2 response: Ops Lead + 2 engineers, 15-minute update cadence, VP Engineering notified asynchronously."
  • Assuming Command /əˈsjuːmɪŋ kəˈmɑːnd/

    The formal verbal transfer of the Incident Commander role from one person to another during a long-running incident — done explicitly and acknowledged by the team to avoid confusion about who is making decisions.

    "I'm assuming command from Tom. Tom, please brief me: current status, active working threads, last comms update, and any decisions pending. Team: Emma is now IC. All comms and requests go to Emma. Tom, please document the handover in the incident channel."
  • Standing Down /ˈstændɪŋ daʊn/

    The formal declaration by the Incident Commander that an incident is resolved and the incident response team is released from active duty — after confirming the fix is stable, metrics are normal, and monitoring is in place.

    "IC standing down the SEV-2 at 16:45. Error rate at 0.02% (baseline), all order confirmation emails processing normally. 30-minute tail monitoring in place. Ops Lead: please keep the on-call rotation alert. We'll schedule the post-mortem for Thursday at 14:00."
  • War Room /wɔːr ruːm/

    A dedicated space, virtual or physical, where the incident response team convenes (a Zoom bridge, Slack channel, or conference room), kept separate from normal operations to maintain focus and clear communication.

    "The war room for SEV-1 incidents is a dedicated Zoom bridge with a fixed link. The Zoom waiting room is disabled during active incidents — anyone with the link can join. All incident decision-making happens in the war room, not in individual Slack threads."
  • Incident Timeline /ˈɪnsɪdənt ˈtaɪmlaɪn/

    A chronological log of significant events during an incident: alert firing, IC declaration, diagnosis milestones, actions taken, resolution. The primary artefact for post-mortem analysis and communication to stakeholders.

    "The incident timeline is auto-populated from the incident channel using our bot that timestasmp any message tagged #action or #finding. The scribe supplements with decisions and status changes. The post-mortem template ingests the timeline automatically."
  • All-Clear /ˌɔːl ˈklɪər/

    The IC's declaration that an incident is fully resolved — metrics have returned to baseline, the fix is confirmed stable, monitoring is in place for regression, and the incident response team is stood down.

    "IC all-clear at 15:42: error rate 0.03% (pre-incident baseline 0.04%), p99 latency 180ms (baseline 175ms), payment throughput normal. Fix confirmed stable for 20 minutes. Post-mortem scheduled for Friday. Thank you team — excellent response."
  • Post-Mortem / Post-Incident Review /ˈpoʊst ˈmɔːrtəm/

    A structured blameless retrospective after an incident: root cause analysis, timeline reconstruction, contributing factors, and action items to prevent recurrence. The output is a written document shared across engineering.

    "The post-mortem confirmed the root cause was a connection pool configuration parameter that wasn't included in the deployment checklist. Three action items: (1) add connection pool config to the deployment checklist, (2) add a pre-deployment connection pool health check, (3) add a connection pool exhaustion alert to the runbook."