5 exercises — idioms used by SREs, DevOps engineers, and senior developers when discussing system resilience, incidents, and production reliability.
Idioms covered in this set
"Blast radius" — the scope of impact when something fails
"Fire drill" — a practice run of an emergency scenario
"Bus factor" — how many people have critical knowledge
"Happy path" — the successful execution path (no errors/edge cases)
"Cascade failure" — chain reaction of failures across services
1 / 5
The SRE says during an incident: "We need to assess the blast radius before rolling back." What does "blast radius" mean in this context?
"Blast radius" — borrowed from explosive/military terminology. In IT, it means the scope of impact when a system fails or a dangerous change is deployed.
Usage in incident management:
• "What's the blast radius?" = "How many users/services are affected?"
• "We need to minimize the blast radius of this deployment."
• "By using feature flags, we reduced the blast radius to 5% of users."
Related practice — blast radius reduction:
• Feature flags (roll out to 1% of users first; see the sketch after this list)
• Canary deployments (deploy to a small cluster first)
• Circuit breakers (stop cascading failures)
• Blue-green deployments (instant rollback available)
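To make the feature-flag item concrete, here is a minimal sketch of a percentage rollout, assuming a hypothetical `feature_enabled` helper (not from any specific library): the user id is hashed into a stable bucket from 0 to 99, and the feature is enabled only below the rollout threshold, so a bad deploy only reaches that slice of users.

```python
import hashlib

# Hypothetical percentage-rollout helper (illustrative, not a real library API).
# Hashing the user id gives a stable bucket in 0-99, so the same user always
# sees the same variant while rollout_percent stays fixed.
def feature_enabled(user_id: str, rollout_percent: int) -> bool:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

# Only ~5% of users hit the new code path; a bad deploy's blast radius is
# limited to that slice, and "rollback" is just lowering the percentage.
if feature_enabled("user-42", rollout_percent=5):
    print("new (riskier) code path")
else:
    print("stable code path")
```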
In architecture decisions:
• "Design the service with a small blast radius — isolate failure domains."
• "Microservices should have a bounded blast radius: if one service fails, others keep running."
• "If this database goes down, what's the blast radius? Can we degrade gracefully?"
Also used in security: "The blast radius of this vulnerability is limited — only users who haven't patched to v3.2 are affected."
2 / 5
A DevOps engineer says: "The cert expiry is tomorrow. Don't panic — we've done fire drills for exactly this." What is a "fire drill" in an engineering context?
"Fire drill" — a practice simulation of an emergency to ensure the team is prepared and knows what to do when a real incident occurs.
Origin: Literal fire drills in buildings — practicing evacuation routes so that in a real fire, everyone knows the procedure without panic.
In SRE and DevOps:
• Practicing certificate rotation before a cert expires
• Simulating a database failover to test recovery procedures
• Running a GameDay (a Netflix-style chaos engineering session)
• Rehearsing incident response with a runbook
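As an illustration of the first bullet, here is a sketch of the kind of check a certificate-rotation fire drill rehearses, using only Python's standard library. The host name is just an example, and the date parsing assumes the usual `notAfter` format the `ssl` module returns.

```python
import socket
import ssl
from datetime import datetime, timezone

# Sketch: connect to a host, read its TLS certificate, and report how many
# days remain until expiry -- the alert a cert-rotation fire drill rehearses.
def days_until_expiry(host: str, port: int = 443) -> float:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' typically looks like: 'Jun  1 12:00:00 2026 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    delta = expires.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)
    return delta.total_seconds() / 86400

print(f"{days_until_expiry('example.com'):.1f} days until cert expiry")
```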
Common usage:
• "Let's do a fire drill next week — we'll simulate the payments service going down."
• "We fire drill the on-call rotation quarterly so everyone is comfortable."
• "After the fire drill, we found three gaps in our runbook."
Related concepts:
• GameDay — a scheduled chaos engineering session
• Chaos engineering — intentionally introducing failures to test resilience
• Runbook — step-by-step incident response documentation
• Incident retrospective / post-mortem — review after a real incident
"Fire drill vs. real fire: the fire drill is why the real fire went smoothly."
3 / 5
An engineering manager says: "The bus factor on this service is 1 — only Maria knows how it works." What does this mean, and why is it a concern?
"Bus factor" (also "truck factor") — the minimum number of team members who, if suddenly unavailable (hit by a bus), would put the project in critical jeopardy.
Bus factor = 1 is a serious risk: one person leaving, getting sick, or going on vacation could block the entire team.
Why it matters:
• Knowledge hoarding creates dependency and systemic risk
• Every team should aim for a bus factor ≥ 2 for critical knowledge
• It's not about distrust — it's about resilience
How to increase the bus factor:
• Pair programming to spread knowledge
• Documentation and runbooks
• Code reviews by non-authors
• On-call rotation so multiple engineers handle incidents
• Architecture decision records (ADRs)
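One way to find the risky spots is to measure them. Below is a rough, hypothetical probe that estimates a per-file bus factor from git history by counting distinct commit authors per file; files touched by a single author are the bus-factor-1 hotspots. Run it inside a repository; the parsing is deliberately simple.

```python
import subprocess
from collections import defaultdict

# Rough sketch: list every commit as "@<author email>" followed by the files
# it touched, then count distinct authors per file.
log = subprocess.run(
    ["git", "log", "--pretty=format:@%ae", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout

authors_per_file: dict[str, set] = defaultdict(set)
current_author = None
for line in log.splitlines():
    if line.startswith("@"):          # author marker (assumes no file paths start with "@")
        current_author = line[1:]
    elif line.strip():                # non-blank lines are file paths
        authors_per_file[line].add(current_author)

# Files with exactly one author are the bus-factor-1 hotspots.
for path, authors in sorted(authors_per_file.items()):
    if len(authors) == 1:
        print(f"bus factor 1: {path}")
```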
Common usage:
• "The bus factor on the authentication service is dangerously low."
• "We need to increase the bus factor before the all-hands conference."
• "Write documentation to raise the bus factor — if you go on holiday, nobody should be blocked."
Note: The phrase is intentionally a bit dark/humorous — engineers use it matter-of-factly, not offensively.
4 / 5
After a major production incident, a VP asks: "How did this pass testing? Are we running enough load tests, or are we just testing for the happy path?" What is the "happy path"?
"Happy path" — the execution path through a program that assumes everything goes right: valid input, expected behavior, no errors, no edge cases.
Also called: "golden path", "sunny day scenario".
Opposite: "sad path", "unhappy path", "edge case", "error path".
Why "only testing the happy path" is a problem: • Real users do unexpected things: empty inputs, null values, huge payloads, wrong file types • Real systems fail: network timeouts, database unreachable, third-party API returning 500 • Security vulnerabilities often live in non-happy paths: injection attacks exploit unhandled inputs
Examples of tests beyond the happy path:
• Empty string where a name is expected
• File upload with a PDF named "photo.jpg"
• 10,000 concurrent requests (load test)
• Database returning null for a foreign key
• An OAuth token that has expired
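A sketch of what this looks like in a test suite, using pytest and a hypothetical `validate_name` function: one happy-path test plus sad-path tests for two of the cases above (empty string, huge payload).

```python
import pytest

# Hypothetical function under test: accepts a non-empty name up to 100 chars.
def validate_name(name: str) -> str:
    if not isinstance(name, str) or not name.strip():
        raise ValueError("name must be a non-empty string")
    if len(name) > 100:
        raise ValueError("name too long")
    return name.strip()

def test_happy_path():
    # Valid input, expected behavior, no errors.
    assert validate_name("Ada Lovelace") == "Ada Lovelace"

def test_sad_path_empty_string():
    # Empty string where a name is expected.
    with pytest.raises(ValueError):
        validate_name("")

def test_sad_path_huge_payload():
    # Oversized input should be rejected, not silently accepted.
    with pytest.raises(ValueError):
        validate_name("x" * 10_000)
```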
In conversation:
• "The happy path works perfectly. Now let's test what happens when the payment gateway times out."
• "Our test suite only covers happy paths — we have zero sad path coverage."
• "The bug was on the sad path — users who had no profile picture crashed the UI."
5 / 5
A backend engineer warns: "If this service goes down, it could cause a cascade failure across the entire platform." What is a cascade failure?
"Cascade failure" (cascading failure) — a chain reaction where one system's failure causes connected systems to also fail, often amplifying the damage across the architecture.
Origin: Like a waterfall (cascade) — water flows from one level to the next; a failure in one service flows downstream to all dependent services.
Classic cascade failure scenario (step 2 is simulated in the sketch after this list):
1. Service A (auth) goes slow under load
2. Service B (API) waits for Service A — threads exhausted
3. Service B starts returning timeouts
4. Service C (frontend) gets errors from B — users see failures
5. The entire platform appears down even though only auth was slow
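The thread-exhaustion step is easy to reproduce in miniature. In this sketch, "Service B" has a four-worker thread pool and no timeouts; four requests blocked on a slow "Service A" starve a fifth request that never needed A at all (the service names and numbers are illustrative).

```python
import concurrent.futures
import time

def call_service_a():
    time.sleep(5)                      # Service A (auth) is slow under load
    return "token"

def handle_request(needs_auth: bool):
    if needs_auth:
        return call_service_a()        # blocks a worker thread for 5 seconds
    return "static content"            # should return instantly

# Service B's entire capacity: a pool of 4 worker threads.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

start = time.time()
futures = [pool.submit(handle_request, True) for _ in range(4)]  # 4 slow calls
futures.append(pool.submit(handle_request, False))               # 1 fast call

# The fast request queues behind the slow ones: the pool is exhausted by
# waiting on Service A, so it completes after ~5s instead of instantly.
print(futures[-1].result(), f"after {time.time() - start:.1f}s")
```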
Prevention strategies (and their vocabulary):
• Circuit breaker — automatically stop calling a failing service; return a fallback instead (see the sketch after this list)
• Timeout — never wait forever for a downstream service
• Retry with backoff — retry failed requests, but with increasing delays
• Bulkhead — isolate failures per service; don't share thread pools
• Graceful degradation — serve partial functionality when dependencies fail
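Here is a minimal circuit breaker sketch with illustrative thresholds (`FAILURE_THRESHOLD` and `RESET_SECONDS` are made-up names, not from any specific library): after three consecutive failures it "opens" and serves the fallback immediately, then lets one trial call through once the reset window passes.

```python
import time

class CircuitBreaker:
    FAILURE_THRESHOLD = 3   # consecutive failures before the circuit opens
    RESET_SECONDS = 30      # how long to fail fast before one trial call

    def __init__(self):
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.RESET_SECONDS:
                return fallback()        # open: fail fast, skip the downstream call
            self.opened_at = None        # half-open: let one trial call through
        try:
            result = fn()
            self.failures = 0            # a success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.FAILURE_THRESHOLD:
                self.opened_at = time.time()   # open the circuit
            return fallback()

# Demo: a dependency that always fails. After 3 attempts the breaker opens,
# and later calls return the fallback without touching the dependency.
breaker = CircuitBreaker()
for i in range(5):
    print(i, breaker.call(lambda: 1 / 0, lambda: "fallback response"))
```

The key design choice is failing fast once the circuit is open: instead of letting callers pile up waiting on a dead dependency (the thread-exhaustion scenario above), the breaker returns a fallback immediately and keeps the caller's thread pool free.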
In conversation:
• "The circuit breaker prevented it from becoming a full cascade failure."
• "Make every service resilient — assume all dependencies will fail eventually."
• "The post-mortem showed a cascade: auth latency caused a platform-wide outage."