5 exercises — practice structuring strong English answers to SRE interview questions: SLOs and error budgets, incident response, toil reduction, observability vs monitoring, and high-availability design.
How to structure SRE interview answers
SLO questions: define SLI → SLO → error budget → link to engineering prioritisation
Incident questions: contain first, diagnose second → named phases → communication cadence → blameless post-mortem
Toil questions: use the Google SRE definition (manual, repetitive, automatable, no lasting value) → give specific examples with impact
Observability questions: known unknowns (monitoring) vs unknown unknowns (observability) → three pillars: metrics, logs, traces
HA questions: address multiple layers → name patterns (bulkhead, graceful degradation, stateless) → invoke CAP theorem for distributed systems
1 / 5
The interviewer asks: "What is an SLO, and how do you go about setting one?" Which answer best demonstrates SRE depth?
Option B is the strongest: it gives the precise definition (ratio of good events over a time window), uses the correct acronym hierarchy (SLI → SLO → error budget), provides a concrete example, explains the process for setting one, and connects SLOs to engineering prioritisation via the error budget.

The SRE vocabulary hierarchy:
SLI (Service Level Indicator) — the actual metric being measured (e.g., request success rate, latency p99).
SLO (Service Level Objective) — the internal target for that metric (e.g., 99.9% of requests succeed).
SLA (Service Level Agreement) — an external, contractual commitment, typically less strict than the SLO. If the SLO is 99.9%, the SLA might guarantee 99.5%.
Error budget — 100% minus the SLO. "We have 0.1% of requests we're allowed to fail. If we've burned it all, we freeze features and work on reliability."

The error budget concept is what distinguishes a true SRE answer from a DevOps-monitoring answer — it directly connects reliability targets to engineering decision-making. Option C mentions the SLA distinction but doesn't explain SLIs or error budgets. Option D is a reasonable answer but misses error budgets and stakeholder negotiation.
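A minimal sketch of the SLI → SLO → error budget arithmetic can make this concrete in an interview. The function name and traffic numbers below are invented for illustration:

```python
# Minimal sketch of the SLI -> SLO -> error budget arithmetic (illustrative values).

def error_budget_report(good_events: int, total_events: int, slo: float) -> dict:
    """Compare a measured SLI against an SLO and report the remaining error budget."""
    sli = good_events / total_events                 # e.g. request success rate
    allowed_failures = (1 - slo) * total_events      # the error budget, in events
    actual_failures = total_events - good_events
    return {
        "sli": sli,
        "slo": slo,
        "error_budget_events": allowed_failures,
        "budget_remaining_events": allowed_failures - actual_failures,
        "budget_consumed_pct": (
            100 * actual_failures / allowed_failures if allowed_failures else float("inf")
        ),
    }

# Example: 99.9% SLO over a 30-day window with 10M requests, 7,500 of which failed.
report = error_budget_report(good_events=9_992_500, total_events=10_000_000, slo=0.999)
print(report)  # 10,000 allowed failures, 7,500 consumed -> 75% of the budget burned
```

If the consumed percentage approaches 100%, the error budget policy kicks in: feature work pauses and reliability work takes priority.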
2 / 5
The interviewer asks: "Walk me through how you would run an incident response for a P1 — your main API is returning 500s for all users." Which answer demonstrates the clearest incident management process?
Option C is the strongest: it gives the complete incident lifecycle with named phases, specific time targets (5-minute communication, 15–30 minute updates), names the tooling (PagerDuty, observability stack, status page), mentions the "contain first, diagnose second" principle, and ends with the post-mortem step. The priority order in incident response is critical: Contain (stop the bleeding) BEFORE Diagnose (find root cause). Many junior engineers make the mistake of diving into diagnosis while users are still fully affected. Key incident response vocabulary: P1/P2/P3 — severity levels (P1 = complete outage). Incident commander — person coordinating response, not necessarily fixing it. On-call rotation — who gets paged. Rollback — revert to last known good version. Circuit breaker — automatically stop requests to a failing dependency. Status page — public (or internal) communication channel. Blameless post-mortem — focuses on systemic failures, not individual blame. MTTR (Mean Time to Recovery) — key SRE metric for incident response speed. Option B is good but less structured. Option D is too vague to demonstrate real incident response ownership.
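The circuit breaker in that vocabulary list is the most code-shaped item. A toy sketch of the idea, using only the standard library; the thresholds and names are illustrative, not a production implementation:

```python
import time

# Toy circuit breaker: stop calling a failing dependency and fail fast instead.
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (requests flow)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: dependency is failing, not calling it")
            self.opened_at = None    # half-open: allow one trial request through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        return result
```

In an incident, this is what "automatically stop requests to a failing dependency" means in practice: after repeated failures the breaker opens, callers fail fast, and the dependency gets room to recover.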
3 / 5
The interviewer asks: "What is toil, and how have you worked to reduce it?" Which answer best demonstrates SRE thinking?
Option B is the strongest because it gives the precise SRE definition of toil with all its components, distinguishes toil from overhead and from engineering work (a nuance most candidates miss), gives three concrete examples with measurable impact, and references the Google SRE guideline of keeping toil below 50% of each engineer's time.

The Google SRE definition of toil has specific properties — an answer that lists all of these shows genuine familiarity with the SRE literature:
Manual — requires human action.
Repetitive — done repeatedly, not once.
Automatable — could be done by a machine.
Tactical — interrupt-driven, reactive.
No lasting value — the service is the same after as before.
Scales with service growth — if traffic doubles, toil doubles.

The distinction between toil and overhead matters: meetings, on-call shifts, and reading documentation are overhead, not toil. Engineering work that builds automation to eliminate future toil is the opposite of toil. In interviews, name specific automation you built: "I wrote a script that automated X, eliminating Y hours of manual work per month."

Options C and D are reasonable but lack the specific definition components that demonstrate SRE book knowledge.
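As a hedged example of what "a script that automated X" might look like, here is a hypothetical stdlib-only job that replaces a weekly manual cleanup of old build artifacts. The path and retention period are invented for illustration:

```python
#!/usr/bin/env python3
# Hypothetical toil-reduction script: delete build artifacts older than 14 days
# on a schedule, instead of someone cleaning them up by hand every week.
import time
from pathlib import Path

ARTIFACT_DIR = Path("/var/artifacts")   # illustrative path
MAX_AGE_DAYS = 14

def remove_stale_artifacts() -> int:
    cutoff = time.time() - MAX_AGE_DAYS * 86_400
    removed = 0
    for path in ARTIFACT_DIR.glob("*.tar.gz"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed

if __name__ == "__main__":
    print(f"removed {remove_stale_artifacts()} stale artifacts")
```

Run from cron or a scheduler, this is exactly the shape of answer interviewers want: a repetitive manual task converted into an automated one, with hours saved per month as the measurable impact.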
4 / 5
The interviewer asks: "How do you approach observability — and what's the difference between observability and monitoring?" Which answer best demonstrates depth?
Option B is the strongest: it makes the precise conceptual distinction (known unknowns vs unknown unknowns), explains all three pillars with their specific purpose, and grounds the answer in a concrete implementation (OpenTelemetry).

The monitoring vs. observability distinction for SRE interviews:
Monitoring — watching predefined metrics against thresholds to detect known failure modes. "Alert when error rate > 1%."
Observability — the ability to understand any system state by querying its external outputs, including failure modes you didn't anticipate. The "known unknowns vs unknown unknowns" framing is the professional SRE articulation.

The three pillars:
Metrics — time-series aggregates (Prometheus, Datadog). High volume, low cardinality. Good for dashboards and alerts.
Logs — structured event records (Loki, Splunk, CloudWatch). High volume, arbitrary context. Good for diagnosis.
Traces — distributed request flows across services (Jaeger, Tempo, Honeycomb). Low volume, high context. Good for latency diagnosis and dependency mapping.

OpenTelemetry — the CNCF standard for instrumenting services to emit all three signals with a single, vendor-neutral SDK. Smart candidates mention it to show they're following industry standards.
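A minimal sketch of what "all three signals" looks like in one request handler, assuming a recent opentelemetry-api for Python; exporter and SDK wiring are omitted, and the span, counter, and attribute names are illustrative:

```python
# Sketch: one handler emitting a trace span, a metric, and a log record.
import logging
from opentelemetry import trace, metrics

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("app.requests")  # metric: aggregate counts
log = logging.getLogger(__name__)                        # log: per-event context

def handle_request(user_id: str):
    # trace: one span per request, correlating work across services
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("app.user_id", user_id)
        request_counter.add(1, {"route": "/checkout"})
        log.info("checkout started", extra={"user_id": user_id})
        ...  # business logic
```

The metric feeds dashboards and alerts (monitoring), while the span and log carry the arbitrary context you need when debugging an unknown unknown (observability).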
5 / 5
The interviewer asks: "How do you design a system for high availability, and what trade-offs does that involve?" Which answer best demonstrates systems thinking?
Option B is the strongest: it addresses HA systematically through multiple layers, precisely names the patterns (redundancy, graceful degradation, bulkheads, stateless services), acknowledges the cost and complexity trade-offs, invokes the CAP theorem to show distributed systems depth, and ends with the most important design question ("what is the acceptable degraded state?").

Key HA vocabulary and patterns:
Single point of failure (SPOF) — any component whose failure takes down the system.
Redundancy — duplicate components so failure of one doesn't stop the service.
Multi-AZ / multi-region — deployment across failure domains.
Graceful degradation — the system continues at reduced functionality when dependencies fail.
Bulkhead pattern — isolate resources so a failure in one component cannot exhaust what the others depend on (the closely related circuit breaker pattern stops calls to a failing dependency).
Stateless services — services that don't hold session state locally, enabling any instance to serve any request (critical for failover).
CAP theorem — during a network partition, a distributed system must choose between Consistency and Availability. HA systems typically choose Availability (AP systems).

Options C and D are solid technical answers but miss the graceful degradation and CAP discussion that demonstrate SRE-level thinking vs. DevOps-level thinking.
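A small sketch of graceful degradation, the pattern behind "what is the acceptable degraded state?": fall back to cached or static data when a dependency fails, so one broken feature does not become a broken page. Service and function names are invented for illustration:

```python
# Sketch of graceful degradation: if the recommendations service is down,
# serve the last good answer (or a static default) instead of failing the page.

FALLBACK_RECOMMENDATIONS = ["bestseller-1", "bestseller-2", "bestseller-3"]
_cache: dict[str, list[str]] = {}

def get_recommendations(user_id: str, fetch_live) -> list[str]:
    try:
        items = fetch_live(user_id)     # call the (possibly failing) dependency
        _cache[user_id] = items         # remember the last good answer
        return items
    except Exception:
        # Degraded state: stale data beats an error page for this feature.
        return _cache.get(user_id, FALLBACK_RECOMMENDATIONS)
```

This also shows the trade-off the question asks about: the degraded path is cheap to build, but you have accepted serving stale data, which is an availability-over-consistency choice in miniature.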