SRE / Platform Engineer
SREs are communication hubs during incidents, reliability reviews, and on-call handoffs. This path covers the precise language of error budgets, incident timelines, post-mortem facilitation, and reliability reporting.
Topics covered
- SLO/SLI/SLA
- Error budgets
- Incident response
- Post-mortem writing
- Runbooks
- Chaos engineering
Vocabulary spotlight
Four terms every SRE / Platform Engineer should know in English:
Error budget: the maximum amount of unreliability permitted by an SLO over a given period
"We've burned 70% of this quarter's error budget after Monday's incident."
SLO (Service Level Objective): a target value for a reliability metric such as availability
"Our SLO is 99.9% availability, measured over a rolling 30-day window."
Blameless post-mortem: an incident retrospective focused on systemic causes, not individual fault
"The blameless post-mortem revealed five contributing factors."
Chaos engineering: intentionally injecting faults into a system to verify resilience
"We use chaos engineering to validate that our circuit breakers actually work."
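The arithmetic behind the error budget and SLO definitions above can be sketched in a few lines. This is a minimal illustration, assuming a purely time-based availability SLO; the function names `error_budget_minutes` and `budget_consumed` are hypothetical and not taken from any particular tool.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window.

    Assumption: availability is measured as uptime / total time.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_consumed(downtime_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget used by the given amount of downtime."""
    return downtime_minutes / error_budget_minutes(slo, window_days)


# A 99.9% SLO over a rolling 30-day window allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))    # 43.2
# Roughly 30 minutes of downtime would use about 70% of that budget.
print(round(budget_consumed(30.24, 0.999), 2))  # 0.7
```

Having the numbers to hand makes a sentence like "we've burned 70% of the error budget" easier to say with confidence, because you can state both the percentage and the minutes it represents.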
📚 Vocabulary Reference
Key terms organised by category for SRE / Platform Engineers:
- Reliability Metrics
- Incident Management
- Observability
- Reliability Concepts
- Infrastructure
- Deployment Safety
Real-world scenarios you'll practise
- Writing a blameless post-mortem after a SEV-1 incident
- Presenting error budget burn rate to engineering leadership
- Facilitating a live incident call with multiple teams
- Drafting an SLO proposal for a new service
- Writing a SEV-1 customer-facing status page update — honest, calm, no jargon, regular cadence
- Explaining toil reduction to management — justifying automation investment in business terms
- Writing a capacity planning proposal — current usage, projections, recommended provisioning, cost estimate
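For the burn-rate scenario in the list above, it helps to know how the figure is usually derived before presenting it. The sketch below assumes the common definition of burn rate as the observed error ratio divided by the error ratio the SLO allows; the helper names `burn_rate` and `days_until_exhausted` are hypothetical.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent relative to the SLO's allowance.

    A value of 1.0 means the budget would last exactly one SLO window.
    """
    allowed_error_ratio = 1.0 - slo
    return error_ratio / allowed_error_ratio


def days_until_exhausted(rate: float, window_days: int = 30,
                         budget_remaining: float = 1.0) -> float:
    """Days until the remaining budget runs out if the burn rate holds steady."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return budget_remaining * window_days / rate


# 0.5% of requests failing against a 99.9% SLO is a burn rate of about 5.
rate = burn_rate(error_ratio=0.005, slo=0.999)
print(round(rate, 2))                        # 5.0
# At that rate, a full 30-day budget would be gone in about 6 days.
print(round(days_until_exhausted(rate), 1))  # 6.0
```

For a leadership audience, the second number is usually the more useful one: "at the current rate we run out of budget in six days" needs no explanation of what a burn rate is.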
🎯 Interview questions specific to this role
Practise answering these questions out loud — or in writing. Each question targets a real interviewer concern for SRE / Platform Engineers.
- How do you define and enforce SLOs for a new service?
- Walk me through how you would handle a SEV-1 incident from alert to post-mortem.
- How do you balance feature velocity with reliability work?
- What is an error budget and how have you used one in practice?
- How do you communicate reliability metrics to non-technical stakeholders?
Reference glossaries for SRE / Platform Engineers
Deep-dive glossaries covering terminology specific to this role: