Advanced, 6 topic areas, 64+ exercises

SRE / Platform Engineer

SREs are communication hubs during incidents, reliability reviews, and on-call handoffs. This path covers the precise language of error budgets, incident timelines, post-mortem facilitation, and reliability reporting.


Topics covered

SLO/SLI/SLA
Error budgets
Incident response
Post-mortem writing
Runbooks
Chaos engineering

Vocabulary spotlight

Four terms every SRE / Platform Engineer should know in English:

error budget n.

The maximum amount of unreliability permitted by an SLO over a given period

"We've burned 70% of this quarter's error budget after Monday's incident."
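The arithmetic behind a sentence like this can be sketched in a few lines. This is a minimal illustration, assuming a time-based availability SLO of 99.9%, a 90-day quarter, and a hypothetical 91 minutes of downtime; the function names are invented for this example and none of the figures come from a real service.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_burned_fraction(downtime_minutes: float, slo_target: float,
                           window_days: int) -> float:
    """Share of the error budget consumed by the observed downtime."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


# Assumed figures: 99.9% SLO, 90-day quarter, 91 minutes of downtime so far.
budget = error_budget_minutes(0.999, 90)
burned = budget_burned_fraction(91.0, 0.999, 90)

print(f"Error budget: {budget:.1f} minutes")  # Error budget: 129.6 minutes
print(f"Burned: {burned:.0%}")                # Burned: 70%
```

With these assumed numbers, roughly 39 minutes of budget remain for the rest of the quarter, which is the figure an SRE would normally quote alongside the percentage.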

SLO n.

Service Level Objective — a target value for a reliability metric such as availability

"Our SLO is 99.9% availability, measured over a rolling 30-day window."
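What "rolling 30-day window" means in practice can be shown with a small sketch. It assumes a request-based availability SLI (good requests divided by total requests) recorded once per day; the class name and the traffic figures are hypothetical.

```python
from collections import deque


class RollingAvailability:
    """Availability over the most recent N days of (good, total) request counts."""

    def __init__(self, window_days: int = 30) -> None:
        # A bounded deque drops the oldest day automatically when a new one arrives.
        self.days = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def availability(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return good / total if total else 1.0

    def meets(self, slo_target: float) -> bool:
        return self.availability() >= slo_target


window = RollingAvailability(window_days=30)

# Assumed traffic: 30 healthy days, 100,000 requests each, 50 failures per day.
for _ in range(30):
    window.record_day(good=99_950, total=100_000)
healthy = window.meets(0.999)

# One bad day (5,000 failed requests) replaces the oldest day in the window.
window.record_day(good=95_000, total=100_000)
after_incident = window.meets(0.999)
```

Here `healthy` is `True` and `after_incident` is `False`: a single bad day pulls the 30-day figure below 99.9%, and it keeps affecting the SLO until that day ages out of the window.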

blameless post-mortem n.

An incident retrospective focused on systemic causes, not individual fault

"The blameless post-mortem revealed five contributing factors."

chaos engineering n.

Intentionally injecting faults into a system to verify resilience

"We use chaos engineering to validate that our circuit breakers actually work."
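The idea in this example sentence can be illustrated with a toy experiment. The sketch below assumes a deliberately simplified circuit breaker (it opens after three consecutive failures and has no half-open or recovery state) and a hypothetical in-process dependency with a fault-injection switch. Real chaos experiments run against real infrastructure with dedicated tooling; every name here is invented for illustration.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    """Simplified breaker: opens after N consecutive failures, never closes again."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.open = False

    def call(self, func):
        if self.open:
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = func()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.open = True
            raise
        self.consecutive_failures = 0
        return result


class FlakyDependency:
    """Hypothetical downstream service with a fault-injection switch."""

    def __init__(self) -> None:
        self.inject_fault = False
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.inject_fault:
            raise ConnectionError("injected fault")
        return "ok"


def run_experiment():
    dependency = FlakyDependency()
    breaker = CircuitBreaker(failure_threshold=3)

    # Steady state: the call succeeds through the breaker.
    assert breaker.call(dependency.fetch) == "ok"

    # Inject the fault, then keep sending traffic.
    dependency.inject_fault = True
    for _ in range(5):
        try:
            breaker.call(dependency.fetch)
        except (ConnectionError, CircuitOpenError):
            pass

    # Hypothesis: the breaker opened and shielded the dependency from later calls.
    return breaker.open, dependency.calls


breaker_open, dependency_calls = run_experiment()
```

After the run, `breaker_open` is `True` and `dependency_calls` is 4: one healthy call plus three failures, with the last two requests rejected by the breaker before they reached the dependency.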


📚 Vocabulary Reference

Key terms organised by category for SRE / Platform Engineers:

Reliability Metrics

SLO, SLI, SLA, error budget, burn rate, alert threshold, MTTR, MTBF, availability, uptime, five nines

Incident Management

incident, SEV-1, SEV-2, on-call, page, escalation, incident commander, responder, mitigation, resolution, blameless post-mortem

Observability

metric, log, trace, span, distributed tracing, cardinality, p95, p99, dashboard, alert, runbook, playbook

Reliability Concepts

toil, automation, capacity planning, load shedding, graceful degradation, circuit breaker, retry with backoff, chaos engineering, fault injection

Infrastructure

Kubernetes, Prometheus, Grafana, PagerDuty, Datadog, cluster, node pool, resource limit, taint, toleration

Deployment Safety

canary, blue-green, rolling update, rollback, change freeze, deployment gate, smoke test, readiness probe, liveness probe


Recommended exercises

Vocabulary: SRE & Reliability Vocabulary (30 exercises)
Vocabulary: Performance & Monitoring Collocations (5 exercises)
Writing: Writing Post-Mortems & Incident Reports (3 exercises)
Writing: Write a Runbook Entry for a Common Alert (3 exercises)
Writing: Write an On-Call Handover Note (3 exercises)
Writing: Post-Incident Email to Stakeholders (4 exercises)
Reading: Reading Error Logs & Kubernetes Events (3 exercises)
Speaking: Conduct an Incident Call — Assign Roles & Declare Resolved (8 exercises)
Interview: SRE Engineer Interview Questions (5 exercises)

Real-world scenarios you'll practise

Writing a blameless post-mortem after a SEV-1 incident
Presenting error budget burn rate to engineering leadership
Facilitating a live incident call with multiple teams
Drafting an SLO proposal for a new service
Writing a SEV-1 customer-facing status page update — honest, calm, no jargon, regular cadence
Explaining toil reduction to management — justifying automation investment in business terms
Writing a capacity planning proposal — current usage, projections, recommended provisioning, cost estimate

🎯 Interview questions specific to this role

Practise answering these questions out loud — or in writing. Each question targets a real interviewer concern for SRE / Platform Engineers.

How do you define and enforce SLOs for a new service?
Walk me through how you would handle a SEV-1 incident from alert to post-mortem.
How do you balance feature velocity with reliability work?
What is an error budget and how have you used one in practice?
How do you communicate reliability metrics to non-technical stakeholders?


Reference glossaries for SRE / Platform Engineers

Deep-dive glossaries covering terminology specific to this role:

CLI Commands Reference
Cloud Services Cheat Sheet
HTTP Status Codes

