English for SRE Engineers: SLO, SLA, Error Budget, and Incident Language
The professional English vocabulary and communication patterns for Site Reliability Engineers: SLI/SLO/SLA, error budgets, incident command, post-mortems, and reliability reporting.
Site Reliability Engineers have one of the most language-rich roles in software. An SRE writes runbooks, leads live incident calls, presents error budget burn rates to leadership, facilitates post-mortem meetings, and drafts SLO and SLA definitions, the latter in contractual language. Each of these requires precise, professional English — vague language in an incident response can cost minutes; unclear SLO definitions lead to disputes.
This guide covers the core SRE vocabulary and the communication patterns for each context.
The reliability measurement vocabulary
SLI, SLO, and SLA
These three terms form the language of reliability commitments. Non-native speakers often confuse them because they sound similar. The distinction is critical.
SLI — Service Level Indicator
An SLI is a specific metric you measure. It is a number that quantifies one aspect of a service’s reliability.
“Our primary SLI for the checkout API is request success rate — the percentage of requests that return a non-5xx response.”
“We track three SLIs for the payments service: availability (success rate), latency (P99 < 200ms), and freshness (payment status updates within 30 seconds).”
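A minimal sketch of how a success-rate SLI like this might be computed from raw status codes. The non-5xx rule follows the checkout example above; the function name and data shape are illustrative assumptions, not a prescribed implementation — in practice the monitoring system computes SLIs from request logs or metrics.

```python
def availability_sli(status_codes: list[int]) -> float:
    """Success-rate SLI as a percentage: requests that did not return a 5xx.

    Illustrative sketch only; real SLIs usually come from the monitoring
    pipeline, not from application code.
    """
    if not status_codes:
        return 100.0  # no traffic in the window, nothing counts against the SLI
    good = sum(1 for code in status_codes if code < 500)
    return 100.0 * good / len(status_codes)

# 9,990 successful requests and 10 server errors out of 10,000 -> 99.9
print(availability_sli([200] * 9990 + [503] * 10))
```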
SLO — Service Level Objective
An SLO is a target value (or range) for an SLI over a time window. It is an internal goal — your engineering standard.
“Our SLO is 99.9% availability measured over a rolling 28-day window.”
“The latency SLO: 99th percentile response time must remain below 500ms.”
“We have a tiered SLO: 99.95% for paid customers and 99.5% for free-tier users — the two groups are served by separate infrastructure.”
SLA — Service Level Agreement
An SLA is a contractual commitment to external customers, usually with financial penalties for breach. Your SLA is typically less demanding than your SLO — the gap is your safety buffer.
“Our SLA promises 99.9% monthly availability. Our internal SLO target is 99.95% — we maintain a 0.05% buffer above the contractual commitment.”
“If we breach the SLA, customers are credited according to the downtime table in section 3 of the contract.”
“The SLA only covers the API endpoints specified in Appendix A — our internal tooling is out of scope.”
Error budget
The error budget is the amount of unreliability your SLO permits in a given period. It is the gap between 100% and your SLO target.
| SLO | Error budget per 30 days |
|---|---|
| 99% | 7.2 hours |
| 99.5% | 3.6 hours |
| 99.9% | 43.2 minutes |
| 99.95% | 21.6 minutes |
| 99.99% | 4.3 minutes |
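The table values follow from simple arithmetic: the budget is the (100% minus SLO) share of the window. A minimal sketch reproducing the figures above, assuming a 30-day (43,200-minute) window:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes: (100% - SLO) applied to the window length."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (100.0 - slo_percent) / 100.0 * window_minutes

for slo in (99.0, 99.5, 99.9, 99.95, 99.99):
    print(f"{slo}% SLO -> {error_budget_minutes(slo):.1f} minutes of budget")
# 99.9% -> 43.2 minutes, 99.99% -> 4.3 minutes, matching the table
```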
Talking about error budget consumption:
“We’ve burned 65% of this month’s error budget in the first two weeks — mostly from the database failover incident on the 8th.”
“At the current burn rate, we’ll exhaust the error budget by the 22nd of the month.”
“After Monday’s incident, we have only 8 minutes of error budget remaining for the rest of the quarter.”
Decision language around error budgets:
“Error budget policy: if we exhaust 50% of the error budget before the midpoint of the period, we freeze non-critical deployments and focus on reliability improvements.”
“We have surplus error budget this quarter — we can afford to run the migration experiment next week.”
“The product team wants to launch a new feature that requires a risky database schema change. We don’t have enough error budget to absorb the risk — we’ve pushed back the launch date.”
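Behind statements like these is a small amount of arithmetic: compare the fraction of budget consumed with the fraction of the window elapsed. A minimal sketch, using figures that happen to match the quotes above; the freeze rule mirrors the example policy quoted earlier, and the specific numbers are illustrative:

```python
def budget_report(budget_minutes: float, consumed_minutes: float,
                  elapsed_days: float, window_days: float = 30.0) -> None:
    """Report error budget consumption, projected exhaustion, and the freeze rule."""
    consumed = consumed_minutes / budget_minutes   # fraction of the budget spent
    daily_burn = consumed_minutes / elapsed_days   # minutes of budget burned per day
    days_left = (budget_minutes - consumed_minutes) / daily_burn if daily_burn else float("inf")
    print(f"Burned {consumed:.0%} of the budget after {elapsed_days:.0f} days")
    print(f"At this rate the budget is exhausted in {days_left:.0f} more days")
    # Example policy: 50% of the budget gone before the window midpoint -> freeze.
    if consumed >= 0.5 and elapsed_days < window_days / 2:
        print("Policy triggered: freeze non-critical deployments")

# 43.2-minute budget (99.9% SLO), 28 minutes burned in the first 14 days:
# roughly 65% consumed, exhaustion projected around day 22, freeze triggered.
budget_report(budget_minutes=43.2, consumed_minutes=28.0, elapsed_days=14)
```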
Incident communication
Incidents are the highest-stakes communication context for SREs. Language must be clear, calm, and structured — both during the incident and in the post-mortem.
Incident severity levels
Most teams use a numbered severity scale. Use these consistently:
“This is a SEV-1 — we have complete unavailability of the payment service and are losing revenue. All engineers on the incident response team are paged.”
“SEV-2 — degraded performance on the search API, 30% error rate. Some users are affected but the service is partially functional.”
“SEV-3 — a non-critical background job is failing. No user-facing impact, but we need to investigate before the daily data export tonight.”
Incident command language
During a live incident call, clear role assignment and communication structure prevent chaos.
Opening an incident:
“I’m opening an incident for the authentication service outage. Current status: 100% of login attempts are failing. Incident commander: me. Communication lead: [name]. I need an engineer from the auth team and one from infrastructure on this call now.”
Assigning tasks:
“Can you take the network side — check if this is isolated to one AZ or cross-regional?”
“[Name], can you be the communication lead? Every 15 minutes post an update to the #incidents channel.”
“I need someone to check the database connection pool saturation — that was the cause last time we saw errors like these.”
Requesting status updates:
“What are you seeing on the database side? Any unusual load?”
“Where are we on the rollback? When can we expect the previous version to be serving traffic?”
“Can you confirm: is this affecting all regions, or only us-east-1?”
Keeping stakeholders informed (external communication):
“Update 14:35 UTC: We are investigating elevated error rates on the authentication service. ETA for resolution is unknown at this time. We will update in 15 minutes.”
“Update 14:50 UTC: Root cause identified — a configuration change deployed at 14:10 UTC caused the database connection pool to be exhausted. We are rolling back now. ETA: 10 minutes.”
“Update 15:05 UTC: The rollback is complete and the error rate has returned to normal. The authentication service is fully operational. We will file a post-mortem within 48 hours.”
Closing an incident:
“The incident is resolved as of 15:05 UTC. Total duration: 55 minutes. I’ll be writing up the post-mortem and will share the draft for review by end of tomorrow.”
Incident timeline language
The incident timeline is a critical artifact — it must be precise about what happened when.
“14:10 UTC — Configuration change deployed to production (deployment ID: 8294)”
“14:23 UTC — First alert: auth-service error rate exceeded 5% SLO threshold”
“14:28 UTC — Incident declared, incident commander assigned”
“14:31 UTC — Initial hypothesis: connection pool exhaustion under investigation”
“14:48 UTC — Root cause confirmed: the new connection pool size (10) was insufficient for peak load. Previous value: 50.”
“14:52 UTC — Rollback initiated”
“15:05 UTC — Rollback complete, error rate returning to baseline”
“15:12 UTC — Service fully recovered, incident resolved”
Useful timeline adverbs and connectors:
- “Initially, the on-call engineer suspected…”
- “Subsequently, it became clear that…”
- “At this point, the team shifted focus to…”
- “In parallel, the communication lead was updating the status page.”
- “The root cause was not identified until 14:48…”
- “Shortly after the rollback began…”
Post-mortem writing
A post-mortem (also called a retrospective, incident review, or learning review) is a written analysis of what went wrong, why, and what will be done to prevent recurrence. The tone is blameless — it analyses systems and processes, not individuals.
Blameless language
Blaming (avoid):
❌ “The engineer deployed the wrong configuration.”
❌ “The developer forgot to test the connection pool settings.”
Blameless (use):
✅ “The configuration value was changed without a corresponding change in the deployment checklist.”
✅ “The connection pool settings are not currently included in the pre-deploy validation suite.”
✅ “There was no automated check that would have caught the misconfiguration before production.”
The shift is from who did something to what conditions made it possible. A single mistyped value causing a 55-minute outage indicates a systemic gap in validation — not a personal failure.
Post-mortem structure and language
Summary (2-3 sentences at the top):
“On March 18, 2026, the authentication service experienced a 55-minute outage caused by a configuration change that reduced the database connection pool size. The incident affected 100% of users attempting to log in. This post-mortem reviews the contributing factors, timeline, and action items.”
Impact section:
“The incident affected all users attempting to authenticate between 14:10 and 15:05 UTC — approximately 55 minutes.”
“Based on login attempt volume at that time of day, approximately 12,000 login attempts failed.”
“No data was lost. No customer data was compromised. The service returned to normal operation without data recovery.”
Root cause:
“The root cause was a configuration change that set the database connection pool size to 10, down from the previous value of 50. At peak load (~400 requests/second), the reduced pool was quickly exhausted, and the service began rejecting new requests with timeout errors.”
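A rough sanity check of why the reduced pool failed strengthens a root-cause section. Using Little's law (concurrent connections needed is roughly request rate times the time each request holds a connection), and assuming an illustrative 25 ms average hold time, a pool of 10 leaves no headroom at all:

```python
# Little's law sketch: connections needed ~= arrival rate x connection hold time.
# The 25 ms average hold time is an assumed figure for illustration only.
peak_rps = 400                # peak request rate from the incident
avg_hold_seconds = 0.025      # assumed average time a request holds a DB connection

connections_needed = peak_rps * avg_hold_seconds
print(f"Average concurrent connections needed at peak: {connections_needed:.0f}")
# ~10: the new pool size of 10 has zero headroom, so any burst or latency spike
# exhausts it, while the previous value of 50 left roughly 5x headroom.
```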
Contributing factors (systemic):
“The connection pool setting is configurable through the infrastructure-as-code repository, which does not have a validation step that checks connection pool sizing against expected load.”
“The staging environment does not replicate production traffic volume, so the misconfiguration did not surface before deployment.”
“The monitoring alert threshold for connection pool saturation was set to 95%, which was not reached until the service was already failing.”
Action items:
Write action items as specific, assignable tasks with owners and due dates:
“Add connection pool sizing to the pre-deploy checklist: the minimum pool size must be justified in the PR if changed. Owner: [name]. Due: March 25.”
“Update the staging environment load generator to simulate peak production traffic. Owner: [name]. Due: April 5.”
“Lower the connection pool saturation alert threshold from 95% to 70%. Owner: [name]. Due: March 20.”
What makes a good post-mortem
The five-why analysis is a common technique for getting to root causes:
- Why did users fail to log in? → The auth service returned errors.
- Why did the auth service return errors? → Database connections were exhausted.
- Why were database connections exhausted? → The pool size was set to 10 instead of 50.
- Why was the pool size set incorrectly? → A configuration change was made without validating against load requirements.
- Why was there no validation? → There is no automated check for connection pool sizing in the deployment pipeline.
Result: the action item is not “check connection pool settings before deploying” (a manual process) — it is “add automated validation of connection pool sizing in CI/CD.” Systems solutions, not people solutions.
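As a concrete illustration, such a pipeline check could be a small script that fails the build when the configured pool size drops below a capacity-derived floor. This is a minimal sketch under assumed names: the config file layout, the load figures, and the 2x headroom factor are all invented for the example.

```python
import json
import sys

# Illustrative capacity figures; in practice these come from capacity planning.
PEAK_RPS = 400
AVG_HOLD_SECONDS = 0.025   # assumed average time a request holds a connection
HEADROOM = 2.0             # require twice the theoretical minimum

def validate_pool_size(config_path: str) -> int:
    """Return a non-zero exit code if the configured pool size is too small."""
    with open(config_path) as f:
        pool_size = int(json.load(f)["db"]["connection_pool_size"])
    required = PEAK_RPS * AVG_HOLD_SECONDS * HEADROOM
    if pool_size < required:
        print(f"FAIL: connection_pool_size={pool_size}, minimum required is {required:.0f}")
        return 1
    print(f"OK: connection_pool_size={pool_size} meets the minimum of {required:.0f}")
    return 0

if __name__ == "__main__":
    sys.exit(validate_pool_size(sys.argv[1]))
```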
Runbook language
A runbook is a step-by-step operational guide for executing a common task or responding to an alert. Clarity and precision are critical — runbooks are often read under pressure during an incident.
Runbook style principles:
- One action per step
- Use imperative mood (“Run the following command”, not “The following command should be run”)
- Include expected output
- Include what to do if the step fails
Example runbook excerpt:
Step 3: Verify current connection pool configuration
Run the following command to retrieve the current pool size from the parameter store:
```bash
aws ssm get-parameter --name "/prod/db/connection_pool_size"
```
Expected output: the value should be 50 (or higher). If the value is below 20, proceed to Step 4 (emergency override). If the parameter does not exist, escalate to the infrastructure team.
Step 4: Apply emergency configuration override
If the connection pool size is confirmed to be insufficient, apply the override using the following command. This change takes effect immediately without a deployment:
```bash
aws ssm put-parameter --name "/prod/db/connection_pool_size" --value "50" --overwrite
```
After applying the override, verify service recovery by monitoring the connection pool saturation metric in Grafana (dashboard: “DB Connection Pool — Production”). Allow 2 minutes for the service to normalise before declaring recovery.
Reliability reporting vocabulary
SREs regularly present reliability data to engineering leadership and business stakeholders. Here are phrases for different situations:
Reporting a good period:
“We maintained 99.97% availability this month — well above our 99.9% SLO. Error budget consumption was 30%.”
“Incident count is down 40% quarter-over-quarter, and mean time to resolution (MTTR) improved from 47 minutes to 22 minutes.”
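MTTR and MTTD (see the reference table at the end) are plain averages over incident timestamps; measured here from start of impact to resolution, though teams vary in exactly which timestamps they use. A minimal sketch, reusing the timeline from the earlier example plus one invented incident:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# (impact start, detected, resolved) in UTC; the first row matches the earlier
# post-mortem example, the second is invented for illustration.
incidents = [
    ("2026-03-18 14:10", "2026-03-18 14:23", "2026-03-18 15:05"),
    ("2026-03-02 09:40", "2026-03-02 09:44", "2026-03-02 10:06"),
]

def minutes(a: str, b: str) -> float:
    return (datetime.strptime(b, FMT) - datetime.strptime(a, FMT)).total_seconds() / 60

mttd = sum(minutes(start, seen) for start, seen, _ in incidents) / len(incidents)
mttr = sum(minutes(start, fixed) for start, _, fixed in incidents) / len(incidents)
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min across {len(incidents)} incidents")
```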
Reporting a difficult period:
“This month’s availability dropped to 99.85% — below our 99.9% SLO target. The primary driver was three incidents in the payments service, totalling 65 minutes of downtime.”
“We burned through the full Q1 error budget in the first six weeks. The team is in reliability focus mode: no new feature deployments for the remainder of the quarter.”
Presenting trends:
“The rolling 28-day availability trend has been improving since we implemented the circuit breaker pattern in February.”
“P99 latency has been consistently within SLO, but P99.9 exceeds the threshold on approximately 2% of days — there is a tail latency problem we have not yet diagnosed.”
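P99 and P99.9 are simply high quantiles of the latency distribution, and a statement like this is easier to interpret with a concrete picture of how far apart they can sit. A minimal sketch on synthetic data; the traffic shape is invented purely to illustrate a heavy tail:

```python
import random

random.seed(0)
# Synthetic latencies: mostly fast requests with a small, very slow tail.
latencies_ms = (
    [random.gauss(120, 30) for _ in range(9_950)]     # typical requests
    + [random.gauss(1200, 200) for _ in range(50)]    # 0.5% slow outliers
)

def percentile(values: list[float], p: float) -> float:
    """Simple floor-rank percentile; adequate for an illustration."""
    ordered = sorted(values)
    return ordered[int(p / 100 * (len(ordered) - 1))]

print(f"P99   = {percentile(latencies_ms, 99):.0f} ms")
print(f"P99.9 = {percentile(latencies_ms, 99.9):.0f} ms")
# P99 stays comfortably under a 500 ms threshold while P99.9 is far above it.
```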
Setting expectations:
“The database migration planned for next month carries a risk of 15 minutes of downtime. We have sufficient error budget to absorb this if we stay within the maintenance window.”
“If the migration goes as planned, our error budget position at month end will be approximately 60% remaining.”
Quick SRE vocabulary reference
| Term | Meaning |
|---|---|
| SLI | Service Level Indicator — what you measure |
| SLO | Service Level Objective — your internal target |
| SLA | Service Level Agreement — your contractual commitment |
| error budget | Allowed unreliability = 100% − SLO |
| burn rate | Speed at which error budget is being consumed |
| toil | Manual, repetitive, automatable operational work |
| MTTR | Mean Time to Recovery/Resolution |
| MTTF | Mean Time to Failure |
| MTTD | Mean Time to Detect |
| runbook | Step-by-step guide for a repeatable operation or alert response |
| blameless post-mortem | Root-cause analysis focused on systems, not individuals |
| incident commander | Person coordinating the response during a live incident |
| SEV-1/2/3 | Severity levels for incidents (SEV-1 = critical) |
| on-call rotation | Schedule determining which engineer responds to alerts at any given time |
| chaos engineering | Intentionally injecting failures to test resilience |
| canary deployment | Gradual traffic rollout to reduce blast radius |
| blast radius | Scope of impact if a change or failure occurs |
The precision required in SRE language reflects the precision required in the systems SREs operate. Every word in an SLO definition will eventually be contested; every step in a runbook will eventually be executed at 3am. Write with that reader in mind.