📊 Reading CloudWatch & Grafana Dashboards
Interpret real-world monitoring data and communicate about it in professional English. Five exercises covering latency percentiles, throughput, throttles, spikes, and incident description.
Exercise 1 of 5
A CloudWatch dashboard shows:
RequestCount: 1,247 req/min (baseline: ~1,200)
Latency p50: 48ms | p95: 210ms | p99: 1,840ms
ErrorRate: 0.12% (normal: < 0.05%)
CPU Utilisation: 34%
⚠ Alarm: "HighP99Latency" — TRIGGERED at 14:03 UTC
How would you describe this situation in a team Slack message?
What makes option C correct:
• States the alarm name and trigger time
• Explains what p99 means in plain language ("slowest 1%")
• Notes what is NOT affected (volume, CPU) — helps narrow down the cause
• Proposes a hypothesis (slow code path, DB/external call)
• Avoids broadcasting raw numbers without interpretation
Key vocabulary:
• p50 / p95 / p99 — percentile latency: p99 = 99% of requests complete within this time, and the slowest 1% take longer (see the sketch after this list)
• alarm threshold — a configured value that triggers a notification when exceeded
• baseline — the normal, expected measurement under typical conditions
• CPU utilisation — the percentage of compute capacity currently in use
• error rate — the proportion of requests that result in an error
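To make percentiles concrete, here is a minimal Python sketch that computes p50/p95/p99 from a sample of latencies. The sample values are invented for illustration, and the nearest-rank method below is simpler than the estimators real monitoring backends use:

import random

# Simulated request latencies in ms (illustrative values only):
# most requests are fast, a small tail is very slow.
latencies = [random.gauss(50, 10) for _ in range(990)] + \
            [random.gauss(1800, 200) for _ in range(10)]

def percentile(values, p):
    """Return the value below which p% of the sorted sample falls (nearest rank)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[index]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.0f}ms")

Run it a few times: p50 barely moves, while p99 is dominated entirely by the slow tail, which is why a healthy p50 can coexist with a triggered p99 alarm.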
Useful phrases:
• "We have a latency spike on the p99 — investigating now."
• "The alarm fired at 14:03 UTC. p50 looks healthy, so this is likely affecting a narrow segment of requests."
• "CPU is under 40%, so this isn't a resource saturation issue."
Exercise 2 of 5
A Grafana chart shows a sharp spike at 03:14 UTC:
Before 03:14: CPU ~22%, Memory ~4.1 GB, DB connections: 8
At 03:14: CPU → 94%, Memory → 7.9 GB, DB connections → 247
At 03:22: CPU → 23%, Memory → 4.3 GB, DB connections → 11
Which description of this data is most accurate for an incident report?
Option A is correct because it:
• Names all three affected metrics explicitly
• Gives exact start and end times from the chart
• Calculates the duration (03:14–03:22 = 8 minutes)
• Uses professional language ("spiked", "returned to normal")
• Avoids vague language ("maximum", "issues")
Reading this chart:
• The spike pattern (sharp rise, sustained peak, sharp recovery) is typical of a batch job, scheduled task, or traffic burst
• DB connections jumping from 8 to 247 suggests a connection pool exhaustion event — likely a query without a timeout or a loop that opened connections without releasing them (see the sketch after this list)
• Memory roughly doubling alongside DB connections suggests large result sets being held in memory
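The connection-leak hypothesis in the second bullet is easy to illustrate. A hypothetical Python sketch of the anti-pattern and its fix; the psycopg2 driver, DSN, and orders table are illustrative assumptions, not details from the incident:

import psycopg2  # assumed driver; any DB client shows the same pattern

DSN = "dbname=orders user=app"  # hypothetical connection string

def leaky_batch_job(order_ids):
    """Anti-pattern: opens one connection per order and never closes any.
    A large batch exhausts the database's connection limit."""
    for order_id in order_ids:
        conn = psycopg2.connect(DSN)  # new connection on every iteration
        cur = conn.cursor()
        cur.execute("SELECT * FROM orders WHERE id = %s", (order_id,))
        cur.fetchall()                # large result sets also pile up in memory
        # BUG: conn.close() is never called, so connections accumulate

def fixed_batch_job(order_ids):
    """Fix: one connection reused for the whole batch, released on exit."""
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            for order_id in order_ids:
                cur.execute("SELECT * FROM orders WHERE id = %s", (order_id,))
                cur.fetchall()
    finally:
        conn.close()  # always return the connection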
Key vocabulary:
• spike — a sudden, brief increase in a metric
• sustained peak — a high value that stays elevated for a period
• connection pool — a set of pre-opened DB connections reused across requests
• connection pool exhaustion — all connections in the pool are in use; new requests queue or fail
• throughput — the amount of work done per unit of time
Incident report phrases:
• "The incident window was 03:14–03:22 UTC (8 minutes)."
• "All three key metrics — CPU, memory, and DB connections — returned to baseline by 03:22."
Exercise 3 of 5
A DevOps engineer describes a Grafana alert to the team:
"We had a throughput of 50,000 events per second at peak with a p95 processing latency of 380ms. After the 14:00 deploy, throughput dropped to 31,000 events/sec and p95 climbed to 2,100ms."
What does this description tell you about the impact of the 14:00 deploy?
Option A is correct because it:
• Quantifies the throughput drop as a percentage: (50,000 − 31,000) ÷ 50,000 = 38%
• Quantifies the latency increase as a percentage: (2,100 − 380) ÷ 380 ≈ 453%
• Calls it a "performance regression" — the correct technical term
• Proposes a hypothesis ("new code slowed down processing")
Why option D is insufficient:
Raw differences (19K events/sec, 1,720ms) communicate less clearly than relative changes. A 19K drop means very different things at different baselines.
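The arithmetic behind those relative figures, as a quick Python sketch:

before_tps, after_tps = 50_000, 31_000
before_p95, after_p95 = 380, 2_100

throughput_drop = (before_tps - after_tps) / before_tps
latency_increase = (after_p95 - before_p95) / before_p95

print(f"Throughput dropped {throughput_drop:.0%}")           # 38%
print(f"p95 latency increased {latency_increase:.0%}")       # 453%
print(f"p95 is now {after_p95 / before_p95:.1f}x baseline")  # 5.5x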
Key vocabulary:
• throughput — events/requests/messages processed per unit of time
• performance regression — a deploy or change that makes the system slower or less efficient
• events per second (EPS) — common unit for stream processing, log ingestion, message queues
• p95 latency — 95% of requests complete within this time; 5% take longer
• deploy (also: release, rollout, push) — the act of shipping new code to production
Useful phrases for this situation:
• "The deploy introduced a performance regression."
• "We need to roll back — throughput dropped 38% and p95 latency is now 5× worse."
• "Let's compare the flamegraphs before and after the 14:00 deploy to identify the bottleneck."
Exercise 4 of 5
A CloudWatch metric description reads:
"Lambda function 'process-orders' — Duration (ms):
Average: 234ms | Maximum: 12,400ms | Throttles: 847 | ConcurrentExecutions: 493/500 (limit)"
At a Monday standup, how would you describe what happened?
Reading this CloudWatch output:
ConcurrentExecutions: 493/500
Lambda has a concurrency limit — the maximum number of instances running simultaneously in your AWS account/region. At 493/500, we're nearly at the cap.
Throttles: 847
When the limit is hit, Lambda throttles additional invocations: synchronous calls receive a 429 (TooManyRequestsException), while asynchronous ones are queued and retried. 847 throttles means 847 invocations were rejected or delayed.
Maximum duration: 12,400ms vs Average: 234ms
An average of 234ms with a max of 12.4s (53× higher) suggests some invocations waited in queue behind throttled requests; they weren't actually slow to execute, they were slow to start.
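To pull these numbers programmatically rather than read them off the console, a boto3 sketch; the one-hour window and period are assumptions, and per-function ConcurrentExecutions data can depend on how concurrency is configured:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)  # assumed window
dims = [{"Name": "FunctionName", "Value": "process-orders"}]

throttles = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda", MetricName="Throttles", Dimensions=dims,
    StartTime=start, EndTime=end, Period=3600,
    Statistics=["Sum"],       # total throttle events in the window
)

concurrency = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda", MetricName="ConcurrentExecutions", Dimensions=dims,
    StartTime=start, EndTime=end, Period=3600,
    Statistics=["Maximum"],   # peak simultaneous executions in the window
)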
Key vocabulary:
• throttle — to intentionally limit the rate or volume of requests; a throttle event is when a request is rejected due to limits
• concurrency limit — maximum simultaneous executions (AWS Lambda, API Gateway, etc.)
• cold start — a Lambda invocation that requires spinning up a new execution environment (adds latency)
• invocation — a single execution of a Lambda function
• queue depth — the number of pending requests waiting to be processed
Actions to suggest:
• "Request a reserved concurrency increase for this function."
• "Add an SQS queue in front to absorb bursts without dropping requests."
• "Profile the function to see if we can reduce duration and free slots faster."
Exercise 5 of 5
A week-long Grafana chart for an API service shows this pattern:
Mon–Thu: ~3.2 req/sec, p95: ~145ms, error rate: 0.01%
Fri 18:00–Mon 08:00: ~0.4 req/sec, p95: ~130ms, error rate: 0.01%
Mon 08:15: req/sec 3.2 → 14.8, p95: 145ms → 890ms, error rate: 0.01% → 1.4%
Which analysis is correct?
Option C is the strongest analysis because it:
• Identifies the structural pattern (weekday/weekend cycle) — showing understanding of the full week, not just the spike
• Measures the spike relative to weekday baseline (4.6×), not the weekend baseline
• Converts all changes to multipliers for clear communication
• Connects the symptoms to known causes (cold cache, connection pool, pre-warming)
Calculating the multipliers:
• Traffic: 14.8 ÷ 3.2 = 4.6× the weekday baseline
• Latency: 890 ÷ 145 = 6.1× the weekday baseline
• Error rate: 1.4% ÷ 0.01% = 140× the weekday baseline
Key vocabulary:
• traffic cycle / diurnal pattern — regular variation in traffic by time of day or day of week
• cold cache — a cache with no valid entries because they expired or were evicted (here, over the low-traffic weekend); first requests must hit the database
• pre-warming — sending artificial traffic before peak hours to populate caches and initialise connection pools
• Monday morning effect / thundering herd — a surge of near-simultaneous requests when users return after the weekend
• connection pool exhaustion — all available DB connections consumed simultaneously
Recommended action:
"Set up a pre-warming cron job to run at 07:45 UTC on weekdays to populate the cache and initialise connection pools before the 08:00 traffic peak."