Key monitoring vocabulary:
  • p50/p95/p99 — latency percentiles (e.g. p95 = the value 95% of requests stay under)
  • throughput — requests or events processed per second
  • throttle — a request rejected because a rate or concurrency limit was hit
  • baseline — the normal, expected value of a metric
  • spike — a sudden, brief increase in a metric
  • alarm threshold — the value that triggers a notification
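The percentile terms above can be sketched in code. A minimal example, using a small hypothetical latency sample (the numbers are illustrative, not from a real service), computes p50/p95/p99 with Python's standard library:

```python
import statistics

# Hypothetical latency samples in milliseconds (illustrative only)
latencies = [42, 48, 51, 47, 55, 49, 210, 46, 52, 1840,
             50, 44, 48, 53, 45, 47, 49, 51, 46, 48]

# statistics.quantiles with n=100 returns the 99 cut points p1..p99
q = statistics.quantiles(latencies, n=100)
p50, p95, p99 = q[49], q[94], q[98]

print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

Note how a single 1,840ms outlier barely moves p50 but dominates p99 — that is why dashboards show several percentiles, not just an average.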


Exercise 1 of 5

A CloudWatch dashboard shows:

RequestCount: 1,247 req/min (baseline: ~1,200)
Latency p50: 48ms | p95: 210ms | p99: 1,840ms
ErrorRate: 0.12% (normal: < 0.05%)
CPUUtilization: 34%

⚠ Alarm: "HighP99Latency" — TRIGGERED at 14:03 UTC


How would you describe this situation in a team Slack message?
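Before drafting the message, it can help to quantify the gap between the tail and the median. A sketch using only the figures quoted on the dashboard:

```python
# Dashboard readings from the alarm window
p50_ms, p95_ms, p99_ms = 48, 210, 1_840
error_rate, normal_error_rate = 0.12, 0.05  # percent

# Ratios suggest a slow tail rather than a broad slowdown
print(f"p99 is {p99_ms / p50_ms:.0f}x the median latency")
print(f"Error rate is {error_rate / normal_error_rate:.1f}x its normal ceiling")
```

A p99 roughly 38x the p50 while CPU sits at 34% points at a small fraction of very slow requests, which is a useful framing for the Slack message.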

Exercise 2 of 5

A Grafana chart shows a sharp spike at 03:14 UTC:

Before 03:14: CPU ~22%, Memory ~4.1 GB, DB connections: 8
At 03:14: CPU → 94%, Memory → 7.9 GB, DB connections → 247
At 03:22: CPU → 23%, Memory → 4.3 GB, DB connections → 11


Which description of this data is most accurate for an incident report?
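The shape of the spike can be checked numerically. A minimal sketch (values copied from the chart above) shows how far each metric moved from its pre-spike level:

```python
# Readings from the chart: before the spike vs. at its peak
before = {"cpu_pct": 22, "mem_gb": 4.1, "db_conns": 8}
peak   = {"cpu_pct": 94, "mem_gb": 7.9, "db_conns": 247}

for name in before:
    factor = peak[name] / before[name]
    print(f"{name}: {before[name]} -> {peak[name]} ({factor:.1f}x)")
```

DB connections jumped roughly 31x while CPU and memory only quadrupled or doubled, and everything returned to baseline within eight minutes — details worth surfacing in an incident report.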

Exercise 3 of 5

A DevOps engineer describes a Grafana alert to the team:

"We had a throughput of 50,000 events per second at peak with a p95 processing latency of 380ms. After the 14:00 deploy, throughput dropped to 31,000 events/sec and p95 climbed to 2,100ms."

What does this description tell you about the impact of the 14:00 deploy?
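One way to answer is to turn the quoted numbers into relative changes. A sketch using only the figures from the alert description:

```python
# Figures quoted in the alert description
tput_before, tput_after = 50_000, 31_000   # events/sec
p95_before, p95_after = 380, 2_100         # ms

tput_drop_pct = (tput_before - tput_after) / tput_before * 100
p95_factor = p95_after / p95_before

print(f"Throughput fell {tput_drop_pct:.0f}%; p95 latency rose {p95_factor:.1f}x")
```

A ~38% throughput drop paired with a ~5.5x latency increase after a single deploy is a strong signal that the deploy itself degraded processing.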

Exercise 4 of 5

A CloudWatch metric description reads:

"Lambda function 'process-orders' — Duration (ms):
Average: 234ms | Maximum: 12,400ms | Throttles: 847 | ConcurrentExecutions: 493/500 (limit)"


At a Monday standup, how would you describe what happened?
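One way to frame the standup summary is to quantify how close the function ran to its limits. A sketch using the figures from the metric description above:

```python
# Figures from the CloudWatch metric description
avg_ms, max_ms = 234, 12_400
throttles = 847
concurrent, limit = 493, 500

headroom = limit - concurrent
print(f"Concurrency: {concurrent}/{limit} used ({concurrent / limit:.0%}), "
      f"only {headroom} slots of headroom; {throttles} invocations throttled")
print(f"Slowest invocation ran {max_ms / avg_ms:.0f}x longer than average")
```

Running at ~99% of the concurrency limit with 847 throttles means real requests were rejected, which is the headline for the standup.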

Exercise 5 of 5

A week-long Grafana chart for an API service shows this pattern:

Mon–Thu: ~3.2 req/sec, p95: ~145ms, error rate: 0.01%
Fri 18:00–Mon 08:00: ~0.4 req/sec, p95: ~130ms, error rate: 0.01%
Mon 08:15: req/sec 3.2 → 14.8, p95: 145ms → 890ms, error rate: 0.01% → 1.4%


Which analysis is correct?
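Comparing the Monday 08:15 readings against the weekday baseline makes the candidate analyses easy to test. A sketch with the chart's figures:

```python
# Monday 08:15 readings vs. the Mon-Thu weekday baseline
baseline = {"req_per_sec": 3.2, "p95_ms": 145, "error_rate_pct": 0.01}
monday   = {"req_per_sec": 14.8, "p95_ms": 890, "error_rate_pct": 1.4}

for name in baseline:
    factor = monday[name] / baseline[name]
    print(f"{name}: {baseline[name]} -> {monday[name]} ({factor:.1f}x)")
```

Traffic rose ~4.6x while p95 rose ~6.1x and the error rate rose 140x — latency and errors grew much faster than load, which is the key distinction between the candidate analyses.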