5xx Server errors — 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout
upstream — the backend server that a proxy/load balancer forwards requests to
1 / 5
A web server access log shows:
203.0.113.15 - - [07/Apr/2026:03:14:22 +0000] "GET /api/users/profile HTTP/1.1" 401 512 "-" "Mozilla/5.0"
What does the 401 status code mean and what action should be taken?
HTTP status codes by category:
• 2xx Success — request accepted and processed
• 3xx Redirection — client must take additional action (follow redirect)
• 4xx Client Error — the request was wrong (the client made a mistake)
• 5xx Server Error — the server failed (the server made a mistake)
The most important 4xx codes:
• 400 Bad Request — malformed request syntax, invalid parameters
• 401 Unauthorized — missing or invalid authentication (despite the name, it means "unauthenticated")
• 403 Forbidden — authenticated but not authorized (you are logged in but lack permission)
• 404 Not Found — resource does not exist
• 409 Conflict — request conflicts with current state (e.g., duplicate unique field)
• 422 Unprocessable Entity — valid JSON/XML but semantically invalid (validation error)
• 429 Too Many Requests — rate limit exceeded
401 vs 403 — the key distinction:
• 401 = "Who are you? Please authenticate." → fix: send a valid token
• 403 = "I know who you are, but you can't do this." → fix: check permissions or roles
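To make the distinction concrete, here is a minimal sketch in Python of that decision inside a request handler. The token table, token strings, and permission name are invented for illustration; a real service would check a session store or verify a JWT instead.

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    permissions: set

# Hypothetical token table standing in for a real session store or JWT check.
TOKENS = {
    "token-abc": User("alice", {"read:profile"}),
    "token-xyz": User("bob", set()),
}

def get_profile_status(auth_header):
    """Return the HTTP status code the 401-vs-403 rule dictates."""
    token = (auth_header or "").removeprefix("Bearer ").strip()
    user = TOKENS.get(token)
    if user is None:
        return 401  # unknown caller: "Who are you? Please authenticate."
    if "read:profile" not in user.permissions:
        return 403  # known caller without permission: "You can't do this."
    return 200

print(get_profile_status(None))                # 401: no credentials at all
print(get_profile_status("Bearer token-xyz"))  # 403: valid token, missing permission
print(get_profile_status("Bearer token-abc"))  # 200
```

The order matters: identity is checked before permissions, which is why 401 always wins when no valid credentials are present.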
Access log format (Combined Log Format):
IP - - [timestamp] "METHOD path HTTP/version" STATUS bytes "referer" "user-agent"
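A quick way to pull those fields out of a combined-format line is a regular expression keyed to that layout. This is an illustrative sketch, not a library API; the pattern and group names are invented here:

```python
import re

# Rough regex for the Combined Log Format described above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('203.0.113.15 - - [07/Apr/2026:03:14:22 +0000] '
        '"GET /api/users/profile HTTP/1.1" 401 512 "-" "Mozilla/5.0"')

m = LOG_PATTERN.match(line)
if m:
    status = int(m.group("status"))
    print(m.group("ip"), m.group("path"), status)
    # 4xx means the client made a mistake; 401 specifically means "unauthenticated".
    if status == 401:
        print("-> missing or invalid credentials; send a valid token")
```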
2 / 5
An nginx error log shows hundreds of entries like this:
2026/04/07 03:14:22 [error] 1234#1234: *89341 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.0.1.5, server: api.example.com, request: "POST /api/payments HTTP/1.1", upstream: "http://172.16.0.10:8080/api/payments", host: "api.example.com"
What does "upstream timed out" mean in this nginx error log?
Understanding nginx upstream errors:
In nginx terminology, upstream refers to the backend application server that nginx proxies requests to.
The flow:
1. Client → nginx (acts as reverse proxy)
2. nginx → upstream app server (172.16.0.10:8080)
3. nginx waits for the upstream response
4. If the upstream doesn't respond within the timeout: "upstream timed out"
What this error tells you:
• upstream: "http://172.16.0.10:8080/api/payments" — the backend address
• 110: Connection timed out — OS-level timeout (110 is the ETIMEDOUT errno code on Linux)
• while reading response header — nginx connected but the backend never sent back any response headers
Possible root causes:
• Backend application is overloaded / thread pool exhausted
• A slow database query inside the payment handler
• The backend process crashed after accepting the connection
• Misconfigured proxy_read_timeout in nginx
Key nginx error log vocabulary:
• upstream timed out — backend didn't respond in time
• upstream connection refused — backend is not listening on that port
• upstream prematurely closed connection — backend crashed mid-response
• no live upstreams — all backend servers in the pool are down
• connect() failed (111: Connection refused) — backend port closed
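When triaging these, a small standard-library probe can tell "connection refused" apart from an OS-level timeout. A sketch, using the upstream address from the log above; note that a successful TCP connect does not rule out the logged error, because that timeout happened after the connection was already established:

```python
import errno
import socket

def probe(host, port, timeout=3.0):
    """Classify a backend roughly the way the vocabulary above does."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "backend is accepting connections"
    except socket.timeout:
        return "timed out (maps to 'upstream timed out')"
    except OSError as e:
        if e.errno == errno.ECONNREFUSED:  # 111 on Linux
            return "connection refused (nothing listening on that port)"
        if e.errno == errno.ETIMEDOUT:     # 110 on Linux, the code in the log
            return "timed out at the OS level"
        raise

print(probe("172.16.0.10", 8080))  # upstream address from the log above
```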
3 / 5
An API gateway log shows:
{"method":"GET","path":"/api/products","status":503,"upstream_status":null,"duration_ms":30001,"error":"upstream_timeout","retry_count":3,"message":"Service Unavailable"}
What is the correct interpretation of this log entry?
503 Service Unavailable is a 5xx error — meaning the server (not the client) is responsible. Specifically, 503 means the server is temporarily unable to handle the request.
Decoding this log entry field by field:
• status: 503 — what the gateway returned to the client
• upstream_status: null — gateway never got a response from upstream (null = no HTTP response received)
• duration_ms: 30001 — full 30 seconds elapsed (this is the gateway-level timeout)
• error: "upstream_timeout" — the gateway's classification of the failure
• retry_count: 3 — gateway tried 3 times (so ~10 seconds each attempt)
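The same decoding, done mechanically in Python. The field names come from the log entry above; the per-attempt arithmetic mirrors the last bullet:

```python
import json

entry = json.loads('{"method":"GET","path":"/api/products","status":503,'
                   '"upstream_status":null,"duration_ms":30001,'
                   '"error":"upstream_timeout","retry_count":3,'
                   '"message":"Service Unavailable"}')

# upstream_status null -> no HTTP response ever came back from the backend.
got_upstream_response = entry["upstream_status"] is not None
attempts = entry["retry_count"]
per_attempt_ms = entry["duration_ms"] / attempts  # roughly 10 s per attempt

print(f"returned to client: {entry['status']}")
print(f"upstream responded: {got_upstream_response}")
print(f"~{per_attempt_ms:.0f} ms per attempt over {attempts} attempts")
```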
5xx status code vocabulary:
• 500 Internal Server Error — generic server error, usually an unhandled exception
• 502 Bad Gateway — upstream sent an invalid response (not a timeout — a corrupt/wrong response)
• 503 Service Unavailable — server temporarily unavailable (overload or maintenance)
• 504 Gateway Timeout — upstream took too long to respond (the timeout reported by a gateway or proxy)
• 507 Insufficient Storage — server storage full
502 vs 503 vs 504:
• 502 = upstream returned something malformed
• 503 = upstream is down/unavailable
• 504 = upstream took too long to respond
4 / 5
A load balancer access log shows a pattern like this over 5 minutes:
10.0.1.5 GET /api/search 200 45ms
10.0.1.5 GET /api/search 200 47ms
10.0.1.5 GET /api/search 200 52ms
10.0.1.5 GET /api/search 200 1823ms
10.0.1.5 GET /api/search 200 2941ms
10.0.1.5 GET /api/search 200 5012ms
10.0.1.5 GET /api/search 503 30000ms
What story do these log lines tell when read as a sequence?
This is one of the most valuable patterns to recognise in logs: gradual degradation before failure.
Reading the sequence:
• 45ms → 47ms → 52ms: normal, stable performance
• 1823ms → 2941ms → 5012ms: latency increasing rapidly — the service is struggling
• 503 at 30000ms: complete failure — the service gave up
What causes this pattern:
• Memory leak: heap grows until GC pauses dominate, then OOM crash
• Connection pool exhaustion: fewer connections available → each request waits longer
• Thread pool saturation: all threads busy → new requests queue → queue grows → timeout
• Disk full: writes slow, then fail
• CPU throttling: container hitting CPU limits
Why this pattern is valuable: It provides lead time before failure. An alerting rule on p95 latency > 1 second would have triggered before the 503 occurred, giving engineers time to investigate before users see errors.
Latency percentile vocabulary:
• p50 (median) — 50% of requests are faster than this
• p95 / p99 — "tail latency" — the slowest 5% / 1% of requests
• latency spike — sudden increase in response time
• latency degradation — gradual increase over time
• SLO (Service Level Objective) — the target latency you commit to, e.g., "p99 < 500ms"
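As a sketch of how those percentiles behave, here the durations from the log above are run through Python's statistics module. The 1-second alert threshold is illustrative, not a standard:

```python
from statistics import quantiles

# Response times (ms) from the load-balancer log above.
durations = [45, 47, 52, 1823, 2941, 5012, 30000]

# quantiles(..., n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
cuts = quantiles(durations, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")

# Illustrative SLO-style alert: a rolling version of this check would fire
# while latency was still climbing, well before the 503 appears in the log.
THRESHOLD_MS = 1000
if p95 > THRESHOLD_MS:
    print(f"ALERT: p95 latency {p95:.0f}ms exceeds {THRESHOLD_MS}ms")
```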
5 / 5
An engineer asks: "Our API returns 404 for a user ID that definitely exists in the database. I checked the database — the user is there. What could cause a legitimate 404?" Which explanation is most likely?
A 404 from an API does not always mean the resource doesn't exist in the database. Application code can return 404 for many reasons.
The 4 most common causes of "ghost 404s":
(a) Route mismatch:
• API expects GET /users/550e8400-e29b-41d4-a716-446655440000 (UUID)
• Client sends GET /users/42 (integer)
• Router doesn't match → 404 (resource not "found" by the router)
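A tiny demonstration of that mismatch, with a hypothetical route table; the UUID pattern is the standard 8-4-4-4-12 hex form:

```python
import re

# Hypothetical route: the profile endpoint only matches a UUID path parameter.
UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
ROUTE = re.compile(rf"^/users/{UUID}$")

for path in ("/users/550e8400-e29b-41d4-a716-446655440000", "/users/42"):
    matched = ROUTE.match(path) is not None
    print(path, "->", "handler" if matched else "404 (no route matched)")
```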
(b) Soft delete / business logic: Many APIs implement soft deletes: records have a deleted_at or status: "deactivated" column. The row exists in the database but the application returns 404 to hide it from callers. This is intentional API design.
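A minimal sketch of the soft-delete behaviour, using an in-memory SQLite table with an invented schema:

```python
import sqlite3

# Hypothetical schema: the row exists, but deleted_at marks it soft-deleted.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)")
db.execute("INSERT INTO users VALUES (42, 'alice', '2026-04-01T00:00:00Z')")

def get_user_status(user_id):
    # The application filters out soft-deleted rows, so the API returns 404
    # even though a plain SELECT in the database console finds the row.
    row = db.execute(
        "SELECT id FROM users WHERE id = ? AND deleted_at IS NULL", (user_id,)
    ).fetchone()
    return 200 if row else 404

print(get_user_status(42))  # 404, although SELECT * FROM users WHERE id = 42 finds it
```

This is exactly the "I checked the database, the user is there" situation: the row is plainly in the table, yet the application-level lookup hides it.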
(c) Environment mismatch: The most embarrassing debugging mistake: checking production DB while the API call hits staging. Always verify which environment you're checking.
(d) Stale cached 404 (negative caching): A CDN or API gateway may have cached a 404 response. Even after the resource is created, the cache keeps returning the old 404. Check: an X-Cache: HIT header means a cached response was served.
How to diagnose:
1. Check the full request URL and headers in the log
2. Look at what the application code does (not just the DB)
3. Compare the environment (request Host header vs database being checked)
4. Check for an X-Cache or CF-Cache-Status header in the response (see the sketch below)
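For step 4, a small standard-library check of the cache headers. The URL is hypothetical; X-Cache and CF-Cache-Status are real header names used by common CDNs, but which one appears depends on the provider:

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

URL = "https://api.example.com/api/products/42"  # hypothetical endpoint

def fetch_cache_info(url):
    req = Request(url, method="GET")
    try:
        resp = urlopen(req, timeout=10)
        status = resp.status
    except HTTPError as e:  # 4xx/5xx responses arrive here with headers intact
        resp, status = e, e.code
    print("status:", status)
    for header in ("X-Cache", "CF-Cache-Status", "Age"):
        if resp.headers.get(header):
            print(f"{header}: {resp.headers[header]}")
    if status == 404 and "HIT" in (resp.headers.get("X-Cache") or ""):
        print("-> stale cached 404: served from cache, origin was not consulted")

fetch_cache_info(URL)
```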