Read 3 architecture review transcripts — database technology selection, sharding analysis, and event-driven trade-offs — then answer comprehension questions about the reasoning and decisions.
How to follow an architecture discussion in English
Problem first: speakers usually state the problem before proposing solutions — identify it early
Trade-off signals: "the trade-off I'm accepting", "the risk is", "the downside" — mark the key tension
Numbers matter: specific figures (800 queries/day, 90GB, p95 120ms) anchor abstract arguments — note them
Rejection reasoning: when an option is rejected, the speaker will explain why — this is the core of the argument
1 / 3
📄 Transcript
[Architecture decision discussion — internal engineering sync. Team lead presenting a proposed system change.]
Lead: "Okay, I want to walk through the proposal to move our search functionality from Elasticsearch to a Postgres full-text search setup. I know this sounds counterintuitive, so let me explain the reasoning.
Our current Elasticsearch cluster handles roughly 800 queries per day peak. We're not a search-first product — search is one feature among many, and it's used maybe 12% of sessions. The operational overhead of running a separate Elasticsearch cluster — the separate nodes, the schema management, the index tuning, the upgrade path — is disproportionate to how much search matters to our users.
Postgres full-text search using tsvector and tsquery, combined with pg_trgm for fuzzy matching, covers our query patterns. We've benchmarked it: for our dataset size — 2.4 million records — and our query patterns — primarily prefix and phrase matching — Postgres performs within acceptable latency bounds, under 120ms p95.
The trade-off I'm accepting: Elasticsearch has significantly richer relevance ranking, synonym handling, and multi-language tokenisation. If we ever need those — and nothing in our product roadmap currently requires them — we'd need to reintroduce a dedicated search solution. That's a deliberate architectural bet: we'd rather have operational simplicity now and pay the migration cost later if we need to.
Engineer: "What about the index rebuild time if we need to re-index the full dataset?"
Lead: "Good question. pg_trgm index build on 2.4M records in our test environment took about 11 minutes with the table locked for the last 2. We'd do a CONCURRENTLY index build in production — no table lock, roughly 18 minutes. That's acceptable for us."
What is the lead's core architectural argument, and what specific trade-off do they explicitly accept?
Context-specific decision: operational complexity is disproportionate to search's product importance (12% of sessions). Benchmark validated. Trade-off explicitly named and accepted.
Architecture vocabulary from this discussion:
"full-text search" = searching for words within text content (vs. exact match or structured query)
"tsvector / tsquery" = Postgres built-in types for full-text search: tsvector = preprocessed text document; tsquery = search query with operators
"pg_trgm" = Postgres extension for trigram-based fuzzy text matching ("trgm" = trigram, 3-character substrings)
"p95 latency" = 95th percentile latency — 95% of requests complete within this time; the "p95" convention is standard in performance benchmarking
"prefix matching" = search matches words by their beginning (e.g. "data" matches "database", "datastore")
"relevance ranking" = ordering results by how well they match the query (Elasticsearch's BM25 algorithm is a gold standard)
"operational overhead" = the ongoing cost in time, effort, and risk to maintain a system
"CONCURRENTLY" = a Postgres keyword for building indexes without locking the table for reads/writes
Why "deliberate architectural bet" matters: This phrase signals the lead has frameworks for making decisions under uncertainty — they're not pretending to have perfect foresight, but are making a reasoned choice with an explicit fallback plan.
2 / 3
📄 Transcript
[Architecture review — senior engineer presenting database sharding proposal.]
Senior: "I want to take 15 minutes to walk through why I'm recommending against sharding the user database right now — and what I think we should do instead.
The proposal to shard comes from a real concern: our largest tenant has 4.8 million rows, and we're running some complex aggregation queries that take 3 to 4 seconds. That's a real problem. But I think the diagnosis — 'we need sharding' — is jumping past the actual investigation.
Here's what I mean. Sharding is a solution to a specific problem: your write throughput exceeds what a single node can handle, or your dataset is so large that a single node can't hold it. Neither of those is our situation. Our write throughput is well within single-node capacity. Our dataset is 90GB — that's a large Postgres database, but it's nowhere near single-node limits.
What we actually have is a query performance problem. And query performance problems have a specific diagnostic path: explain analyse the slow queries, look at index usage, check for full table scans. When I ran that on the 3-second aggregation query, I found it was doing a sequential scan on a 4.8M-row table because the composite index we have doesn't include the tenant_id column in the right position.
Adding tenant_id as the leading column of that index brings the query to 80ms. That's a 40x improvement. No sharding required."
Reviewer: "What's the risk of that index change?"
Senior: "The index build itself — CONCURRENTLY, so no table lock. The risk is index size: it adds another 600MB. We have headroom. I also want to add a query timeout at the application level as a circuit breaker while the index builds."
What diagnostic method does the senior engineer use, and what is the key distinction they draw to argue against sharding?
Root cause analysis before solution: EXPLAIN ANALYSE reveals sequential scan due to wrong index column order. Sharding diagnosis was wrong — the problem is query performance, not write throughput or dataset size.
Database architecture vocabulary:
"sharding" = horizontally partitioning a database across multiple machines — each shard holds a subset of the data
"write throughput" = the rate at which a database can process write operations (inserts, updates, deletes)
"sequential scan" = reading every row in a table to find matching rows — slow on large tables; happens when no suitable index exists
"composite index" = an index on multiple columns together; column order matters — the leading column is used for prefix-matching scans
"leading column" = the first column in a composite index; queries filtering by this column can use the index efficiently
"EXPLAIN ANALYSE" = a Postgres command that executes a query and shows how it was executed: scan types, row estimates, actual vs. estimated costs
"circuit breaker" = a pattern that stops sending requests to a failing/slow resource to prevent cascading failures; here used as a query timeout
"horizontal vs. vertical scaling": sharding = horizontal (more machines); better indexes = making current infrastructure work better
The engineering discipline demonstrated: Hypothesis-driven debugging — "I think the diagnosis is wrong" → run diagnostics → validate. This prevents expensive architectural changes based on misdiagnosis.
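A minimal Python sketch of the diagnostic path and fix the senior engineer describes: run EXPLAIN ANALYZE, rebuild the composite index with tenant_id as the leading column, and set a statement timeout as the application-level guard. The schema ("events", "tenant_id", "created_at"), the aggregation query, the tenant id, and the timeout value are illustrative assumptions, not details from the review.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")

# Hypothetical slow aggregation query, per-tenant.
SLOW_QUERY = """
    SELECT date_trunc('day', created_at) AS day, count(*)
    FROM events
    WHERE tenant_id = %s
    GROUP BY 1
"""

# Step 1: diagnose. EXPLAIN ANALYZE executes the query and reports the plan;
# a "Seq Scan on events" line is the tell-tale full table scan.
with conn.cursor() as cur:
    cur.execute("EXPLAIN ANALYZE " + SLOW_QUERY, (4815,))  # hypothetical tenant id
    for (line,) in cur.fetchall():
        print(line)

# Step 2: fix. Build the composite index with tenant_id as the leading column,
# CONCURRENTLY so reads and writes continue during the build.
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS events_tenant_created_idx "
        "ON events (tenant_id, created_at)"
    )
conn.autocommit = False

# Step 3: guard rail. A per-session statement timeout acts as the
# application-level "circuit breaker" mentioned in the review.
with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '500ms'")
```

The sketch mirrors the order of the argument: measure first, change the index second, and only then rely on a timeout to bound the blast radius while the change rolls out.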
3 / 3
📄 Transcript
[Architecture review meeting — discussing event-driven migration proposal for a monolith.]
Architect: "Before we decide, I want to make sure we're solving the right problem. The request was to introduce an event bus to decouple the order service from the inventory service. Let me test that framing.
The pain point is: when the inventory service is slow or unavailable, the order creation path blocks. That's real — we've had three incidents this quarter where inventory latency spikes caused order creation timeouts. But 'add an event bus' isn't the only solution to that pain point. Let me name the alternatives.
Option one: event bus — orders publish an event, inventory consumes it asynchronously. Decoupled. But now order status is eventually consistent — 'your order was placed' doesn't mean inventory was actually reserved when you got that confirmation. We need to handle the case where inventory later fails to reserve. That's compensation logic, saga patterns, and significant complexity.
Option two: circuit breaker on the inventory call. Orders fail fast if inventory is slow, with a fallback response. Simpler. But orders still fail when inventory is unavailable — just faster.
Option three: optimistic inventory reservation. Deduct inventory speculatively at order time, reconcile after. Works if over-commitment is acceptable — it isn't for us.
Option four: async write-through cache. Inventory service writes its state to a cache; the order service reads from the cache synchronously. The cache is invalidated asynchronously. This gives us the fast read path, tolerates inventory service latency, and keeps the order confirmation semantics synchronous.
My recommendation is option four with a TTL-based cache invalidation policy and a background reconciliation job. It solves the actual problem — latency spikes — without the consistency complexity of full event sourcing."
What does the architect's analytical approach reveal about how they evaluate architectural options? Why do they reject the event bus (option one) specifically?
Problem reframing → four options → evaluation against specific requirements → rejection of event bus because its consistency model adds complexity disproportionate to the actual problem (latency spikes, not fundamental decoupling need).
Distributed systems and event-driven architecture vocabulary:
"event bus" = a middleware component that accepts events from producers and distributes them to consumers; enables loose coupling between services
"eventual consistency" = a consistency model where, after a write, all replicas will eventually converge — but reads may temporarily see stale data
"compensation logic" = business logic that undoes or corrects a prior action when a subsequent step fails in a distributed transaction
"saga pattern" = a way to manage distributed transactions: a sequence of local transactions, each publishing an event; if any step fails, compensating transactions undo previous steps
"circuit breaker" = a pattern that stops calls to a failing service after a threshold of failures, returning a default response; prevents cascading failures
"optimistic reservation" = speculatively reserving a resource assuming success, then correcting later if it fails
"write-through cache" = a cache that is updated synchronously when the backing store is written, keeping cache and store consistent
"TTL (Time-To-Live)" = a cache expiry duration — after TTL, the cached value is invalidated and must be refreshed
The key insight: "Event bus" is a hammer — the architect checks whether the nail is actually a consistency problem or a latency problem. It's a latency problem. The cache solves it with far less complexity.