Read 3 architecture review transcripts — database technology selection, sharding analysis, and event-driven trade-offs — then answer comprehension questions about the reasoning and decisions.
How to follow an architecture discussion in English
Problem first: speakers usually state the problem before proposing solutions — identify it early
Trade-off signals: "the trade-off I'm accepting", "the risk is", "the downside" — mark the key tension
Numbers matter: specific figures (800 queries/day, 90GB, p95 120ms) anchor abstract arguments — note them
Rejection reasoning: when an option is rejected, the speaker will explain why — this is the core of the argument
1 / 3
📄 Transcript
[Architecture decision discussion — internal engineering sync. Team lead presenting a proposed system change.]
Lead: "Okay, I want to walk through the proposal to move our search functionality from Elasticsearch to a Postgres full-text search setup. I know this sounds counterintuitive, so let me explain the reasoning.
Our current Elasticsearch cluster handles roughly 800 queries per day peak. We're not a search-first product — search is one feature among many, and it's used maybe 12% of sessions. The operational overhead of running a separate Elasticsearch cluster — the separate nodes, the schema management, the index tuning, the upgrade path — is disproportionate to how much search matters to our users.
Postgres full-text search using tsvector and tsquery, combined with pg_trgm for fuzzy matching, covers our query patterns. We've benchmarked it: for our dataset size — 2.4 million records — and our query patterns — primarily prefix and phrase matching — Postgres performs within acceptable latency bounds, under 120ms p95.
The trade-off I'm accepting: Elasticsearch has significantly richer relevance ranking, synonym handling, and multi-language tokenisation. If we ever need those — and nothing in our product roadmap currently requires them — we'd need to reintroduce a dedicated search solution. That's a deliberate architectural bet: we'd rather have operational simplicity now and pay the migration cost later if we need to.
Engineer: "What about the index rebuild time if we need to re-index the full dataset?"
Lead: "Good question. pg_trgm index build on 2.4M records in our test environment took about 11 minutes with the table locked for the last 2. We'd do a CONCURRENTLY index build in production — no table lock, roughly 18 minutes. That's acceptable for us."
What is the lead's core architectural argument, and what specific trade-off do they explicitly accept?
Context-specific decision: operational complexity is disproportionate to search's product importance (12% of sessions). Benchmark validated. Trade-off explicitly named and accepted.
Architecture vocabulary from this discussion:
"full-text search" = searching for words within text content (vs. exact match or structured query)
"tsvector / tsquery" = Postgres built-in types for full-text search: tsvector = preprocessed text document; tsquery = search query with operators
"pg_trgm" = Postgres extension for trigram-based fuzzy text matching ("trgm" = trigram, 3-character substrings)
"p95 latency" = 95th percentile latency — 95% of requests complete within this time; the "p95" convention is standard in performance benchmarking
"prefix matching" = search matches words by their beginning (e.g. "data" matches "database", "datastore")
"relevance ranking" = ordering results by how well they match the query (Elasticsearch's BM25 algorithm is a gold standard)
"operational overhead" = the ongoing cost in time, effort, and risk to maintain a system
"CONCURRENTLY" = a Postgres keyword for building indexes without locking the table for reads/writes
Why "deliberate architectural bet" matters: This phrase signals the lead has frameworks for making decisions under uncertainty — they're not pretending to have perfect foresight, but are making a reasoned choice with an explicit fallback plan.
2 / 3
📄 Transcript
[Architecture review — senior engineer presenting database sharding proposal.]
Senior: "I want to take 15 minutes to walk through why I'm recommending against sharding the user database right now — and what I think we should do instead.
The proposal to shard comes from a real concern: our largest tenant has 4.8 million rows, and we're running some complex aggregation queries that take 3 to 4 seconds. That's a real problem. But I think the diagnosis — 'we need sharding' — is jumping past the actual investigation.
Here's what I mean. Sharding is a solution to a specific problem: your write throughput exceeds what a single node can handle, or your dataset is so large that a single node can't hold it. Neither of those is our situation. Our write throughput is well within single-node capacity. Our dataset is 90GB — that's a large Postgres database, but it's nowhere near single-node limits.
What we actually have is a query performance problem. And query performance problems have a specific diagnostic path: explain analyse the slow queries, look at index usage, check for full table scans. When I ran that on the 3-second aggregation query, I found it was doing a sequential scan on a 4.8M-row table because the composite index we have doesn't include the tenant_id column in the right position.
Adding tenant_id as the leading column of that index brings the query to 80ms. That's a 40x improvement. No sharding required."
Reviewer: "What's the risk of that index change?"
Senior: "The index build itself — CONCURRENTLY, so no table lock. The risk is index size: it adds another 600MB. We have headroom. I also want to add a query timeout at the application level as a circuit breaker while the index builds."
What diagnostic method does the senior engineer use, and what is the key distinction they draw to argue against sharding?
Root cause analysis before solution: EXPLAIN ANALYSE reveals sequential scan due to wrong index column order. Sharding diagnosis was wrong — the problem is query performance, not write throughput or dataset size.
Database architecture vocabulary:
"sharding" = horizontally partitioning a database across multiple machines — each shard holds a subset of the data
"write throughput" = the rate at which a database can process write operations (inserts, updates, deletes)
"sequential scan" = reading every row in a table to find matching rows — slow on large tables; happens when no suitable index exists
"composite index" = an index on multiple columns together; column order matters — the leading column is used for prefix-matching scans
"leading column" = the first column in a composite index; queries filtering by this column can use the index efficiently
"EXPLAIN ANALYSE" = a Postgres command that executes a query and shows how it was executed: scan types, row estimates, actual vs. estimated costs
"circuit breaker" = a pattern that stops sending requests to a failing/slow resource to prevent cascading failures; here used as a query timeout
"horizontal vs. vertical scaling": sharding = horizontal (more machines); better indexes = making current infrastructure work better
The engineering discipline demonstrated: Hypothesis-driven debugging — "I think the diagnosis is wrong" → run diagnostics → validate. This prevents expensive architectural changes based on misdiagnosis.
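A minimal Python sketch of the diagnostic path and fix the senior engineer describes: run EXPLAIN ANALYZE, rebuild the composite index with tenant_id as the leading column, and set a statement timeout as the application-level guard. The schema ("events", "tenant_id", "created_at"), the aggregation query, the tenant id, and the timeout value are illustrative assumptions, not details from the review.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")

# Hypothetical slow aggregation query, per-tenant.
SLOW_QUERY = """
    SELECT date_trunc('day', created_at) AS day, count(*)
    FROM events
    WHERE tenant_id = %s
    GROUP BY 1
"""

# Step 1: diagnose. EXPLAIN ANALYZE executes the query and reports the plan;
# a "Seq Scan on events" line is the tell-tale full table scan.
with conn.cursor() as cur:
    cur.execute("EXPLAIN ANALYZE " + SLOW_QUERY, (4815,))  # hypothetical tenant id
    for (line,) in cur.fetchall():
        print(line)

# Step 2: fix. Build the composite index with tenant_id as the leading column,
# CONCURRENTLY so reads and writes continue during the build.
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS events_tenant_created_idx "
        "ON events (tenant_id, created_at)"
    )
conn.autocommit = False

# Step 3: guard rail. A per-session statement timeout acts as the
# application-level "circuit breaker" mentioned in the review.
with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '500ms'")
```

The sketch mirrors the order of the argument: measure first, change the index second, and only then rely on a timeout to bound the blast radius while the change rolls out.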
3 / 3
📄 Transcript
[Architecture review meeting — discussing event-driven migration proposal for a monolith.]
Architect: "Before we decide, I want to make sure we're solving the right problem. The request was to introduce an event bus to decouple the order service from the inventory service. Let me test that framing.
The pain point is: when the inventory service is slow or unavailable, the order creation path blocks. That's real — we've had three incidents this quarter where inventory latency spikes caused order creation timeouts. But 'add an event bus' isn't the only solution to that pain point. Let me name the alternatives.
Option one: event bus — orders publish an event, inventory consumes it asynchronously. Decoupled. But now order status is eventually consistent — 'your order was placed' doesn't mean inventory was actually reserved when you got that confirmation. We need to handle the case where inventory later fails to reserve. That's compensation logic, saga patterns, and significant complexity.
Option two: circuit breaker on the inventory call. Orders fail fast if inventory is slow, with a fallback response. Simpler. But orders still fail when inventory is unavailable — just faster.
Option three: optimistic inventory reservation. Deduct inventory speculatively at order time, reconcile after. Works if over-commitment is acceptable — it isn't for us.
Option four: async write-through cache. Inventory service writes its state to a cache; the order service reads from the cache synchronously. The cache is invalidated asynchronously. This gives us the fast read path, tolerates inventory service latency, and keeps the order confirmation semantics synchronous.
My recommendation is option four with a TTL-based cache invalidation policy and a background reconciliation job. It solves the actual problem — latency spikes — without the consistency complexity of full event sourcing."
What does the architect's analytical approach reveal about how they evaluate architectural options? Why do they reject the event bus (option one) specifically?
Problem reframing → four options → evaluation against specific requirements → rejection of event bus because its consistency model adds complexity disproportionate to the actual problem (latency spikes, not fundamental decoupling need).
Distributed systems and event-driven architecture vocabulary:
"event bus" = a middleware component that accepts events from producers and distributes them to consumers; enables loose coupling between services
"eventual consistency" = a consistency model where, after a write, all replicas will eventually converge — but reads may temporarily see stale data
"compensation logic" = business logic that undoes or corrects a prior action when a subsequent step fails in a distributed transaction
"saga pattern" = a way to manage distributed transactions: a sequence of local transactions, each publishing an event; if any step fails, compensating transactions undo previous steps
"circuit breaker" = a pattern that stops calls to a failing service after a threshold of failures, returning a default response; prevents cascading failures
"optimistic reservation" = speculatively reserving a resource assuming success, then correcting later if it fails
"write-through cache" = a cache that is updated synchronously when the backing store is written, keeping cache and store consistent
"TTL (Time-To-Live)" = a cache expiry duration — after TTL, the cached value is invalidated and must be refreshed
The key insight: "Event bus" is a hammer — the architect checks whether the nail is actually a consistency problem or a latency problem. It's a latency problem. The cache solves it with far less complexity.