Most retrieval-augmented generation pipelines fail silently. They return plausible-sounding answers grounded in the wrong documents, and nobody notices until it matters.
After building several production retrieval systems, I've settled on a few patterns that actually hold up.
## The naive approach breaks fast
The standard tutorial pipeline — chunk documents, embed them, retrieve top-k, stuff the results into a prompt — works for demos (a minimal version is sketched after the list below). In production it fails for three reasons:
- Chunk boundaries destroy context. A 512-token chunk that starts mid-paragraph loses the referent of “it” or “this approach.”
- Semantic similarity ≠ relevance. The most similar embedding is often a paraphrase of the question, not the answer.
- Top-k is arbitrary. Sometimes you need one document. Sometimes twelve. A fixed k guarantees you’re wrong.
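For concreteness, here is roughly what that naive pipeline looks like. This is a sketch under assumptions, not a reference implementation: `embed` is a placeholder for whatever embedding model you use, `chunk` splits on a fixed word count, and the fixed `k` in `top_k` is exactly the arbitrariness called out above.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in a real embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def chunk(doc: str, size: int = 512) -> list[str]:
    """Fixed-size chunks, oblivious to paragraph boundaries."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity to the query; keep a fixed k."""
    q = embed(query)

    def cos(c: str) -> float:
        v = embed(c)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    return sorted(chunks, key=cos, reverse=True)[:k]
```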
## What works instead

### Hierarchical retrieval
Index at multiple granularities — document-level summaries, section-level chunks, and sentence-level fragments. Route the query to the right level first, then drill down.
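A sketch of the drill-down, reusing the placeholder `embed` from the earlier snippet; the `Doc` structure and the `n_docs`/`n_sections` cutoffs are illustrative assumptions, not recommendations:

```python
import numpy as np
from dataclasses import dataclass

# Assumes the placeholder embed() from the naive-pipeline sketch above.

@dataclass
class Doc:
    summary: str          # document-level summary
    sections: list[str]   # section-level chunks

def rank(q: np.ndarray, texts: list[str], keep: int) -> list[str]:
    """Keep the texts whose embeddings are closest to the query vector."""
    def cos(t: str) -> float:
        v = embed(t)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(texts, key=cos, reverse=True)[:keep]

def hierarchical_retrieve(query: str, docs: list[Doc],
                          n_docs: int = 3, n_sections: int = 5) -> list[str]:
    q = embed(query)
    # Level 1: route to the most promising documents via their summaries.
    by_summary = {d.summary: d for d in docs}
    hits = [by_summary[s] for s in rank(q, list(by_summary), n_docs)]
    # Level 2: drill into the sections of just those documents.
    sections = rank(q, [s for d in hits for s in d.sections], n_sections)
    # Level 3: return the best sentence-level fragments.
    fragments = [f for sec in sections for f in sec.split(". ")]
    return rank(q, fragments, keep=n_sections)
```

Routing through summaries first keeps the expensive fine-grained comparisons confined to a handful of candidate documents.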
### Query decomposition
Before retrieving, break complex questions into sub-queries. “How does our pricing compare to competitors in the European market?” becomes three separate retrievals: one for our own pricing, one for competitors’ pricing, and one for the European market context.
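A minimal decomposition step might look like this, assuming only that `llm` is some prompt-in, text-out callable wrapping your model of choice (the prompt wording is illustrative):

```python
from typing import Callable

DECOMPOSE_PROMPT = """\
Break the question below into the minimal set of standalone search queries
needed to answer it. Output one query per line, with no numbering.

Question: {question}
"""

def decompose(question: str, llm: Callable[[str], str]) -> list[str]:
    """Turn one complex question into standalone sub-queries."""
    raw = llm(DECOMPOSE_PROMPT.format(question=question))
    return [line.strip() for line in raw.splitlines() if line.strip()]
```

Each sub-query then goes through retrieval independently, and the merged results feed the synthesis step.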
### Retrieval with verification
After retrieval, run a lightweight check: does this chunk actually answer the question? A small classifier or even a prompted LLM call can filter out the 30% of results that are similar-but-irrelevant.
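Continuing with the same hypothetical `llm` callable, the verification pass can be a yes/no prompt per candidate chunk; a small trained classifier could fill the same role at lower cost and latency:

```python
from typing import Callable

VERIFY_PROMPT = """\
Question: {question}

Passage: {chunk}

Does the passage contain information that directly answers the question?
Answer with exactly one word: yes or no.
"""

def verify(question: str, chunks: list[str],
           llm: Callable[[str], str]) -> list[str]:
    """Drop chunks that are similar to the question but don't answer it."""
    kept = []
    for chunk in chunks:
        answer = llm(VERIFY_PROMPT.format(question=question, chunk=chunk))
        if answer.strip().lower().startswith("yes"):
            kept.append(chunk)
    return kept
```

Forcing a one-word answer keeps the check cheap to run and trivial to parse.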
## The meta-lesson
The best retrieval systems look less like search engines and more like research assistants. They decompose, verify, and synthesize rather than fetch and hope.
Build the verification step first. Everything else is optimization.