Large language models generate fluent text with unsettling conviction, but their internal knowledge is static, stale, and often wrong. The standard fix, fine-tuning a model on proprietary data, has given way to a more modular strategy: Retrieval-Augmented Generation (RAG). Instead of cramming facts into model weights, RAG fetches relevant documents from an external index at query time and feeds them to the LLM as context. In production systems deployed by companies like Cohere, Anthropic, and Glean, RAG has reduced factual error rates by 40 to 60% compared to zero-shot baselines on domain-specific question-answering tasks. This article dissects the RAG pipeline end to end: where it works reliably, where it breaks silently, and what engineering choices separate a useful system from a liability.
A RAG pipeline is only as effective as its weakest stage. The full flow breaks down into ingestion, retrieval, fusion, and generation. Each stage introduces design decisions that compound downstream.
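As a mental model, the flow can be written as four functions composed in sequence. The names and signatures below are illustrative rather than taken from any particular framework, and the bodies are deliberately left as stubs; each stage is expanded in the sections that follow.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float = 0.0

def ingest(documents: list[str]) -> list[Chunk]:
    """Split documents into chunks and index them (embeddings plus a keyword index)."""
    ...

def retrieve(query: str, k: int = 10) -> list[Chunk]:
    """Return the top-k candidate chunks from the sparse and dense indexes."""
    ...

def fuse(candidates: list[Chunk]) -> list[Chunk]:
    """Merge and re-rank candidates coming from the different retrievers."""
    ...

def generate(query: str, context: list[Chunk]) -> str:
    """Call the LLM with the query plus the fused context."""
    ...

def answer(query: str) -> str:
    # The whole pipeline is just this composition; every design decision
    # discussed below lives inside one of the four stubs above.
    return generate(query, fuse(retrieve(query)))
```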
Documents must be split into chunks before indexing. Naive sentence splitting loses context; a common baseline is 512-token chunks with 50-token overlap across sentence boundaries, but domain-specific text demands custom heuristics. Legal contracts, for instance, often require clause-level chunking, while academic papers benefit from section-level segments. The embedding model used, whether text-embedding-3-small from OpenAI, gte-Qwen2-7B-instruct, or proprietary alternatives, determines semantic retrieval accuracy. In benchmark tests on the MTEB leaderboard, the top models show less than 2% performance spread on standard retrieval tasks, but that gap widens to over 15% on domain-specific corpora like clinical notes or financial filings.
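For the 512/50 baseline, the chunking logic itself is short. Here is a minimal sketch that uses whitespace tokens as a stand-in; in practice you would count tokens with the tokenizer of your embedding model.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping token windows (whitespace tokens as a proxy).

    Swap the whitespace split for the embedding model's tokenizer
    (e.g. tiktoken) to get token-accurate boundaries.
    """
    tokens = text.split()
    chunks: list[str] = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the end of the document
    return chunks
```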
Sparse retrieval (BM25) matches exact keywords and still outperforms dense retrieval on short queries with proper nouns. Dense retrieval captures semantic similarity but can miss rare terms. Production RAG systems almost always use hybrid search, combining BM25 and dense vectors with reciprocal rank fusion. In an evaluation by the Apache Lucene committers, hybrid retrieval improved recall at 10 (R@10) by 12 to 18% over either method alone on the MS MARCO passage ranking dataset. However, hybrid retrieval doubles index size and latency. The trade-off is worth it when query variability is high, but for narrow, controlled vocabularies (e.g., internal API documentation), pure dense retrieval is sufficient and simpler to maintain.
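Reciprocal rank fusion is compact enough to show in full. The sketch below assumes each retriever returns an ordered list of document IDs; the constant k = 60 is the common default from the original RRF formulation, not a value tied to any of the systems mentioned above.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs into one list ordered by RRF score.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k dampens the advantage of very high ranks in any single list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a BM25 ranking and a dense-retrieval ranking for one query.
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],  # BM25 order
    ["doc1", "doc9", "doc3"],  # dense order
])
```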
Even with perfect retrieval, the LLM sees only discrete chunks. Information split across two chunks (a question whose answer begins in one chunk and concludes in another) causes the model to fabricate or omit details. Research from the 2024 RAG Benchmark Workshop at ACL showed that chunk boundary errors account for 22% of incorrect answers in standard RAG pipelines. Mitigation strategies include overlapping chunks, re-encoding retrieved documents with a larger context window, or using a multi-pass approach: first retrieve candidate chunks, then re-rank them after verifying they are not truncated. Some teams encode sentence-level boundaries into the chunk metadata so the generation stage can request adjacent context if the retrieved chunk starts or ends mid-sentence.
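One way to implement the adjacent-context idea is to store each chunk's position and truncation flags as metadata at ingestion time, then expand at query time. A minimal sketch with hypothetical field names; the flags would be set during chunking from sentence boundaries.

```python
from dataclasses import dataclass

@dataclass
class IndexedChunk:
    doc_id: str
    position: int              # ordinal position of the chunk within its document
    text: str
    starts_mid_sentence: bool  # set at ingestion time from sentence boundaries
    ends_mid_sentence: bool

def expand_if_truncated(chunk: IndexedChunk,
                        store: dict[tuple[str, int], IndexedChunk]) -> str:
    """Prepend or append neighbouring chunks when the retrieved chunk is cut mid-sentence."""
    parts = [chunk.text]
    if chunk.starts_mid_sentence:
        prev = store.get((chunk.doc_id, chunk.position - 1))
        if prev:
            parts.insert(0, prev.text)
    if chunk.ends_mid_sentence:
        nxt = store.get((chunk.doc_id, chunk.position + 1))
        if nxt:
            parts.append(nxt.text)
    return " ".join(parts)
```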
RAG reduces but does not eliminate hallucination. Two distinct failure modes emerge: retrieval-grounded hallucinations, where the LLM misreads or over-extrapolates from the retrieved text, and retrieval-failure hallucinations, where the model finds no relevant context and falls back on parametric knowledge. In a production medical Q&A system described by researchers at NYU Langone, retrieval-failure hallucinations occurred in 8% of queries even when the pipeline was tuned for high recall. The fix was not better retrieval but a rejection layer: if the top retrieved chunk had a similarity score below 0.65, the system returned a "citation not found" response instead of generating an answer. This kept answer accuracy above 95% while reducing false-positive answers to zero. The engineering lesson is that high generation fidelity requires an explicit out-of-domain detector, not merely reliance on the retrieval stage.
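The rejection layer amounts to a guard clause in front of generation. A sketch, assuming similarity scores normalized to [0, 1]; the 0.65 threshold is the value reported above and would need recalibration for a different embedding model, and the abstention wording is illustrative.

```python
from typing import Callable

REJECTION_THRESHOLD = 0.65  # tuned per embedding model; 0.65 is the value cited above

def answer_or_abstain(query: str,
                      retrieved: list[tuple[str, float]],  # (chunk_text, similarity)
                      generate: Callable[[str, str], str]) -> str:
    """Generate only when the best retrieved chunk clears the similarity bar."""
    if not retrieved or max(score for _, score in retrieved) < REJECTION_THRESHOLD:
        # Explicit abstention instead of letting the model fall back on
        # parametric knowledge.
        return "Citation not found: no supporting document was retrieved for this question."
    context = "\n\n".join(text for text, _ in retrieved)
    return generate(query, context)
```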
Modern LLMs accept 128K-token contexts, but they do not use them equally. Research from Liu et al. (2023), confirmed by internal studies at Gradient.ai (2024), shows that models perform best when relevant information appears at the beginning or end of the context window. Mid-context information is disproportionately ignored. For RAG pipelines retrieving 10 chunks of 500 tokens each, the total is only 5,000 tokens, well within most models' effective range, but if chunks are retrieved and concatenated arbitrarily, the critical piece may land in the low-attention middle region. Best practice is to order retrieved chunks by descending relevance, place the most relevant chunk at the start of the context, and optionally repeat the most relevant chunk at the end. Some teams also insert a delimiter like "Here are the documents most relevant to the query:" at position 0, which signals the model to attend to the subsequent text.
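Putting that ordering advice into code is straightforward. The sketch below assumes the chunks arrive already sorted by relevance and simply assembles the prompt context; the header string is the delimiter mentioned above.

```python
def build_context(chunks_by_relevance: list[str], repeat_top: bool = True) -> str:
    """Order retrieved chunks to avoid the low-attention middle of the window.

    The most relevant chunk goes first; optionally it is repeated at the end
    so the model sees it in both high-attention regions.
    """
    header = "Here are the documents most relevant to the query:"
    ordered = list(chunks_by_relevance)  # already sorted, most relevant first
    if repeat_top and len(ordered) > 1:
        ordered.append(ordered[0])
    return "\n\n".join([header, *ordered])
```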
Standard accuracy metrics on a held-out set are insufficient for RAG systems because they conflate retrieval misses with generation errors. A more diagnostic approach separates evaluation into four axes: retrieval recall (did the right chunks come back), context sufficiency (do the retrieved chunks contain enough to answer), faithfulness (is the answer grounded in the retrieved text), and answer correctness.
The strongest predictor of user satisfaction is faithfulness, not raw answer accuracy. Users are more forgiving of a partially complete answer that cites the correct source than of a generic answer that is simply wrong.
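A minimal harness along these axes can be instrumented from day one. The sketch below computes retrieval recall against human-labeled gold chunks and uses a crude token-overlap proxy for faithfulness; a production system would replace the proxy with an LLM judge or an NLI model, and the field names here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class EvalExample:
    query: str
    gold_chunk_ids: set[str]        # chunks a human marked as containing the answer
    retrieved_chunk_ids: list[str]  # what the pipeline actually returned, in rank order
    retrieved_text: str             # concatenated retrieved context
    answer: str                     # generated answer

def retrieval_recall(ex: EvalExample, k: int = 10) -> float:
    """Fraction of gold chunks that appear in the top-k retrieved chunks."""
    hits = ex.gold_chunk_ids & set(ex.retrieved_chunk_ids[:k])
    return len(hits) / max(len(ex.gold_chunk_ids), 1)

def faithfulness_proxy(ex: EvalExample) -> float:
    """Fraction of answer tokens that also appear in the retrieved text (crude proxy)."""
    answer_tokens = set(ex.answer.lower().split())
    context_tokens = set(ex.retrieved_text.lower().split())
    return len(answer_tokens & context_tokens) / max(len(answer_tokens), 1)
```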
End-to-end RAG latency is dominated by LLM generation, not retrieval. A single generation pass with a 7B-parameter model takes 200 to 400 ms on an A100. Adding retrieval (embedding + vector search) adds 50 to 120 ms. However, if the pipeline employs a re-ranking pass or multiple retrieval rounds, latency can exceed 1.5 seconds. For interactive Q&A, sub-second total latency is expected. One pragmatic approach is to run retrieval in parallel with an agentic prefill that guesses the likely answer, then interleave the results. Companies like Vectara have reported a 30% latency reduction by caching embeddings for repeated queries and by precomputing chunk-level inverse document frequency (IDF) weights to accelerate BM25 scoring. The trade-off is increased memory usage for the cache, which on a 100 GB index adds about 8 GB of overhead.
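Caching query embeddings is the easiest of those wins to implement. Here is a sketch using an in-process LRU cache; a production deployment would more likely use Redis or another shared store, and the embedding function is passed in rather than assumed to be any particular API.

```python
from functools import lru_cache
from typing import Callable

def make_cached_embedder(embed: Callable[[str], tuple[float, ...]],
                         maxsize: int = 100_000) -> Callable[[str], tuple[float, ...]]:
    """Wrap any embedding function with an in-process LRU cache.

    Queries are normalized before lookup so trivially different strings
    ("What is X?" vs "what is x?") share a cache entry.
    """
    @lru_cache(maxsize=maxsize)
    def cached(normalized_query: str) -> tuple[float, ...]:
        return embed(normalized_query)

    def lookup(query: str) -> tuple[float, ...]:
        return cached(query.strip().lower())

    return lookup

# Usage: cached_embed = make_cached_embedder(your_embedding_fn)
```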
A large e-commerce platform implemented RAG to answer customer questions about product specifications. The system returned correct answers for 91% of queries, but the remaining 9% included egregious errors: it once described a blender as having a "built-in camera" because the retrieved chunk mentioned "smart compatibility with camera modules." The root cause was a chunk that contained a bullet list of optional accessories without a clear delimiter. The fix was to sanitize chunk boundaries so that list items are always grouped with their parent header, and to require the generation stage to cite the source chunk ID in its output for manual auditing.
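The sanitization step can be as simple as keeping list items attached to the non-list line that precedes them during chunk preparation. A sketch, assuming plain-text or markdown-like source where list items start with "-" or "*"; real product feeds would need rules matched to their own formatting.

```python
def group_list_items_with_header(lines: list[str]) -> list[str]:
    """Merge each run of list items into the non-list line that precedes it,
    so a bullet list of optional accessories is never split from its header."""
    blocks: list[str] = []
    for line in lines:
        is_item = line.lstrip().startswith(("-", "*"))
        if is_item and blocks:
            blocks[-1] = blocks[-1] + "\n" + line  # keep the bullet with its header
        else:
            blocks.append(line)
    return blocks
```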
A legal tech startup built a RAG system to retrieve clauses from thousands of NDAs. The system achieved high recall, but clause misinterpretation led to a contractual error: the model described an indemnity clause as one-sided when the retrieved clause was actually mutual. The LLM injected its own bias based on common legal patterns. The resolution was to integrate a domain-specialized re-ranker that scored clauses on symmetry, and to disable generation entirely for binary classification questions, instead returning the raw clause text for human review.
RAG shifts cost from training to inference. A typical pipeline with one retrieval pass and generation of 300 tokens consumes approximately 2,000 prompt tokens per query plus 300 completion tokens. At current OpenAI API pricing for gpt-4o-mini ($0.15/1M input tokens, $0.60/1M output tokens), a single query costs roughly $0.0005. At scale (10 million queries/month), that is about $5,000 per month in generation costs alone. Retrieval costs depend on the vector database: self-hosted Weaviate or Qdrant on bare metal costs about $200 to $500 per month for a 100 GB index plus query compute. Managed solutions like Pinecone charge $0.07 per million vectors per month. The breakdown shows that for most teams, generation cost is the primary lever. Techniques like caching frequent queries, reducing response length, and using smaller models (e.g., Mistral 7B instead of a 70B model) for generation can cut costs by 60 to 80% without degrading quality in controlled domains.
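The arithmetic is simple enough to keep as a small script so the figures can be re-checked whenever prices or token budgets change; the constants below are the ones quoted above.

```python
# Per-query cost model for the generation stage (prices in $ per 1M tokens).
PRICE_IN, PRICE_OUT = 0.15, 0.60            # gpt-4o-mini pricing quoted above
PROMPT_TOKENS, COMPLETION_TOKENS = 2_000, 300
QUERIES_PER_MONTH = 10_000_000

cost_per_query = (PROMPT_TOKENS * PRICE_IN + COMPLETION_TOKENS * PRICE_OUT) / 1_000_000
monthly_generation_cost = cost_per_query * QUERIES_PER_MONTH

print(f"${cost_per_query:.4f} per query")            # ≈ $0.0005
print(f"${monthly_generation_cost:,.0f} per month")  # ≈ $4,800; the ~$5,000 figure above
                                                      # comes from rounding per-query cost first
```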
RAG is not a silver bullet. It requires careful investment in chunking, retrieval fusion, faithfulness monitoring, and rejection logic. But for any application where factual grounding is non-negotiable (legal, medical, financial, or compliance), RAG offers a practical path to deploy generative AI that is auditable and correctable. Start by running a two-week experiment on a single domain with fewer than 1,000 documents. Instrument faithfulness and context sufficiency metrics from day one, and resist the temptation to deploy without an abstention mechanism. The cost of a confident wrong answer is almost always higher than a cautious refusal.