The naive RAG era is over
The original RAG pattern — embed documents, store in a vector database, retrieve the top-k chunks, stuff into context — worked well enough to generate a lot of demos. It works poorly in production.
The issues are well-documented at this point: semantic search retrieves superficially similar chunks that don't actually answer the question, large chunks dilute relevant content with irrelevant context, small chunks lose the surrounding context needed to understand them, and retrieved content isn't always used by the model even when it's relevant.
2026 RAG is a genuinely different thing from 2023 RAG.
Architecture decisions that matter
Chunking strategy
Fixed-size chunking by token count is the wrong default. Chunk by semantic unit: paragraphs, sections, or document-specific structures (headings, clauses). Overlap between chunks should preserve sentence boundaries, not arbitrary token counts.
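A minimal sketch of that idea, assuming a plain-text corpus where paragraphs are separated by blank lines (the function name and parameters are illustrative, not from any particular library): pack whole paragraphs into chunks, and when a chunk fills up, carry the last full sentence forward instead of overlapping by a raw token count.

```python
import re

def chunk_by_paragraph(text: str, max_chars: int = 1200,
                       overlap_sentences: int = 1) -> list[str]:
    """Split on blank lines, pack paragraphs into chunks, and overlap
    by whole sentences rather than an arbitrary token or char count."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            # carry the trailing sentence(s) forward so the next chunk
            # keeps the context needed to read its first paragraph
            sentences = re.split(r"(?<=[.!?])\s+", current.strip())
            current = " ".join(sentences[-overlap_sentences:]) + "\n\n"
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

The sentence-splitting regex is deliberately crude; in practice you would swap in a proper sentence segmenter, but the overlap logic stays the same.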
For complex documents, consider hierarchical chunking: index at multiple granularities and retrieve at multiple levels. Store document summaries separately from detail chunks. Use the summary to route, the detail to answer.
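The route-then-answer flow can be sketched in a few lines. The toy lexical scorer below is a stand-in for whatever embedding similarity you actually use, and the data layout (a dict of `summary` plus `chunks` per document) is an assumption for illustration:

```python
from collections import Counter

def overlap_score(a: str, b: str) -> float:
    # toy word-overlap scorer standing in for embedding similarity
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    return sum((wa & wb).values())

def hierarchical_retrieve(query: str, docs: dict,
                          top_docs: int = 2, top_chunks: int = 3):
    """docs: {doc_id: {"summary": str, "chunks": [str, ...]}}.
    Route with the summary, answer with the detail chunks."""
    routed = sorted(docs, key=lambda d: overlap_score(query, docs[d]["summary"]),
                    reverse=True)[:top_docs]
    candidates = [(d, c) for d in routed for c in docs[d]["chunks"]]
    candidates.sort(key=lambda dc: overlap_score(query, dc[1]), reverse=True)
    return candidates[:top_chunks]
```

The point of the two levels: the summary index stays small enough to scan broadly, while detail chunks are only scored within the documents the summaries selected.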
Embedding models
Don't use general-purpose embeddings for domain-specific retrieval. Fine-tuned embedding models consistently outperform off-the-shelf models on domain-specific corpora.
If you can't fine-tune, late interaction models (ColBERT variants) typically outperform bi-encoder models on complex queries. The latency cost is real but often worth it for accuracy-critical applications.
Hybrid search
Combine dense vector search with BM25 sparse retrieval. Neither alone is optimal: dense search is better at semantic similarity; sparse search is better at exact keyword matches. Reciprocal rank fusion is a simple and effective combination strategy.
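Reciprocal rank fusion needs only the rank positions from each retriever, not their (incomparable) raw scores. A self-contained sketch, using the standard formula with the conventional smoothing constant k = 60:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) across
    every list it appears in, then docs are re-sorted by fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, dense and BM25 results fuse cleanly even though their score scales have nothing to do with each other.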
The reranking layer
A retriever that returns 20 candidates, followed by a reranker that selects the top 5, consistently outperforms a retriever that returns the top 5 directly. The retriever optimizes for recall; the reranker optimizes for precision.
Cross-encoder rerankers (models that see both query and document together) outperform bi-encoder rerankers but are slower. For most production applications, running the cross-encoder over 20 candidates is fast enough to be worth the accuracy gain.
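The two-stage shape is simple to express. In the sketch below, `score_fn` stands in for a cross-encoder call (a real one scores the query and document jointly in a single forward pass); the toy lexical scorer in the usage note is only there to make the example runnable:

```python
def retrieve_then_rerank(query: str, retrieve_fn, score_fn,
                         n_candidates: int = 20, top_n: int = 5) -> list[str]:
    """Stage 1: cheap retriever casts a wide net (recall).
    Stage 2: expensive pairwise scorer keeps the best few (precision)."""
    candidates = retrieve_fn(query, n_candidates)
    return sorted(candidates, key=lambda doc: score_fn(query, doc),
                  reverse=True)[:top_n]
```

A toy usage: with `score_fn = lambda q, d: len(set(q.split()) & set(d.split()))`, the pipeline reorders whatever the first stage returned so the most query-relevant documents come first.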
Dealing with the generation side
Retrieved context isn't always used. Instruct the model explicitly to ground its answers in the retrieved content and to indicate when it's drawing on knowledge not in the provided documents.
Citation generation — where the model explicitly references which document it's drawing from — dramatically improves faithfulness in our experiments. It forces the model to localize its claim to specific retrieved content.
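A simple way to enable citation is to number the retrieved chunks in the prompt and instruct the model to cite by number. The prompt wording below is illustrative, not a tested template:

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Number each retrieved chunk so the model can cite [1], [2], ...
    and ask it to flag anything drawn from outside the documents."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, 1))
    return (
        "Answer using ONLY the documents below. After each claim, cite "
        "the supporting document number, e.g. [2]. If any part of your "
        "answer is not supported by the documents, say so explicitly.\n\n"
        f"Documents:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The numbered format also makes faithfulness checkable downstream: you can parse the citations out of the answer and verify each cited chunk actually supports the adjacent claim.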
Evaluation is non-negotiable
RAG systems need two evaluation layers: retrieval quality (are we getting the right documents?) and generation quality (is the model using them correctly?). Most teams only evaluate the latter and miss retrieval failures.
Build a test set of question-document pairs before you start building. Run RAGAs or LLM-as-judge evaluation on your actual production queries, not just a held-out development set.
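The retrieval layer of that test set reduces to a standard recall@k check: for each question, did any known-relevant document make the top k? A minimal sketch (the `retrieve_fn` callable and test-set shape are assumptions for illustration):

```python
def recall_at_k(test_set: list[tuple[str, set[str]]],
                retrieve_fn, k: int = 5) -> float:
    """test_set: [(question, {relevant_doc_ids}), ...].
    Returns the fraction of questions for which at least one
    relevant document appears in the top-k retrieved results."""
    hits = 0
    for question, relevant in test_set:
        retrieved = set(retrieve_fn(question)[:k])
        hits += bool(retrieved & relevant)
    return hits / len(test_set)
```

Running this alone, before any generation-side evaluation, tells you whether an answer failure is a retrieval failure, which is exactly the distinction teams that only score final answers cannot make.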