Building RAG systems that actually work in production

Most RAG systems fail at retrieval, not generation. The language model can only be as good as the documents you give it. Learn how chunking strategy, hybrid search, and proper evaluation determine whether your RAG system works in production or joins the 70% that fail.

The short version

  • Chunking strategy determines retrieval quality - How you break documents into pieces has more impact than which embedding model you choose, with semantic chunking outperforming fixed-size approaches by several percentage points in retrieval accuracy
  • Hybrid search beats pure semantic search - Combining keyword search with semantic search catches exact matches that embeddings miss, especially for technical terms and code
  • Measure retrieval before generation - Track precision and recall on retrieved documents separately from response quality to identify where systems actually break

Building a RAG system? Starting with the language model is the wrong place.

The model can only work with what you retrieve. I keep watching teams spend weeks tweaking prompts and generation parameters while their retrieval layer is actively broken. It’s genuinely frustrating. kapa.ai’s work with production teams at Docker, CircleCI, and Reddit found that data quality problems account for most production failures, not model limitations.

Your RAG system is a retrieval system first. Generation is secondary.

Why retrieval is where RAG breaks

The first mistake most builders make: focusing on the language model instead of document quality. Documents get chunked without thinking about semantic boundaries, embedded with whatever model is popular this week, then thrown in a vector database. The assumption is that similarity search will figure out the rest.

It doesn’t.

Up to 70% of RAG systems fail in production despite working fine in demos. The gap between demo and production? Your demo uses clean, carefully formatted test documents. Production data is messy. PDFs with tables. Legal documents with nested clauses. Code with inconsistent formatting. Support tickets written by people who clearly did not proofread.

Simple fixed-size chunking applied to this reality produces nonsensical pieces. A chunk that starts mid-sentence and ends mid-thought. A table split across three chunks with no context. Critical information completely separated from the question it was meant to answer.

The embedding model can’t rescue you from bad chunks. Garbage in, garbage out.

Chunking: the decision that shapes everything

Research on chunking strategies shows semantic chunking outperforming fixed-size approaches by several percentage points in retrieval accuracy. Page-level chunking performs even better, with the lowest variance across document types in published benchmarks.

Why doesn’t everyone use semantic chunking then? Because it takes more work, and most teams don’t realize retrieval quality is the bottleneck until production is already broken.

Start with roughly 250 tokens per chunk. That’s about 1000 characters. Not because this is optimal, but because it’s a sensible baseline for testing. Too small and you lose context. Too large and you retrieve irrelevant information alongside the parts you actually need.

More important than chunk size: semantic completeness. A chunk should contain a complete thought. Break technical documentation at section boundaries. Respect clause structure in legal documents. Keep questions and answers together in support tickets. I think most teams underestimate how much this first architectural choice constrains everything downstream.
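
A baseline that respects both constraints is a greedy paragraph-packing chunker: pack whole paragraphs up to the ~1000-character target so no chunk starts or ends mid-thought. A minimal sketch (the function name and the 1000-character default are illustrative, not from any particular library):

```python
def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of up to ~max_chars,
    so chunk boundaries always fall on paragraph boundaries."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would overflow.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than the limit still becomes its own chunk, which is usually the right call: splitting it mid-sentence would be worse.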

Teams building production systems learned this the hard way. They started with simple chunking, watched retrieval quality fall apart with real data, and spent months rebuilding from scratch with document-aware strategies.

The most sophisticated approach available: Anthropic’s Contextual Retrieval. For each chunk, an LLM prepends context explaining how that chunk relates to its parent document. This reduced retrieval failures by 35-67% in their testing, though it adds real compute cost since you’re running an LLM call per chunk during indexing. A lighter alternative is late chunking from Jina AI. The entire document gets embedded at the token level first, then segmented and pooled afterward. This improves retrieval accuracy by approximately 2-4% on documents with anaphoric references without requiring LLM calls per chunk.
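
At index time, Contextual Retrieval amounts to prepending a situating prefix to each chunk before embedding it. In Anthropic's approach the prefix comes from an LLM call per chunk; the sketch below substitutes a plain template so only the data flow is visible (all names here are illustrative, not Anthropic's API):

```python
def contextualize(chunk: str, doc_title: str, doc_summary: str) -> str:
    """Prepend situating context to a chunk before embedding.
    A real implementation would generate this prefix with an LLM call
    per chunk; this template is a stand-in."""
    return f"From '{doc_title}' ({doc_summary}):\n{chunk}"

def index_document(chunks: list[str], title: str, summary: str) -> list[str]:
    # Each contextualized chunk would then be embedded and stored.
    return [contextualize(c, title, summary) for c in chunks]
```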

Hybrid search and why pure semantic falls short

Pure semantic search misses exact matches. Ask about “BM25 algorithm” and semantic search might return documents about ranking methods without mentioning BM25 by name. Ask for a specific error code and you get general troubleshooting instead of the precise answer you need.

Hybrid search combines semantic and keyword approaches. BM25 handles keyword matching. Vector similarity handles conceptual relationships. Together they catch both: semantically similar content through embeddings, exact terminology through keywords.

This matters most for technical content. Code snippets. Error messages. Product names. Acronyms. All the cases where exact matching beats semantic similarity.

Implementation is straightforward. Run both searches in parallel, then combine results using reciprocal rank fusion or score normalization. Research consistently shows hybrid approaches achieving better precision and recall than either method alone. Hybrid search and reranking are becoming defaults in production RAG systems, particularly for policy and legal corpora.
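
The fusion step itself is only a few lines. Reciprocal rank fusion scores each document as the sum of 1/(k + rank) across the ranked lists it appears in, so documents ranked well by both searches rise to the top. A sketch (k = 60 is the conventional constant from the original RRF paper; function names are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids: each doc scores the sum of
    1/(k + rank) over every list it appears in, then sort by score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Pass it the BM25 ranking and the vector-similarity ranking and it returns one fused list, no score normalization needed.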

Most vector databases support this natively now. Weaviate’s hybrid search scored higher on both page and paragraph-level retrieval in testing. Pinecone offers it through sparse vectors, and their Dedicated Read Nodes launched in December 2025 sustain 600+ QPS across 135 million vectors. Even PostgreSQL with pgvector 0.8.0 delivers up to 9x faster query processing and up to 100x more relevant results. Teams report 30-40% improvement in retrieval quality just by adding keyword search to a semantic pipeline. Not a subtle difference.

Add a reranker after initial retrieval. Reranking reorders results so the most relevant information appears first before being passed to the LLM. Without a reranker, cosine similarity rewards proximity, not usefulness. Cross-encoder models or late interaction approaches like ColBERT provide the most accurate reranking, though at higher compute cost. Milvus 2.6 now includes built-in Boost Ranker and Decay Ranker functions for combining semantic similarity with contextual relevance.
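
Structurally, a reranker is just a query-aware re-sort of the candidate list. The sketch below uses a crude term-overlap scorer as a stand-in for a real cross-encoder score (all names are illustrative; in production you would plug in a model's relevance score as `score_fn`):

```python
def rerank(query: str, docs: list[str], score_fn) -> list[str]:
    """Reorder candidates so the highest-scoring documents come first."""
    return sorted(docs, key=lambda d: score_fn(query, d), reverse=True)

def overlap_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder: fraction of query terms present in doc.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)
```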

Measuring what’s actually working

The hardest part of building a RAG system isn’t the code. It’s knowing whether it works. RAG evaluation is tricky because you’re measuring two distinct things: retrieval quality and generation quality.

Separate them.

For retrieval: precision and recall. Of the chunks you retrieved, what percentage were actually relevant? Of all relevant chunks in your corpus, what percentage did you find? Teams using RAGAS and similar frameworks track these metrics continuously. RAGAS pioneered reference-free evaluation and remains the most-cited framework for RAG assessment. It measures faithfulness (is the output grounded in retrieved documents?), answer relevance (does it address the query?), context precision, and context recall. You can pinpoint exactly where your pipeline breaks.

Start with a curated test set. Take 50-100 real queries. Manually identify which documents should be retrieved for each. That’s your ground truth.

Track metrics that matter:

  • Precision at k (are my top 5 results relevant?)
  • Recall (did I find all relevant documents?)
  • Mean reciprocal rank (where does the first relevant result appear?)
  • NDCG (are more relevant results ranked higher?)
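
Given the ground-truth document set per query, the first three metrics are a few lines each. A minimal sketch (function names are illustrative):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(d in relevant for d in retrieved[:k]) / k

def recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of all relevant documents that were retrieved."""
    return sum(d in relevant for d in retrieved) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(all_retrieved)
```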

Don’t just measure final responses. A RAG system can give good answers despite bad retrieval if it gets lucky, or give bad answers despite perfect retrieval if generation fails. You need to know which component is breaking and why.

For generation quality, track faithfulness and answer relevance. Faithfulness: is the answer actually supported by the retrieved documents? Answer relevance: does it address the question being asked?

Building for production from day one

Here's what production-ready architecture actually looks like:

Delta processing for document updates. Don't re-embed your entire document collection when one page changes. Build a git-diff-style pipeline that only processes what changed. This saves compute, reduces latency, and prevents version drift.
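
One simple way to implement delta processing is content hashing: store a hash alongside each chunk's embedding, and at update time re-embed only the chunks whose hash changed or that are new. A sketch under that assumption (names are illustrative):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_reembed(index: dict[str, str], chunks: dict[str, str]) -> list[str]:
    """Given the stored chunk_id -> hash index and the current
    chunk_id -> text mapping, return ids that need re-embedding."""
    return [cid for cid, text in chunks.items()
            if index.get(cid) != content_hash(text)]
```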

Monitoring and alerting. Track retrieval latency, embedding generation time, and database query performance. Set alerts for sudden drops in precision or spikes in retrieval time. Production systems need observability to catch degradation before users notice. Too many teams rely on manual spot-checks and one-off experiments, which leads to slow iteration cycles and mysterious production failures.

Fallback strategies for missing information. RAG systems break when asked about topics not in the knowledge base. Detect low-confidence retrievals and handle them explicitly rather than letting the model hallucinate.
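
A minimal version of that fallback is a similarity threshold on the best retrieved chunk: if even the top result scores below it, answer "not found" instead of generating. A sketch (the 0.35 default is an arbitrary placeholder to tune against your own data):

```python
def should_abstain(similarities: list[float], threshold: float = 0.35) -> bool:
    """Abstain when even the best retrieved chunk falls below the
    similarity threshold -- better to say 'not found' than guess."""
    return not similarities or max(similarities) < threshold
```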

Security at the retrieval layer. OWASP added Vector and Embedding Weaknesses as LLM08 in their 2025 Top 10 LLM risks, covering embedding inversion, adversarial embeddings, and cross-context leakage. Research on PoisonedRAG demonstrated that a small number of crafted documents can reliably manipulate AI responses through indirect prompt injection. Enforce access control at the embeddings retrieval layer and verify document provenance.

Cost optimization matters at scale. Embedding models vary significantly in cost and latency. OpenAI’s text-embedding-3 family remains widely used. Open source alternatives like BGE-M3 or E5-Mistral offer comparable performance at lower cost. Voyage AI’s v4 series now leads benchmarks, outperforming competitors by 8-14%.

Vector database choice shapes both performance and cost. Pinecone handles billions of vectors with sub-50ms latency. Weaviate offers flexibility and hybrid search with both managed and open source options. Qdrant excels at complex filtering with strong multitenancy and is now SOC 2 Type II certified. Chroma works for prototypes and smaller deployments. Milvus 2.6 and Zilliz Cloud lead for enterprise scale, with tiered storage delivering 87% storage cost reduction by automatically moving data between hot, warm, and cold tiers. Benchmark with your actual data before committing to any of these.

The teams that succeed with RAG aren’t the ones with the fanciest models. They’re the ones who treated retrieval as the genuinely hard problem it is, and built accordingly.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience and as the founder of Tallyfy (raised $3.6m), he helps mid-size companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.