CEO of Tallyfy · AI advisor at Blue Sheen for mid-size companies

Building RAG systems that actually work in production

Most RAG systems fail at retrieval, not generation. Research from Anthropic and kapa.ai confirms the retrieval layer matters most. Chunking strategy, hybrid search, and proper evaluation determine whether your RAG system works in production or ends up among the failures, which by some estimates reach 70%.

The short version

  • Chunking strategy determines retrieval quality - How you break documents into pieces has more impact than which embedding model you choose, with semantic chunking outperforming fixed-size approaches by several percentage points in retrieval accuracy
  • Hybrid search beats pure semantic search - Combining keyword search with semantic search catches exact matches that embeddings miss, especially for technical terms and code
  • Measure retrieval before generation - Track precision and recall on retrieved documents separately from response quality to identify where systems actually break

Building a RAG system? The language model is the wrong place to start.

The model can only work with what you retrieve. I keep watching teams spend weeks tweaking prompts and generation parameters while their retrieval layer is actively broken. It’s frustrating. kapa.ai’s work with production teams at Docker, CircleCI, and Reddit found that data quality problems account for most production failures, not model limitations.

Your RAG system is a retrieval system first. Generation is secondary. Understanding the broader LLMOps discipline helps you plan for what comes after the first demo.

Mid-2026 update: context windows grew enough to change when you need RAG at all. Anthropic’s current models, Sonnet 4.6 up through Fable 5, all carry 1M-token context windows with no pricing premium beyond 200k tokens. For a small, stable corpus you can now skip retrieval and load the whole thing into context. The argument below survives at production scale: you pay for every token you send, access control lives at the retrieval layer, and handing the model the right 2,000 tokens beats making it wade through a million.

Why retrieval is where RAG breaks

The first mistake most builders make: focusing on the language model instead of document quality. Documents get chunked without thinking about semantic boundaries, embedded with whatever model is popular this week, then thrown in a vector database. The assumption is that similarity search will figure out the rest.

Turns out, it doesn’t.

[Figure: RAG pipeline from document ingestion through chunking, embedding, hybrid search, reranking, to LLM generation]

Up to 70% of RAG systems fail in production despite working fine in demos. Which is nuts, frankly. The gap between demo and production? Your demo uses clean, carefully formatted test documents. Production data is messy. PDFs with tables. Legal documents with nested clauses. Code with inconsistent formatting. Support tickets written by people who clearly did not proofread.

Simple fixed-size chunking applied to this reality produces nonsensical pieces. A chunk that starts mid-sentence and ends mid-thought. A table split across three chunks with no context. Critical information separated from the question it was meant to answer.

The embedding model can’t rescue you from bad chunks. Rubbish in, rubbish out.

Chunking: the decision that shapes everything

Research on chunking strategies shows semantic chunking can outperform fixed-size approaches by several percentage points in retrieval accuracy, though the gap varies by document type. Chunking that respects document structure, breaking at section and page boundaries, holds up better across varied content than splitting blindly.

Why doesn’t everyone use semantic chunking then? Because it takes more work, and most teams don’t realize retrieval quality is the bottleneck until production is already broken.

Start with roughly 250 tokens per chunk, about 1,000 characters. Not because this is optimal, but because it gives you a baseline to test against. Treat it as an educated guess rather than a recommendation. Too small and you lose context. Too large and you retrieve irrelevant information alongside the parts you actually need.

More important than chunk size: semantic completeness. A chunk should contain a complete thought. Break technical documentation at section boundaries. Respect clause structure in legal documents. Keep questions and answers together in support tickets. I think most teams underestimate how much this first architectural choice constrains everything downstream.
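
A minimal sketch of chunking that respects boundaries, assuming plain text where blank lines separate paragraphs. The function name `chunk_by_paragraph` and the 1,000-character default are illustrative choices, not a recommendation. Real documents (PDF tables, legal clauses, Q&A tickets) need boundary rules of their own.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than
    split mid-thought; handle such outliers separately if they matter.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        # +2 accounts for the blank line used to rejoin paragraphs.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The point of the sketch is the invariant, not the size: no chunk ever starts or ends in the middle of a paragraph.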

Teams building production systems learned this the hard way. They started with simple chunking, watched retrieval quality fall apart with real data, and spent months rebuilding from scratch with document-aware strategies.

The most advanced approach available: Anthropic’s Contextual Retrieval. For each chunk, an LLM prepends context explaining how that chunk relates to its parent document. In their testing this reduced retrieval failures by 35% on its own, rising to 67% once combined with BM25 and reranking, though it adds real compute cost since you’re running an LLM call per chunk during indexing. A lighter alternative is late chunking from Jina AI. The entire document gets embedded at the token level first, then segmented and pooled afterward. This preserves cross-chunk context on documents where meaning depends on earlier sentences, without requiring an LLM call per chunk.
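
To make the Contextual Retrieval idea concrete, here is a rough sketch of the indexing step. It is not Anthropic’s implementation. The `describe` callable is a hypothetical wrapper around whatever LLM client you use; it takes the full document and one chunk and returns a short context string.

```python
from typing import Callable


def contextualize_chunks(
    document: str,
    chunks: list[str],
    describe: Callable[[str, str], str],
) -> list[str]:
    """Prepend an LLM-written context line to each chunk before indexing.

    `describe(document, chunk)` is a hypothetical callable; in practice it
    would make one LLM call per chunk, which is where the cost comes from.
    """
    contextualized = []
    for chunk in chunks:
        context = describe(document, chunk).strip()
        # Fall back to the raw chunk if the model returns nothing useful.
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized
```

You would then embed and keyword-index the contextualized text instead of the raw chunk.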

Hybrid search and why pure semantic falls short

Pure semantic search misses exact matches. Can you fix this with a better embedding model? No. Ask about “BM25 algorithm” and semantic search might return documents about ranking methods without mentioning BM25 by name. Ask for a specific error code and you get general troubleshooting instead of the precise answer you need.

Hybrid search combines semantic and keyword approaches. Stephen Robertson’s BM25 handles keyword matching. Vector similarity handles conceptual relationships. Together they catch both: semantically similar content through embeddings, exact terminology through keywords.

This matters most for technical content. Code snippets. Error messages. Product names. Acronyms. All the cases where exact matching beats semantic similarity.

Implementation is straightforward. Run both searches in parallel, then combine results using reciprocal rank fusion or score normalization. Research consistently shows hybrid approaches achieving better precision and recall than either method alone. Hybrid search and reranking are becoming defaults in production RAG systems, especially for policy and legal corpora.
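
Reciprocal rank fusion is short enough to sketch in full. This version assumes each search returns an ordered list of document IDs, best first. The constant `k = 60` is the value commonly used with RRF, not something tuned for your data.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in,
    and the scores are summed across lists.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]   # e.g. from BM25
vector_hits = ["b", "c", "d"]    # e.g. from embedding similarity
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because RRF uses only ranks, you avoid having to normalize BM25 scores against cosine similarities, which live on different scales.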

Most vector databases support this natively now. Weaviate builds hybrid search into its core engine. Pinecone offers it through sparse vectors, and their Dedicated Read Nodes launched in December 2025 sustain 600 QPS across 135 million vectors. Even PostgreSQL with pgvector 0.8.0 delivers up to 9x faster query processing and up to 100x more relevant results. Teams report 20-40% improvement in retrieval quality just by adding keyword search to a semantic pipeline. Not a subtle difference.

Add a reranker after initial retrieval. Reranking reorders results so the most relevant information appears first before being passed to the LLM. Without a reranker, cosine similarity rewards proximity, not usefulness. Cross-encoder models or late interaction approaches like Omar Khattab’s ColBERT provide the most accurate reranking, though at higher compute cost. Milvus 2.6 now includes built-in Boost Ranker and Decay Ranker functions for combining semantic similarity with contextual relevance.
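
The reranking step itself is simple once you have a scorer. In this sketch, `score` is a hypothetical callable standing in for a cross-encoder or similar model that rates a (query, passage) pair; nothing here is tied to a specific library.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Reorder retrieved passages by a relevance score and keep the best.

    `score(query, passage)` is an assumed interface; higher means more
    relevant. Ties keep the original retrieval order.
    """
    scored = [(score(query, text), index, text) for index, text in enumerate(candidates)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:top_n]]
```

Retrieve generously, then let the reranker cut the list down to what the LLM actually sees.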

Measuring what’s actually working

The hardest part of building a RAG system isn’t the code. It’s knowing whether it works. RAG evaluation is tricky because you’re measuring two distinct things: retrieval quality and generation quality.

Separate them.

For retrieval: precision and recall. Of the chunks you retrieved, what percentage were actually relevant? Of all relevant chunks in your corpus, what percentage did you find? Teams using RAGAS and similar frameworks track these metrics continuously. RAGAS pioneered reference-free evaluation and remains the most-cited framework for RAG assessment. It measures faithfulness (is the output grounded in retrieved documents?), answer relevance (does it address the query?), context precision, and context recall. You can pinpoint exactly where your pipeline breaks.

Start with a curated test set. Take 50-100 real queries. Manually identify which documents should be retrieved for each. That’s your ground truth.

Track metrics that matter:

  • Precision at k (are my top 5 results relevant?)
  • Recall (did I find all relevant documents?)
  • Mean reciprocal rank (where does the first relevant result appear?)
  • NDCG (are more relevant results ranked higher?)
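
These four metrics fit in a few lines each. The sketch below assumes binary relevance (a document is relevant or it is not), unique document IDs in each result list, and a ground-truth set per query from your curated test set. Function names are illustrative.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain relative to an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Run these over every query in the test set after each pipeline change, and keep the per-query numbers so you can see which queries regressed.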

Don’t just measure final responses. A RAG system can give good answers despite bad retrieval if it gets lucky, or give bad answers despite perfect retrieval if generation fails. You need to know which component is breaking and why.

For generation quality, track faithfulness and answer relevance. Faithfulness: is the answer actually supported by the retrieved documents? Answer relevance: does it address the question being asked? Automated scores only take you so far. Once real users arrive, user feedback beats automated metrics.

Building for production from day one

Here is what production-ready architecture actually looks like.

Delta processing for document updates. Don’t re-embed your entire document collection when one page changes. Build a system like git diff that only processes what changed. Saves compute, reduces latency, prevents version drift.
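
One way to sketch delta processing is with content hashes: store a hash per document at index time, then compare on the next run. The names `content_hash` and `plan_delta` are hypothetical, and hashing whole documents is an assumption; hashing per chunk gives finer-grained updates.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_delta(
    previous_hashes: dict[str, str],
    current_docs: dict[str, str],
) -> tuple[list[str], list[str], dict[str, str]]:
    """Work out what to re-embed and what to delete.

    previous_hashes: doc_id -> hash stored at the last indexing run.
    current_docs:    doc_id -> current text.
    Returns (ids to embed, ids to delete, hashes to store for next time).
    """
    current_hashes = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    to_embed = sorted(
        doc_id for doc_id, digest in current_hashes.items()
        if previous_hashes.get(doc_id) != digest  # new or changed
    )
    to_delete = sorted(doc_id for doc_id in previous_hashes if doc_id not in current_hashes)
    return to_embed, to_delete, current_hashes
```

Unchanged documents never reach the embedding model, and deleted ones get removed from the index instead of lingering as stale results.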

Monitoring and alerting. Track retrieval latency, embedding generation time, and database query performance. Set alerts for sudden drops in precision or spikes in retrieval time. Production systems need observability to catch degradation before users notice. Too many teams rely on manual spot-checks and one-off experiments, which leads to painful iteration cycles and mysterious production failures.

Fallback strategies for missing information. RAG systems break when asked about topics not in the knowledge base. Detect low-confidence retrievals and handle them explicitly rather than letting the model hallucinate.
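
A minimal sketch of an explicit fallback, assuming your retriever returns similarity scores where higher is better. The `Hit` type, the 0.5 threshold, and the `generate` callable are all hypothetical; calibrate the threshold against your own test set.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hit:
    doc_id: str
    score: float  # assumed: similarity, higher is better


def select_context(
    hits: list[Hit], min_score: float = 0.5, min_hits: int = 1
) -> Optional[list[Hit]]:
    """Return the confident hits, or None if retrieval looks too weak."""
    confident = [hit for hit in hits if hit.score >= min_score]
    return confident if len(confident) >= min_hits else None


def answer(query: str, hits: list[Hit], generate: Callable[[str, list[Hit]], str]) -> str:
    """Only call the model when there is something credible to ground it."""
    context = select_context(hits)
    if context is None:
        return "I could not find this in the knowledge base."
    return generate(query, context)
```

The refusal message is a product decision; what matters is that the low-confidence path is handled in code rather than left to the model.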

Security at the retrieval layer. OWASP added Vector and Embedding Weaknesses as LLM08 in their 2025 Top 10 LLM risks, covering embedding inversion, adversarial embeddings, and cross-context leakage. Research on PoisonedRAG demonstrated that a small number of crafted documents can reliably manipulate AI responses through indirect prompt injection. Enforce access control at the retrieval layer, where embeddings are queried, and verify document provenance. RAG amplifies whatever security posture you already have, weak or strong.
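
As a sketch of what enforcing this can look like, the filter below drops any chunk the user is not entitled to see or that comes from an untrusted source. The `Chunk` fields and the group-based permission model are assumptions for illustration. In a real system you would usually push the same conditions into the vector store query as a metadata filter, so unauthorized chunks are never returned at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]  # groups permitted to read this chunk
    source: str                     # where the document was ingested from


def authorized_chunks(
    chunks: list[Chunk],
    user_groups: set[str],
    trusted_sources: set[str],
) -> list[Chunk]:
    """Keep only chunks the user may read and whose provenance is trusted."""
    return [
        chunk for chunk in chunks
        if chunk.allowed_groups & user_groups   # at least one shared group
        and chunk.source in trusted_sources     # provenance check
    ]
```

Run this before anything reaches the prompt. Filtering the model’s answer afterwards is too late.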

Cost optimization matters at scale. Embedding models vary in cost and latency. OpenAI’s text-embedding-3 family remains widely used. That said, open source alternatives like BGE-M3 or E5-Mistral offer comparable performance at lower cost. Voyage AI’s v4 series now leads benchmarks, outperforming competitors by 8-14%.

Vector database choice shapes both performance and cost. Pinecone handles billions of vectors with sub-50ms latency. Weaviate offers flexibility and hybrid search with both managed and open source options. Qdrant excels at complex filtering with strong multitenancy and is now SOC 2 Type II certified. Chroma works for prototypes and smaller deployments. Milvus 2.6 and Zilliz Cloud lead for enterprise scale, with tiered storage delivering 87% storage cost reduction by automatically moving data between hot, warm, and cold tiers. Benchmark with your actual data before committing to any of these.

The teams that succeed with RAG aren’t the ones with the fanciest models. They’re the ones who treated retrieval as the hard problem it is, and built accordingly.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience, he is the Co-Founder & CEO of Tallyfy® (raised $3.6m, the Workflow Made Easy® platform) and Partner at Blue Sheen, an AI advisory firm for mid-size companies. He helps companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.
