RAG Architecture Patterns That Hold Up in Production
A production-focused guide to retrieval-augmented generation: chunking, indexing, retrieval, reranking, and grounding.
Retrieval-Augmented Generation (RAG) is often presented as "embed documents, search, prompt model." In production, reliable RAG requires architecture choices that reduce hallucinations, control cost, and support traceability.
This article covers patterns that consistently work in real systems.
1. Ingestion as a pipeline, not a script
Treat ingestion like ETL:
- Extract: parse source formats (PDF, HTML, Markdown, DOCX)
- Normalize: remove boilerplate, preserve headings and tables where possible
- Chunk: split text with deterministic boundaries
- Enrich: attach metadata (source, timestamp, owner, ACL, section path)
- Index: write embeddings + metadata to vector store
If ingestion is ad hoc, retrieval quality will degrade as content grows.
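The stages above can be sketched as a small composable pipeline. This is a minimal illustration, not a production implementation: the function names (`normalize`, `chunk`, `enrich`, `ingest`) and the paragraph-level split are assumptions for demonstration, and a real pipeline would do far more in each stage.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def normalize(raw: str) -> str:
    # Strip blank lines and stray whitespace; real normalization also
    # removes boilerplate and preserves headings/tables.
    return "\n".join(line.strip() for line in raw.splitlines() if line.strip())

def chunk(text: str, metadata: dict) -> list[Chunk]:
    # Deterministic paragraph-level split; section 2 covers sizing and overlap.
    return [Chunk(p, dict(metadata)) for p in text.split("\n")]

def enrich(chunks: list[Chunk], source: str) -> list[Chunk]:
    # Attach the metadata retrieval and ACL filtering will rely on later.
    for i, c in enumerate(chunks):
        c.metadata.update({"source": source, "chunk_id": f"{source}#chunk-{i}"})
    return chunks

def ingest(raw: str, source: str) -> list[Chunk]:
    # Extract -> Normalize -> Chunk -> Enrich; indexing happens downstream.
    return enrich(chunk(normalize(raw), {}), source)

docs = ingest("  Title  \n\n  Body paragraph.  ", "doc-123")
```

Because each stage is a pure function, the pipeline is easy to test, version, and rerun deterministically when content changes.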
2. Chunking strategy matters more than model size
Common production baseline:
- 300-800 token chunks
- 10-20% overlap
- Heading-aware splitting
Why: tiny chunks lose context; huge chunks hurt retrieval precision and increase prompt cost.
Keep chunking deterministic and versioned so index rebuilds are reproducible.
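A deterministic sliding-window chunker over tokens might look like the sketch below. The `size` and `overlap` defaults sit inside the ranges above but are illustrative; tokenization itself (here, a plain word list) is assumed to happen upstream.

```python
def chunk_tokens(tokens: list[str], size: int = 400, overlap: int = 60) -> list[list[str]]:
    """Deterministic sliding-window chunking with fixed overlap."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks

# Numbered tokens make the overlap visible at the chunk boundaries.
words = [str(i) for i in range(1000)]
parts = chunk_tokens(words, size=400, overlap=60)
```

Because the boundaries depend only on the input and the two parameters, rebuilding the index with the same chunker version reproduces identical chunks.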
3. Hybrid retrieval beats pure vector search
Use both:
- Lexical search (BM25 or equivalent) for exact terms, product names, IDs
- Vector similarity for semantic matches
Then fuse and rerank. This pattern handles both keyword-heavy and semantic queries better than either method alone.
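One common fusion method is Reciprocal Rank Fusion (RRF), which combines ranked lists using only rank positions, so lexical and vector scores never need to be on the same scale. The sketch below assumes each retriever returns an ordered list of document IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d3", "d1", "d7"]   # BM25-style results
vector  = ["d1", "d9", "d3"]   # embedding similarity results
fused = rrf([lexical, vector])
```

Documents that appear high in both lists (here `d1` and `d3`) rise to the top, while single-list hits are kept as lower-ranked candidates for the reranker.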
4. Add reranking before generation
Initial retrieval returns candidates; reranking improves final context quality.
Typical flow:
- Retrieve top 20-50 candidates
- Rerank to top 5-10 using a cross-encoder or LLM-based ranker
- Pass only top grounded chunks to generation
This often improves factual quality and reduces irrelevant context in the final prompt.
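The retrieve-then-rerank step reduces a wide candidate set to a small, high-quality context. In the sketch below, `overlap_score` is a deliberately toy stand-in for a real cross-encoder or LLM-based ranker; only the flow (score every query-chunk pair, keep the best `top_k`) matches the pattern described above.

```python
def rerank(query: str, candidates: list[str], score_fn, top_k: int = 5) -> list[str]:
    """Score each (query, chunk) pair with score_fn and keep the best top_k."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_k]

def overlap_score(query: str, chunk: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query terms in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

candidates = [
    "billing works via monthly invoices and how they are paid",
    "reset your password from the login page",
    "invoices are issued monthly for billing",
]
top = rerank("how does billing work", candidates, overlap_score, top_k=2)
```

Swapping `overlap_score` for a cross-encoder call changes only one function; the retrieval and packing stages stay untouched.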
5. Enforce grounding in prompt and output
Generation prompt should require:
- Citation markers tied to retrieved chunks
- Explicit refusal when evidence is insufficient
- No claims outside provided context for high-trust domains
Return structured output:
```json
{
  "answer": "...",
  "citations": ["doc-123#chunk-7", "doc-415#chunk-3"],
  "confidence": "low|medium|high"
}
```
Structured outputs make validation and UI rendering far easier.
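A post-generation check can enforce this contract mechanically: parse the model's JSON, reject citations that point outside the retrieved chunk set, and reject malformed confidence values. The validator below is a minimal sketch of that idea; field names match the structure shown above.

```python
import json

def validate_answer(payload: str, retrieved_ids: set[str]) -> dict:
    """Parse the model's JSON output and reject ungrounded citations."""
    out = json.loads(payload)
    for key in ("answer", "citations", "confidence"):
        if key not in out:
            raise ValueError(f"missing field: {key}")
    unknown = [c for c in out["citations"] if c not in retrieved_ids]
    if unknown:
        # A citation the retriever never returned is a grounding violation.
        raise ValueError(f"citations outside retrieved context: {unknown}")
    if out["confidence"] not in {"low", "medium", "high"}:
        raise ValueError("invalid confidence value")
    return out

raw = '{"answer": "...", "citations": ["doc-123#chunk-7"], "confidence": "high"}'
result = validate_answer(raw, {"doc-123#chunk-7", "doc-415#chunk-3"})
```

Failing validation can trigger a retry, a refusal, or a fallback answer, but never a silently ungrounded response.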
6. Apply access control at retrieval time
Never retrieve across tenant or role boundaries and hope the LLM ignores restricted context. Access control belongs in the retriever query path:
- Tenant filter
- Role-based metadata filter
- Document-level ACL checks
Security boundary first, generation second.
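In practice this means the retriever applies the filters before any chunk can reach the prompt. The sketch below assumes each indexed chunk carries `tenant` and `allowed_roles` metadata (an empty role list meaning "visible to everyone in the tenant"); real vector stores typically express the same thing as a metadata filter on the query.

```python
def acl_filter(chunks: list[dict], tenant: str, roles: set[str]) -> list[dict]:
    """Drop any chunk outside the caller's tenant or role set before prompting."""
    return [
        c for c in chunks
        if c["tenant"] == tenant
        and (not c["allowed_roles"] or roles & set(c["allowed_roles"]))
    ]

index = [
    {"id": "a", "tenant": "acme",  "allowed_roles": ["admin"]},
    {"id": "b", "tenant": "acme",  "allowed_roles": []},   # tenant-public
    {"id": "c", "tenant": "other", "allowed_roles": []},   # wrong tenant
]
visible = acl_filter(index, tenant="acme", roles={"viewer"})
```

Because filtering happens in the retrieval path, a prompt-injection attack or model error cannot leak a chunk the caller was never allowed to see.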
7. Support freshness and reindexing
RAG quality depends on current data.
Use:
- Incremental reindex on source changes
- Full rebuild jobs for schema/embedding model updates
- Index versioning + cutover plan
If you cannot rebuild safely, you cannot evolve the system.
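Incremental reindexing is often driven by content hashes: compare what the sources contain now with what the index holds, then upsert the changed documents and delete the removed ones. This is a simplified sketch; `plan_reindex` and its document-level granularity are illustrative.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def plan_reindex(current: dict[str, str], indexed: dict[str, str]) -> tuple[list, list]:
    """Diff source content hashes against the index to plan minimal work."""
    to_upsert = [doc_id for doc_id, h in current.items() if indexed.get(doc_id) != h]
    to_delete = [doc_id for doc_id in indexed if doc_id not in current]
    return to_upsert, to_delete

# doc-1 changed, doc-2 is new, doc-3 was deleted at the source.
current = {"doc-1": content_hash("v2"), "doc-2": content_hash("v1")}
indexed = {"doc-1": content_hash("v1"), "doc-3": content_hash("v1")}
upserts, deletes = plan_reindex(current, indexed)
```

Full rebuilds (for a new embedding model or schema) still go through a versioned index plus cutover, since hash diffing cannot detect that every stored vector is stale.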
8. Add observability across stages
Instrument each step:
- Query rewrite latency
- Retrieval latency + candidate count
- Reranker latency
- Prompt token count
- Generation latency + completion tokens
- Citation coverage and answer refusal rate
Without per-stage telemetry, you cannot diagnose quality or cost regressions.
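A lightweight way to get per-stage latency is a timing context manager wrapped around each pipeline step. The sketch below only collects wall-clock timings in memory; a real system would export these to its metrics backend along with token counts and candidate counts.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Record per-stage latency so regressions can be traced to one stage."""
    def __init__(self):
        self.timings_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record even if the stage raised, so failures are visible too.
            self.timings_ms[name] = (time.perf_counter() - start) * 1000

timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.01)   # stand-in for the actual retrieval call
with timer.stage("rerank"):
    time.sleep(0.005)  # stand-in for the reranker call
```

Tagging each stage with the same request ID lets you attribute a slow or expensive answer to retrieval, reranking, or generation specifically.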
9. Cache where it actually helps
Useful cache layers:
- Embedding cache for duplicate chunks
- Retrieval cache for repeated queries in read-heavy apps
- Response cache for strictly deterministic requests
Avoid caching where personalization or data freshness requirements make cache invalidation complex.
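The embedding cache is the safest of the three, because identical chunk text always maps to an identical vector. A minimal sketch, keyed by content hash (`fake_embed` stands in for a real embedding call):

```python
import hashlib

class EmbeddingCache:
    """Skip re-embedding identical chunk text across documents and rebuilds."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store: dict[str, list[float]] = {}
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.store:
            self.misses += 1  # only cache misses pay for an embedding call
            self.store[key] = self.embed_fn(text)
        return self.store[key]

def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding API; content is irrelevant here.
    return [float(len(text))]

cache = EmbeddingCache(fake_embed)
cache.get("same chunk")
cache.get("same chunk")   # cache hit: no second embedding call
cache.get("other chunk")
```

Content-hash keys also make the cache survive reingestion: unchanged chunks cost nothing on a rebuild.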
10. Define failure behavior explicitly
RAG should fail safely:
- Vector store unavailable: fallback to lexical-only mode or safe error
- No relevant context: refuse and ask clarifying question
- Context too large: compress or prioritize before generation
Clear failure paths prevent silent low-quality answers.
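These failure paths can be made explicit in the orchestration code rather than left to chance. The sketch below shows the first two: fall back to lexical-only retrieval when the vector store is unreachable, and refuse when nothing relevant comes back. The function names and `ConnectionError` signal are assumptions for illustration.

```python
def answer_with_fallback(query, vector_search, lexical_search, min_hits: int = 1):
    """Degrade gracefully: vector -> lexical-only -> explicit refusal."""
    try:
        hits = vector_search(query)
    except ConnectionError:
        hits = lexical_search(query)  # degraded but safe mode
    if len(hits) < min_hits:
        # Refuse instead of generating without evidence.
        return {"status": "refused", "reason": "no relevant context"}
    return {"status": "ok", "context": hits}

def broken_vector(query):
    raise ConnectionError("vector store unavailable")

def lexical(query):
    return ["chunk about " + query]

result = answer_with_fallback("billing", broken_vector, lexical)
```

Each branch returns a structured status, so the caller (and your telemetry) can distinguish a degraded answer from a refusal from a healthy response.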
Reference architecture
- Query enters API
- Optional normalization / rewrite
- Hybrid retrieval (vector + lexical)
- Metadata and ACL filtering
- Reranking
- Context packing with token budget
- Grounded generation with citation contract
- Post-checks (schema, citation validity)
- Response + telemetry
This architecture is modular enough to evolve and strict enough to remain reliable under load.
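The modularity claim can be made concrete by wiring the stages as pluggable functions, so any one of them (retriever, reranker, validator) can be swapped without touching the rest. Every stage implementation below is a toy stand-in purely to show the wiring:

```python
def rag_pipeline(query, retrieve, acl_ok, rerank, pack, generate, validate):
    """Wire the reference stages; each argument is a pluggable stage function."""
    candidates = retrieve(query)                       # hybrid retrieval
    allowed = [c for c in candidates if acl_ok(c)]     # ACL filtering
    ranked = rerank(query, allowed)                    # reranking
    context = pack(ranked)                             # context packing
    answer = generate(query, context)                  # grounded generation
    return validate(answer, context)                   # post-checks

result = rag_pipeline(
    "billing",
    retrieve=lambda q: [
        {"id": "c1", "text": "billing info", "public": True},
        {"id": "c2", "text": "restricted note", "public": False},
    ],
    acl_ok=lambda c: c["public"],
    rerank=lambda q, cs: cs[:1],
    pack=lambda cs: [c["text"] for c in cs],
    generate=lambda q, ctx: {"answer": ctx[0], "citations": ["c1"]},
    validate=lambda a, ctx: a,
)
```

Each lambda would be a real component in production; the skeleton only fixes the order of stages and the data passed between them.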
Final note
RAG in production is an information retrieval system first and an LLM system second. The teams that get this right invest heavily in ingestion quality, retrieval evaluation, and grounding guarantees before tuning prompts.