
RAG Architecture Patterns That Hold Up in Production

A production-focused guide to retrieval-augmented generation: chunking, indexing, retrieval, reranking, and grounding.

2026-02-22


Retrieval-Augmented Generation (RAG) is often presented as "embed documents, search, prompt model." In production, reliable RAG requires architecture choices that reduce hallucinations, control cost, and support traceability.

This article covers patterns that consistently work in real systems.

1. Ingestion as a pipeline, not a script

Treat ingestion like ETL:

  • Extract: parse source formats (PDF, HTML, Markdown, DOCX)
  • Normalize: remove boilerplate, preserve headings and tables where possible
  • Chunk: split text with deterministic boundaries
  • Enrich: attach metadata (source, timestamp, owner, ACL, section path)
  • Index: write embeddings + metadata to vector store

If ingestion is ad hoc, retrieval quality will degrade as content grows.
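The stages above can be sketched as a small pipeline. This is a minimal illustration with stdlib only: the function bodies, the `Chunk` shape, and the metadata fields are assumptions, and real extractors for PDF/HTML/DOCX would replace the placeholder.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)

def extract(raw: str) -> str:
    # Placeholder: real extractors handle PDF, HTML, Markdown, DOCX.
    return raw

def normalize(text: str) -> str:
    # Strip trailing whitespace and drop blank lines (boilerplate removal lives here).
    lines = [ln.rstrip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)

def chunk(text: str, size: int = 400) -> list[str]:
    # Deterministic fixed-size split; section-aware splitting would refine this.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def ingest(doc_id: str, raw: str, owner: str) -> list[Chunk]:
    text = normalize(extract(raw))
    chunks = []
    for i, piece in enumerate(chunk(text)):
        cid = f"{doc_id}#chunk-{i}"
        meta = {
            "source": doc_id,
            "owner": owner,
            # Content hash enables incremental reindexing later.
            "content_hash": hashlib.sha256(piece.encode()).hexdigest(),
        }
        chunks.append(Chunk(cid, piece, meta))
    return chunks
```

The final "Index" step (embedding + writing to the vector store) would consume this list of chunks.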

2. Chunking strategy matters more than model size

Common production baseline:

  • 300-800 token chunks
  • 10-20% overlap
  • Heading-aware splitting

Why: tiny chunks lose context; huge chunks hurt retrieval precision and increase prompt cost.

Keep chunking deterministic and versioned so index rebuilds are reproducible.
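A deterministic sliding-window splitter along these lines covers the baseline; approximating tokens by whitespace words is an assumption here, and a production system would count with the model's actual tokenizer.

```python
CHUNKER_VERSION = "v1"  # bump when parameters change so index rebuilds are traceable

def split_with_overlap(text: str, size: int = 400, overlap: int = 60) -> list[str]:
    """Deterministic sliding-window splitter (whitespace words approximate tokens)."""
    assert 0 <= overlap < size
    words = text.split()
    step = size - overlap  # advance by size minus overlap each window
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already covers the tail
    return chunks
```

With size 400 and overlap 60 this stays inside the 300-800 token / 10-20% overlap baseline above.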

3. Combine lexical and vector retrieval

Use both:

  • Lexical search (BM25 or equivalent) for exact terms, product names, IDs
  • Vector similarity for semantic matches

Then fuse and rerank. This pattern handles both keyword-heavy and semantic queries better than either method alone.
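A common fusion choice is Reciprocal Rank Fusion (RRF), which needs only the rank positions from each retriever. A minimal sketch, assuming each retriever returns an ordered list of chunk IDs:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive in production because it requires no score calibration between BM25 and cosine similarity, which live on incompatible scales.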

4. Add reranking before generation

Initial retrieval returns candidates; reranking improves final context quality.

Typical flow:

  1. Retrieve top 20-50 candidates
  2. Rerank to top 5-10 using a cross-encoder or LLM-based ranker
  3. Pass only top grounded chunks to generation

This often improves factual quality and reduces irrelevant context in the final prompt.
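The rerank step can be sketched as below. The default scoring function here is a naive term-overlap stand-in for illustration only; in practice you would inject a cross-encoder or LLM-based ranker as `score_fn`.

```python
def rerank(query: str, candidates: list[str], top_n: int = 5,
           score_fn=None) -> list[str]:
    """Keep only the top_n candidates by relevance score.

    score_fn stands in for a cross-encoder or LLM ranker; the default
    term-overlap score is illustrative only.
    """
    if score_fn is None:
        q_terms = set(query.lower().split())
        score_fn = lambda q, c: len(q_terms & set(c.lower().split()))
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]
```

Keeping the ranker behind a function boundary like this makes it easy to swap models without touching the retrieval or generation stages.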

5. Enforce grounding in prompt and output

Generation prompt should require:

  • Citation markers tied to retrieved chunks
  • Explicit refusal when evidence is insufficient
  • No claims outside provided context for high-trust domains

Return structured output:

{
  "answer": "...",
  "citations": ["doc-123#chunk-7", "doc-415#chunk-3"],
  "confidence": "low|medium|high"
}

Structured outputs make validation and UI rendering far easier.
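A post-check on that structure can enforce the citation contract before the answer reaches the UI. A minimal sketch, assuming the retriever exposes the set of chunk IDs it actually returned:

```python
import json

REQUIRED_KEYS = {"answer", "citations", "confidence"}

def validate_response(raw: str, retrieved_ids: set[str]) -> dict:
    """Parse the model's JSON and reject citations outside the retrieved set."""
    data = json.loads(raw)
    if not REQUIRED_KEYS <= data.keys():
        raise ValueError(f"missing keys: {REQUIRED_KEYS - data.keys()}")
    ungrounded = [c for c in data["citations"] if c not in retrieved_ids]
    if ungrounded:
        raise ValueError(f"citations not grounded in retrieved context: {ungrounded}")
    if data["confidence"] not in {"low", "medium", "high"}:
        raise ValueError(f"invalid confidence value: {data['confidence']!r}")
    return data
```

Rejecting ungrounded citations here, rather than trusting the model, is what makes the citation markers meaningful.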

6. Apply access control at retrieval time

Never retrieve across tenant or role boundaries and hope the LLM ignores restricted context. Access control belongs in the retriever query path:

  • Tenant filter
  • Role-based metadata filter
  • Document-level ACL checks

Security boundary first, generation second.
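Those three layers might look like the sketch below. The filter dictionary shape is an assumption; real vector stores vary in filter syntax (many use Mongo-style operators such as `$in`).

```python
def build_retrieval_filter(tenant_id: str, roles: list[str]) -> dict:
    """Metadata filter applied inside the vector store query, not after generation."""
    return {
        "tenant_id": tenant_id,           # hard tenant boundary
        "allowed_roles": {"$in": roles},  # role-based metadata filter
    }

def acl_check(chunk_meta: dict, user_id: str, roles: list[str]) -> bool:
    """Document-level ACL: allow on explicit user grant or role match."""
    acl = chunk_meta.get("acl", {})
    if user_id in acl.get("users", []):
        return True
    return bool(set(roles) & set(acl.get("roles", [])))
```

The key property: a chunk the user cannot see never enters the candidate set, so it can never leak into a prompt.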

7. Support freshness and reindexing

RAG quality depends on current data.

Use:

  • Incremental reindex on source changes
  • Full rebuild jobs for schema/embedding model updates
  • Index versioning + cutover plan

If you cannot rebuild safely, you cannot evolve the system.

8. Add observability across stages

Instrument each step:

  • Query rewrite latency
  • Retrieval latency + candidate count
  • Reranker latency
  • Prompt token count
  • Generation latency + completion tokens
  • Citation coverage and answer refusal rate

Without per-stage telemetry, you cannot diagnose quality or cost regressions.
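A lightweight per-stage timer is enough to get started; this stdlib sketch collects latencies and counters per request, with export to your metrics backend left out.

```python
import time
from contextlib import contextmanager

class StageTelemetry:
    """Collects per-stage latency and counters so regressions can be localized."""

    def __init__(self):
        self.timings_ms: dict[str, float] = {}
        self.counters: dict[str, int] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[name] = (time.perf_counter() - start) * 1000

    def count(self, name: str, value: int):
        self.counters[name] = value
```

Usage inside the request path:

```python
telemetry = StageTelemetry()
with telemetry.stage("retrieval"):
    candidates = ["chunk-1", "chunk-2"]  # stand-in for the real retrieval call
telemetry.count("candidate_count", len(candidates))
```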

9. Cache where it actually helps

Useful cache layers:

  • Embedding cache for duplicate chunks
  • Retrieval cache for repeated queries in read-heavy apps
  • Response cache for strictly deterministic requests

Avoid caching where personalization or data freshness requirements make cache invalidation complex.
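The embedding cache is the simplest of the three: keying by content hash means identical chunks are embedded once, and the key is stable across reindex runs. A minimal in-memory sketch (a real deployment would back this with Redis or similar):

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so duplicate chunks embed only once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # injected: the real embedding model call
        self._store: dict[str, list[float]] = {}
        self.hits = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self.embed_fn(text)
        return self._store[key]
```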

10. Define failure behavior explicitly

RAG should fail safely:

  • Vector store unavailable: fallback to lexical-only mode or safe error
  • No relevant context: refuse and ask clarifying question
  • Context too large: compress or prioritize before generation

Clear failure paths prevent silent low-quality answers.
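The first two failure paths can be made explicit in the request flow. All callables below are injected stand-ins for real components, and the relevance threshold is an illustrative assumption:

```python
def answer_query(query, vector_search, lexical_search, generate,
                 min_score: float = 0.3):
    """Fail-safe flow: degrade to lexical-only mode, refuse on weak evidence."""
    try:
        hits = vector_search(query)
    except ConnectionError:
        # Vector store unavailable: fall back to lexical-only retrieval.
        hits = lexical_search(query)
    relevant = [h for h in hits if h["score"] >= min_score]
    if not relevant:
        # No relevant context: refuse rather than answer from thin air.
        return {"answer": None,
                "refusal": "No relevant context found; can you clarify the question?"}
    return generate(query, relevant)
```

The third path (context too large) is handled separately by the token-budgeted context packing step in the reference architecture below.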

Reference architecture

  1. Query enters API
  2. Optional normalization / rewrite
  3. Hybrid retrieval (vector + lexical)
  4. Metadata and ACL filtering
  5. Reranking
  6. Context packing with token budget
  7. Grounded generation with citation contract
  8. Post-checks (schema, citation validity)
  9. Response + telemetry

This architecture is modular enough to evolve and strict enough to remain reliable under load.
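Step 6, context packing with a token budget, is the stage most often left implicit. A greedy sketch, again approximating tokens by whitespace words (a real system would use the model's tokenizer):

```python
def pack_context(chunks: list[dict], budget_tokens: int = 3000) -> list[dict]:
    """Greedy context packing: keep highest-ranked chunks within the token budget.

    Assumes `chunks` is already sorted by rerank score, best first.
    """
    packed, used = [], 0
    for chunk in chunks:
        cost = len(chunk["text"].split())  # crude token estimate
        if used + cost > budget_tokens:
            continue  # skip whole chunks rather than truncate mid-chunk
        packed.append(chunk)
        used += cost
    return packed
```

Skipping over-budget chunks rather than truncating them keeps each passed chunk citable as a whole, which the citation contract in step 7 depends on.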

Final note

RAG in production is an information retrieval system first and an LLM system second. The teams that get this right invest heavily in ingestion quality, retrieval evaluation, and grounding guarantees before tuning prompts.
