LLM Evaluation in Production: A Practical Framework
How to evaluate LLM systems in production using offline tests, online monitoring, and human review loops.
LLM Evaluation in Production: A Practical Framework
If you deploy LLM features without evaluation, you are shipping a random system. Prompt quality, model behavior, and tool integrations all drift over time. Evaluation needs to be a first-class part of your software lifecycle.
This guide outlines a practical framework you can run with today.
1. Start with task-level definitions
A general "is the output good?" score is not enough. Define quality per task:
- Retrieval QA: groundedness, completeness, citation accuracy
- Classification: precision, recall, calibration
- Summarization: factual consistency, coverage, readability
- Agentic workflows: tool-call correctness, step success rate, task completion rate
If a metric cannot influence a release decision, it is not useful.
2. Build a representative eval set
Your eval set should mirror real user traffic, not only ideal prompts.
Include:
- Typical cases
- Boundary cases (very long input, partial context, ambiguous instructions)
- Known failure patterns from production tickets
- Adversarial inputs (prompt injection attempts, malformed payloads)
Use versioned datasets in Git so every model/prompt change can be compared against a fixed baseline.
3. Separate offline and online evaluation
You need both.
Offline (pre-release)
Run deterministic checks in CI before rollout:
- Schema conformance (JSON validity, required fields)
- Tool-call argument validation
- Retrieval citation checks
- Task-specific scoring (e.g., F1 for classification)
Online (post-release)
Monitor real behavior:
- Error rate (timeouts, parsing failures, tool failures)
- Latency p50/p95/p99
- Cost per request and per successful task
- User feedback signals (thumbs up/down, edit distance to accepted answer)
Offline keeps bad builds out; online catches drift and real-world edge cases.
4. Use a layered metric strategy
Do not rely on one "judge score." Use layers:
- Hard checks: deterministic pass/fail (schema, citation existence, policy blocks)
- Task metrics: exact match, F1, completion rate
- Model-based scoring: rubric-based quality checks by another model
- Human review: periodic sampling for nuanced quality and safety decisions
Hard checks should gate releases. Human review should calibrate your rubric over time.
5. Evaluate retrieval and generation separately in RAG
In RAG systems, errors often come from retrieval, not generation.
Track retrieval metrics:
- Recall@k
- Context precision (how much retrieved text is relevant)
- Citation hit rate
Track generation metrics:
- Answer correctness relative to evidence
- Citation faithfulness (claims supported by cited chunks)
- Refusal correctness when evidence is missing
This split makes root-cause analysis much faster.
6. Add safety and abuse evaluations
Production LLM systems must test misuse scenarios:
- Prompt injection in user-provided context
- Sensitive data extraction attempts
- Harmful instruction generation
- Data exfiltration via tool calls
Also verify policy behavior:
- Correct refusal on disallowed requests
- Proper fallback paths
- Audit logging completeness
Safety should be measured continuously, not only in a one-time red-team exercise.
7. Run controlled rollouts
Use standard rollout controls:
- Shadow mode (model runs but does not affect user output)
- Canary rollout by traffic slice
- Automatic rollback thresholds on error/latency/cost spikes
Keep rollback criteria explicit before rollout begins.
8. Make eval results part of release governance
Treat LLM changes like any high-impact software change:
- PR includes prompt/model/tool diffs
- CI publishes eval report versus baseline
- Release blocked if regression exceeds threshold
- Post-deploy review checks online metrics at fixed intervals
If your process cannot answer "what changed and why quality moved," the system is not under control.
Example release checklist
Before promotion to production:
- Offline eval pass rate meets baseline threshold
- Safety eval set passes hard checks
- Tool-call schema and timeout tests pass
- Canary metrics stable for latency, error, and cost
- Human sample review accepted
This checklist is simple, auditable, and repeatable.
Final note
LLM evaluation is not a single score or a one-time test. It is an operating model: versioned datasets, deterministic gates, real-traffic telemetry, and regular human calibration. Teams that adopt this early move faster with less risk.
Need help with your infrastructure?
Let's discuss your project and find the best solution together.
Get in touchRelated articles
Cloud Architecture Through the Shared Responsibility Model
How to design cloud systems with clear provider/customer boundaries for security, reliability, and operations.
Progressive Delivery in DevOps: Canary, Blue-Green, and Feature Flags
How to reduce deployment risk with progressive delivery patterns and measurable rollback criteria.