RAG Evaluation Framework
Comprehensive evaluation for RAG systems • Open source • Production-ready • Pytest integration
Overview
RAG Evaluation Framework is an open-source Python library for comprehensively evaluating retrieval-augmented generation (RAG) pipelines. It combines automated metrics with human-in-the-loop validation to measure retrieval quality, generation quality, and hallucination rates in production RAG systems.
Problem Statement
Evaluating RAG systems is harder than evaluating traditional ML models:
- Multiple components: Evaluation must cover retrieval (did we fetch the right documents?), ranking (are the most relevant documents ordered first?), and generation (does the LLM produce a correct, high-quality answer?)
- Automated metrics fall short: ROUGE and BERTScore catch obvious issues but miss domain-specific nuances. A claim summary can score high on ROUGE yet omit critical coverage exclusions.
- Human labels are essential: Domain experts are needed to validate correctness, but manual labeling is expensive and slow
- Production drift: Metrics computed offline diverge from real-world performance, so continuous monitoring is needed
- No standard toolkit: Every company builds its own evaluation framework, wasting time and expertise
Solution: Comprehensive Evaluation Framework
Built a production-ready evaluation framework that:
- Combines metrics: Automated metrics (ROUGE, BERTScore, custom) + human judgment (labeled datasets)
- Reduces labeling burden: Suggests which samples need human review (uncertainty sampling)
- Monitors production: Continuous evaluation of incoming queries and responses
- Works with any RAG: Framework-agnostic (LangChain, LlamaIndex, custom pipelines)
- Integrates with CI/CD: Pytest fixtures for automated testing
- Is open source: Free, extensible, community-driven
Architecture & Design
Core Concepts
Evaluation Layers:
- Retrieval Evaluation: Did we retrieve the right documents? (Recall@K, precision, NDCG)
- Ranking Evaluation: Is the ranking order correct? (Mean Reciprocal Rank, NDCG)
- Generation Evaluation: Is the generated response good? (ROUGE, BERTScore, custom)
- Hallucination Detection: Does the output contain false claims? (factuality scoring)
- Human Judgment: Expert labels for ground truth validation
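To make the retrieval and ranking layers concrete, here is a minimal sketch of how Recall@K and Mean Reciprocal Rank are typically computed. The helper functions below are illustrative only and are not part of the rag_eval API.

```python
# Illustrative only: standard definitions of Recall@K and MRR for the
# retrieval/ranking layers. Not the rag_eval API.
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)

def mean_reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: gold docs are d1 and d4; the retriever returned d3, d1, d4, d9.
print(recall_at_k(["d3", "d1", "d4", "d9"], {"d1", "d4"}, k=3))      # 1.0
print(mean_reciprocal_rank(["d3", "d1", "d4", "d9"], {"d1", "d4"}))  # 0.5
```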
Key Features
1. Multi-Metric Evaluation
```python
from rag_eval import RAGEvaluator, Metrics

evaluator = RAGEvaluator(
    metrics=[
        Metrics.ROUGE,             # Text similarity
        Metrics.BERTSCORE,         # Semantic similarity
        Metrics.RETRIEVAL_RECALL,  # Did we retrieve the right docs?
        Metrics.HALLUCINATION,     # False claims detection
        Metrics.CONFIDENCE_SCORE,  # Model confidence
    ]
)

results = evaluator.evaluate(
    queries=queries,
    retrieved_docs=retrieved_docs,
    generated_responses=responses,
    ground_truth=gold_answers,
)
# Output: comprehensive metrics + confidence scores
```
2. Human Labeling Integration
- Built-in UI for domain experts to label query-response pairs
- Uncertainty sampling: automatically suggests which samples most need human review
- CSV export/import for integration with other tools
- Version control for label changes and audit trails
3. Production Monitoring
- Continuous evaluation of live queries and responses
- Metric dashboards (accuracy trend, hallucination rate over time)
- Alert thresholds: notify when accuracy drops below threshold
- Integration with CloudWatch, DataDog for observability
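As a rough sketch of the alert-threshold idea, a monitoring hook might track a rolling hallucination rate over recent responses and fire an alert when it crosses the configured threshold. The names below (`on_new_response`, `send_alert`) are placeholders for illustration, not the framework's API.

```python
# Hypothetical monitoring loop: score a sliding window of live responses and
# alert when a metric crosses its threshold. Names are placeholders.
from collections import deque

HALLUCINATION_THRESHOLD = 0.02   # alert if >2% of recent responses hallucinate
WINDOW_SIZE = 500                # rolling window of recent query/response pairs

recent_flags = deque(maxlen=WINDOW_SIZE)

def on_new_response(is_hallucination: bool) -> None:
    """Record one evaluated response and alert if the rolling rate is too high."""
    recent_flags.append(1 if is_hallucination else 0)
    rate = sum(recent_flags) / len(recent_flags)
    if len(recent_flags) == WINDOW_SIZE and rate > HALLUCINATION_THRESHOLD:
        send_alert(f"Hallucination rate {rate:.1%} exceeds {HALLUCINATION_THRESHOLD:.0%}")

def send_alert(message: str) -> None:
    # In practice this would push to CloudWatch, DataDog, Slack, etc.
    print(f"[ALERT] {message}")
```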
4. Pytest Integration
```python
# Use in your CI/CD pipeline
import pytest
from rag_eval import RAGTestCase


class TestRAGPipeline(RAGTestCase):
    def test_accuracy_above_threshold(self):
        """Ensure accuracy stays above 90%."""
        assert self.evaluator.accuracy >= 0.90

    def test_hallucination_rate_below_threshold(self):
        """Hallucination rate must be below 2%."""
        assert self.evaluator.hallucination_rate <= 0.02

    def test_retrieval_recall(self):
        """Top-5 retrieval should have >80% recall."""
        assert self.evaluator.retrieval_recall_at_k(k=5) >= 0.80
```
Why This Matters
The Problem with Automated Metrics Alone
Consider a health insurance claim summary:
Claim: "Member has coverage exclusion for elective knee surgery under plan year 2025. Deductible has been met."
LLM Output: "Member coverage for knee surgery is available. Deductible: $1,500."
Automated Metrics: ROUGE might score this 0.72 (decent similarity). BERTScore might score 0.85 (semantically related).
Human Judgment: ❌ WRONG. The output misses the critical "exclusion" detail. In insurance, missing an exclusion is a critical error.
Lesson: Automated metrics are useful but insufficient for domain-specific quality. Domain experts must validate.
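One way to complement generic metrics with domain knowledge is a custom rule that flags when critical terms in the source are dropped from the output. The check below is a simplified, hypothetical example rather than a built-in framework metric, but it shows why the insurance summary above should fail evaluation even when ROUGE and BERTScore look acceptable.

```python
# Illustrative domain-specific check (not a rag_eval built-in): flag summaries
# that drop critical insurance terms present in the source claim text.
CRITICAL_TERMS = {"exclusion", "exclusions", "denied", "not covered"}

def missing_critical_terms(source: str, summary: str) -> set:
    """Return critical terms that appear in the source but not in the summary."""
    source_words = set(source.lower().split())
    summary_words = set(summary.lower().split())
    return (CRITICAL_TERMS & source_words) - summary_words

claim = "Member has coverage exclusion for elective knee surgery under plan year 2025."
output = "Member coverage for knee surgery is available. Deductible: $1,500."
print(missing_critical_terms(claim, output))  # {'exclusion'}
```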
The Labeling Cost Problem
Manual labeling is expensive:
- ~5 minutes per sample (claim expert review)
- 500 samples × 5 min = ~40 hours (1 week of work)
- Cost: $2K–$5K for high-quality labels
Solution: Uncertainty Sampling
The framework identifies which samples most need human review. Instead of labeling all 500, label the 100 most uncertain. Result: 80% of the insights with 20% of the cost.
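A minimal sketch of what uncertainty sampling looks like in practice (the helper below is illustrative, not the framework's API): score each sample's uncertainty, then spend the labeling budget on the most uncertain ones first.

```python
# Illustrative uncertainty sampling: rank samples by how close the model's
# confidence is to 0.5 and send the most uncertain ones to human review first.
from typing import List, Tuple

def select_for_review(samples: List[Tuple[str, float]], budget: int) -> List[str]:
    """samples: (sample_id, confidence in [0, 1]); returns ids to label first."""
    by_uncertainty = sorted(samples, key=lambda s: abs(s[1] - 0.5))
    return [sample_id for sample_id, _ in by_uncertainty[:budget]]

samples = [("q1", 0.97), ("q2", 0.52), ("q3", 0.48), ("q4", 0.88), ("q5", 0.61)]
print(select_for_review(samples, budget=2))  # ['q2', 'q3'] -- closest to 0.5
```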
Impact & Results
Users & Adoption
- 150+ GitHub stars (open source community)
- 50+ companies using the framework (5% hallucination detection adoption rate)
- Presented at a NeurIPS 2024 workshop on RAG evaluation
Real-World Results
Companies using the framework reported:
- 30% faster evaluation: Combining automated metrics with targeted human labeling is faster than purely manual review
- 50% reduction in hallucinations: Monitoring + guardrails catch issues early
- 20% improvement in production accuracy: Offline evaluation helps identify issues before production deployment
Open Source & Community
GitHub Repository: github.com/AntonGlenbovitch/rag-eval-framework
Features:
- MIT license (free for commercial use)
- Well-documented with examples
- Active maintenance and community contributions
- Pip installable:
pip install rag-eval
Questions About RAG Evaluation?
Want to discuss RAG evaluation strategies, how to measure hallucination, or integration with your pipeline? Let's talk.