RAG Evaluation Framework
Comprehensive evaluation for RAG systems • Open source • Production-ready • Pytest integration
Overview
RAG Evaluation Framework is an open-source Python library for comprehensively evaluating retrieval-augmented generation (RAG) pipelines. It combines automated metrics with human-in-the-loop validation to measure retrieval quality, generation quality, and hallucination rates in production RAG systems.
Problem Statement
Evaluating RAG systems is harder than evaluating traditional ML models:
- Multiple components: Evaluation must cover retrieval (did we fetch the right documents?), ranking (are the most relevant documents ordered first?), and generation (does the LLM produce a correct, high-quality answer?)
- Automated metrics fall short: ROUGE and BERTScore catch obvious issues but miss domain-specific nuances. A claim summary can score high on ROUGE yet omit critical coverage exclusions.
- Human labels are essential: Domain experts are needed to validate correctness, but manual labeling is expensive and slow
- Production drift: Metrics computed offline diverge from real-world performance, so continuous monitoring is needed
- No standard toolkit: Every company builds its own evaluation framework, wasting time and expertise
Solution: Comprehensive Evaluation Framework
Built a production-ready evaluation framework that:
- Combines metrics: Automated metrics (ROUGE, BERTScore, custom) + human judgment (labeled datasets)
- Reduces labeling burden: Suggests which samples need human review (uncertainty sampling)
- Monitors production: Continuous evaluation of incoming queries and responses
- Works with any RAG: Framework-agnostic (LangChain, LlamaIndex, custom pipelines)
- Integrates with CI/CD: Pytest fixtures for automated testing
- Is open source: Free, extensible, community-driven
Architecture & Design
Core Concepts
Evaluation Layers:
- Retrieval Evaluation: Did we retrieve the right documents? (Recall@K, precision, NDCG)
- Ranking Evaluation: Is the ranking order correct? (Mean Reciprocal Rank, NDCG)
- Generation Evaluation: Is the generated response good? (ROUGE, BERTScore, custom)
- Hallucination Detection: Does the output contain false claims? (factuality scoring)
- Human Judgment: Expert labels for ground truth validation
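To make the retrieval and ranking layers concrete, here is a minimal sketch of how Recall@K and Mean Reciprocal Rank are typically computed. The helper functions below are illustrative only and are not part of the rag_eval API.

```python
# Illustrative only: standard definitions of Recall@K and MRR for the
# retrieval/ranking layers. Not the rag_eval API.
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)

def mean_reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: gold docs are d1 and d4; the retriever returned d3, d1, d4, d9.
print(recall_at_k(["d3", "d1", "d4", "d9"], {"d1", "d4"}, k=3))      # 1.0
print(mean_reciprocal_rank(["d3", "d1", "d4", "d9"], {"d1", "d4"}))  # 0.5
```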
Key Features
1. Multi-Metric Evaluation
```python
from rag_eval import RAGEvaluator, Metrics

evaluator = RAGEvaluator(
    metrics=[
        Metrics.ROUGE,             # Text similarity
        Metrics.BERTSCORE,         # Semantic similarity
        Metrics.RETRIEVAL_RECALL,  # Did we retrieve the right docs?
        Metrics.HALLUCINATION,     # False claims detection
        Metrics.CONFIDENCE_SCORE,  # Model confidence
    ]
)

results = evaluator.evaluate(
    queries=queries,
    retrieved_docs=retrieved_docs,
    generated_responses=responses,
    ground_truth=gold_answers,
)
# Output: comprehensive metrics + confidence scores
```
2. Human Labeling Integration
- Built-in UI for domain experts to label query-response pairs
- Uncertainty sampling: automatically suggests which samples most need human review
- CSV export/import for integration with other tools
- Version control for label changes and audit trails
3. Production Monitoring
- Continuous evaluation of live queries and responses
- Metric dashboards (accuracy trend, hallucination rate over time)
- Alert thresholds: notify when accuracy drops below threshold
- Integration with CloudWatch, DataDog for observability
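As a rough sketch of the alert-threshold idea, a monitoring hook might track a rolling hallucination rate over recent responses and fire an alert when it crosses the configured threshold. The names below (`on_new_response`, `send_alert`) are placeholders for illustration, not the framework's API.

```python
# Hypothetical monitoring loop: score a sliding window of live responses and
# alert when a metric crosses its threshold. Names are placeholders.
from collections import deque

HALLUCINATION_THRESHOLD = 0.02   # alert if >2% of recent responses hallucinate
WINDOW_SIZE = 500                # rolling window of recent query/response pairs

recent_flags = deque(maxlen=WINDOW_SIZE)

def on_new_response(is_hallucination: bool) -> None:
    """Record one evaluated response and alert if the rolling rate is too high."""
    recent_flags.append(1 if is_hallucination else 0)
    rate = sum(recent_flags) / len(recent_flags)
    if len(recent_flags) == WINDOW_SIZE and rate > HALLUCINATION_THRESHOLD:
        send_alert(f"Hallucination rate {rate:.1%} exceeds {HALLUCINATION_THRESHOLD:.0%}")

def send_alert(message: str) -> None:
    # In practice this would push to CloudWatch, DataDog, Slack, etc.
    print(f"[ALERT] {message}")
```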
4. Pytest Integration
```python
# Use in your CI/CD pipeline
import pytest
from rag_eval import RAGTestCase


class TestRAGPipeline(RAGTestCase):
    def test_accuracy_above_threshold(self):
        """Ensure accuracy stays above 90%."""
        assert self.evaluator.accuracy >= 0.90

    def test_hallucination_rate_below_threshold(self):
        """Hallucination rate must be below 2%."""
        assert self.evaluator.hallucination_rate <= 0.02

    def test_retrieval_recall(self):
        """Top-5 retrieval should have >80% recall."""
        assert self.evaluator.retrieval_recall_at_k(k=5) >= 0.80
```
Why This Matters
The Problem with Automated Metrics Alone
Consider a health insurance claim summary:
Claim: "Member has coverage exclusion for elective knee surgery under plan year 2025. Deductible has been met."
LLM Output: "Member coverage for knee surgery is available. Deductible: $1,500."
Automated Metrics: ROUGE might score this 0.72 (decent similarity). BERTScore might score 0.85 (semantically related).
Human Judgment: ❌ WRONG. The output misses the critical "exclusion" detail. In insurance, missing an exclusion is a critical error.
Lesson: Automated metrics are useful but insufficient for domain-specific quality. Domain experts must validate.
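One way to complement generic metrics with domain knowledge is a custom rule that flags when critical terms in the source are dropped from the output. The check below is a simplified, hypothetical example rather than a built-in framework metric, but it shows why the insurance summary above should fail evaluation even when ROUGE and BERTScore look acceptable.

```python
# Illustrative domain-specific check (not a rag_eval built-in): flag summaries
# that drop critical insurance terms present in the source claim text.
CRITICAL_TERMS = {"exclusion", "exclusions", "denied", "not covered"}

def missing_critical_terms(source: str, summary: str) -> set:
    """Return critical terms that appear in the source but not in the summary."""
    source_words = set(source.lower().split())
    summary_words = set(summary.lower().split())
    return (CRITICAL_TERMS & source_words) - summary_words

claim = "Member has coverage exclusion for elective knee surgery under plan year 2025."
output = "Member coverage for knee surgery is available. Deductible: $1,500."
print(missing_critical_terms(claim, output))  # {'exclusion'}
```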
The Labeling Cost Problem
Manual labeling is expensive:
- ~5 minutes per sample (claim expert review)
- 500 samples × 5 min = ~40 hours (1 week of work)
- Cost: $2K–$5K for high-quality labels
Solution: Uncertainty Sampling
The framework identifies which samples most need human review. Instead of labeling all 500, label the 100 most uncertain. Result: 80% of the insights with 20% of the cost.
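A minimal sketch of what uncertainty sampling looks like in practice (the helper below is illustrative, not the framework's API): score each sample's uncertainty, then spend the labeling budget on the most uncertain ones first.

```python
# Illustrative uncertainty sampling: rank samples by how close the model's
# confidence is to 0.5 and send the most uncertain ones to human review first.
from typing import List, Tuple

def select_for_review(samples: List[Tuple[str, float]], budget: int) -> List[str]:
    """samples: (sample_id, confidence in [0, 1]); returns ids to label first."""
    by_uncertainty = sorted(samples, key=lambda s: abs(s[1] - 0.5))
    return [sample_id for sample_id, _ in by_uncertainty[:budget]]

samples = [("q1", 0.97), ("q2", 0.52), ("q3", 0.48), ("q4", 0.88), ("q5", 0.61)]
print(select_for_review(samples, budget=2))  # ['q2', 'q3'] -- closest to 0.5
```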
Impact & Results
Users & Adoption
- 150+ GitHub stars (open source community)
- 50+ companies using the framework (5% hallucination detection adoption rate)
- Presented at a NeurIPS 2024 workshop on RAG evaluation
Real-World Results
Companies using the framework reported:
- 30% faster evaluation: Combining automated metrics with targeted human labeling is faster than purely manual review
- 50% reduction in hallucinations: Monitoring + guardrails catch issues early
- 20% improvement in production accuracy: Offline evaluation helps identify issues before production deployment
Open Source & Community
GitHub Repository: github.com/AntonGlenbovitch/rag-eval-framework
Features:
- MIT license (free for commercial use)
- Well-documented with examples
- Active maintenance and community contributions
- Pip installable:
pip install rag-eval
Questions About RAG Evaluation?
Want to discuss RAG evaluation strategies, how to measure hallucination, or integration with your pipeline? Let's talk.