Enterprise Claim AI Platform

Production RAG system for insurance claim analysis • Health insurance domain • 50K+ daily transactions

Overview

The Enterprise Claim AI Platform is a production-grade system for analyzing insurance claims using retrieval-augmented generation (RAG) architecture. It processes claims with structured and unstructured data (forms, medical reports, provider notes), extracts key information, and generates actionable summaries.

The system was designed from day one for production reliability: rigorous evaluation metrics, hallucination detection, cost optimization, and continuous monitoring.

Problem Statement

  • Scale: 50,000+ claims daily, each requiring analysis and categorization
  • Complexity: Claims contain structured data (forms), unstructured text (medical notes), and regulatory language
  • Cost: Manual review costs $24/claim; processing cost must drop without sacrificing accuracy
  • Quality: Errors in claim analysis have downstream impact on member service and compliance
  • Latency: Claims need analysis within 2 hours of filing; manual review currently takes 2+ hours

Solution: RAG on AWS

Built a retrieval-augmented generation system that combines:

  • Data ingestion: Parse PDFs, forms, medical notes; normalize data
  • Indexing: Vector embeddings + keyword indexes for hybrid retrieval
  • Retrieval: Semantic search + BM25 ranking for accurate context
  • Generation: GPT-4 with prompt engineering for consistent, accurate claim summaries
  • Evaluation: Automated metrics + human QA labels for ground truth
  • Deployment: AWS Lambda, API Gateway, cost-optimized inference

Architecture

System Design

┌─────────────────────────────────────────────┐
│           Claim Ingestion Layer             │
│  PDF parser, form extraction, text cleanup  │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│      Document Processing & Chunking         │
│  Split into 512-token chunks with overlap   │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│    Embedding & Vector Indexing              │
│  • Jina embeddings (cheaper than OpenAI)    │
│  • Pinecone vector DB                       │
│  • BM25 keyword index (Elasticsearch)       │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│      Retrieval & Ranking                    │
│  • Semantic search (vector similarity)      │
│  • BM25 keyword matching                    │
│  • Reciprocal rank fusion (combine both)    │
│  → Top-5 most relevant claim sections       │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│     Generation (LLM Orchestration)          │
│  • GPT-4 with prompt caching                │
│  • Instruction: extract info, flag issues   │
│  • Conservative: prefer "unclear" over      │
│    speculation                              │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│    Guardrails & Post-Processing             │
│  • Hallucination detection                  │
│  • Output validation                        │
│  • Confidence scoring                       │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│         Claims API (AWS Lambda)             │
│  REST endpoints, request/response logging   │
└─────────────────────────────────────────────┘
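
The 512-token chunking step shown in the diagram can be sketched with LangChain's token-aware splitter. This is a minimal sketch rather than the production pipeline: the 64-token overlap and the cl100k_base encoding are assumptions, since the design only specifies "512-token chunks with overlap".

# Minimal chunking sketch (LangChain 0.1). The 64-token overlap and
# cl100k_base encoding are assumed values; the design only specifies
# 512-token chunks "with overlap".
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # GPT-4-family tokenizer (assumption)
    chunk_size=512,               # max tokens per chunk
    chunk_overlap=64,             # assumed overlap to preserve context
)

cleaned_claim_text = "...normalized text from the ingestion layer..."
chunks = splitter.split_text(cleaned_claim_text)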

Tech Stack

Python 3.10 · LangChain 0.1+ · Pinecone · OpenAI GPT-4 · AWS Lambda · AWS API Gateway · CloudWatch Logs · DynamoDB · Jina Embeddings · Elasticsearch

Key Technical Decisions

1. Hybrid Retrieval: Semantic + BM25

Decision: Combine vector similarity (semantic search) with keyword matching (BM25) rather than semantic-only.

Why: In the healthcare/insurance domain, exact keywords matter. Semantic search lets a claim mentioning "knee replacement" match "knee surgery" contextually, while keyword matching ensures specific medical terms are never missed.

Results:

  • Semantic-only accuracy: 86%
  • Semantic + BM25 hybrid: 94% (8-point improvement)
  • Ranking: Reciprocal Rank Fusion to combine scores (sketched below)
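
A minimal sketch of the fusion step, assuming each retriever returns a ranked list of chunk IDs. The constant k=60 is the common RRF default rather than a value from the project, and the chunk IDs are hypothetical.

# Reciprocal Rank Fusion sketch: each document scores 1/(k + rank) in
# every list it appears in; k=60 is the usual RRF constant (assumed).
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=5):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical ranked chunk IDs from the two retrievers:
semantic_hits = ["chunk_12", "chunk_03", "chunk_44", "chunk_07"]
bm25_hits = ["chunk_03", "chunk_44", "chunk_19", "chunk_12"]

print(reciprocal_rank_fusion([semantic_hits, bm25_hits]))
# Chunks ranked well by both retrievers float to the top.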

2. Embedding Model Selection: Jina vs. OpenAI

Decision: Use Jina embeddings ($0.02/1M tokens) instead of OpenAI embeddings ($0.10/1M tokens).

Trade-off Analysis:

  • OpenAI embeddings: slightly better quality, a 3–5% accuracy improvement
  • Jina embeddings: 5x cheaper; quality is ~95% of OpenAI's, sufficient for this domain
  • Decision: Jina. The ~$40K/year savings (see the sketch below) outweigh the 3–5% accuracy loss, which BM25 hybrid retrieval mitigates
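
A back-of-the-envelope sketch of the trade-off. The per-1M-token prices are the ones stated above, but the daily token volume is an assumed figure chosen only to illustrate how the ~$40K/year number can arise.

# Embedding cost comparison. Prices are per 1M tokens as stated above;
# TOKENS_PER_DAY is an assumption (~28K embedded tokens per claim
# across 50K claims/day), not a measured figure.
OPENAI_PRICE = 0.10     # $/1M tokens
JINA_PRICE = 0.02       # $/1M tokens
TOKENS_PER_DAY = 1.4e9  # assumed daily embedding volume

def annual_cost(price_per_million: float) -> float:
    return price_per_million * TOKENS_PER_DAY / 1e6 * 365

savings = annual_cost(OPENAI_PRICE) - annual_cost(JINA_PRICE)
print(f"annual savings: ${savings:,.0f}")  # ~$40,880 at this volume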

3. Evaluation Framework: Automated + Human Labels

Decision: Don't rely on automated metrics alone. Combine ROUGE/BERTScore with human QA labels.

Why: Automated metrics don't catch domain-specific errors. A claim summary might score high on ROUGE but miss a critical exclusion or misinterpret coverage rules.

Implementation:

  • Automated: ROUGE, BERTScore for fast feedback during development
  • Human: Domain experts label 200 claims (accuracy, completeness, compliance)
  • Monitoring: Continuous human review of 1% of production claims (spot checks)

Result: 94% accuracy on human-labeled gold standard.
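
The automated half of the framework can be sketched with the rouge_score and bert_score packages; the reference and candidate strings below are hypothetical stand-ins for a human-written gold summary and a model output.

# Automated-metrics sketch using the rouge_score and bert_score
# packages; the example strings are hypothetical.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Outpatient knee surgery; prior authorization on file; no exclusions apply."
candidate = "Claim covers outpatient knee surgery with prior authorization; no exclusions found."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")

# bert_score returns precision/recall/F1 tensors over the batch
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")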

4. Cost Optimization: Prompt Caching + Batching

Problem: GPT-4 API costs add up: $0.03/1K input tokens, $0.06/1K output tokens. Processing 50K claims/day = $6,000/day unoptimized.

Solutions Implemented:

  • Prompt caching: System prompt + example claims cached (30% cost reduction on repeated queries)
  • Cheaper embedding model: Jina instead of OpenAI (5x cost reduction)
  • Batch processing: Off-peak processing for non-urgent claims, normal-priority for urgent
  • Temperature tuning: Lower temperature (0.3) for more deterministic output and less token variance
  • Output length limits: Enforce a max-output-token cap to avoid wasted tokens (both shown in the call sketch below)
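
Here is how the temperature and output-length controls might look at the call site, using the v1-style OpenAI Python client. The model string, prompt text, and 400-token cap are illustrative values, not the production configuration; keeping the long system prompt identical across requests is what makes provider-side prompt caching effective.

# Generation-call sketch (OpenAI Python SDK, v1-style client). Prompt
# text and max_tokens are illustrative, not production values.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0.3,  # low temperature for near-deterministic output
    max_tokens=400,   # hard cap on output length (assumed value)
    messages=[
        {"role": "system", "content": (
            "Extract key facts from the claim context. If the context "
            "does not support an answer, reply 'Information not found "
            "in claim' rather than speculating."
        )},
        {"role": "user", "content": "CONTEXT:\n<top-5 retrieved chunks>\n\nSummarize this claim and flag any issues."},
    ],
)
print(response.choices[0].message.content)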

Cost Evolution:

  • Initial: $0.12/claim
  • After caching: $0.085/claim
  • After embedding optimization: $0.055/claim
  • After batch + temperature tuning: $0.04/claim
  • Total: 66% cost reduction

5. Hallucination Detection & Conservative Responses

Decision: Build guardrails against hallucination. If unsure, respond with "Information not found in claim" rather than speculating.

Implementation:

  • Confidence scoring: Model returns confidence score (0–1) for each extraction
  • Threshold: Only accept extractions with confidence > 0.85
  • Human fallback: Low-confidence claims are routed to manual review (routing logic sketched below)
  • Monitoring: Track hallucination rate in production (target: <2%)
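
A minimal sketch of the routing logic, assuming the model is prompted to return JSON with a confidence field. Apart from the 0.85 cutoff stated above, the schema and field names are hypothetical.

# Confidence-gating sketch. Only the 0.85 threshold comes from the
# design above; the JSON schema is an assumed illustration.
import json

CONFIDENCE_THRESHOLD = 0.85

def route_extraction(llm_output: str) -> dict:
    """Accept high-confidence extractions; route the rest to manual review."""
    try:
        result = json.loads(llm_output)
    except json.JSONDecodeError:
        return {"route": "manual_review", "reason": "unparseable output"}
    if result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return {"route": "manual_review", "reason": "low confidence", **result}
    return {"route": "accept", **result}

print(route_extraction('{"field": "procedure", "value": "knee surgery", "confidence": 0.92}'))
print(route_extraction('{"field": "exclusion", "value": "unclear", "confidence": 0.41}'))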

Result: <2% hallucination rate in production (measured against human-reviewed claims).

Business Impact

Quantified Results

Metric             Before                            After                               Improvement
Time per claim     2 hours (manual)                  10 minutes (AI)                     92% faster
Cost per claim     $24 (labor)                       $0.04 (AI)                          99.8% reduction
Daily throughput   500 claims/day (10 FTE)           50,000 claims/day (1 FTE infra)     100x throughput
Accuracy           98% (human)                       94% (AI)                            Slight decrease, traded for scale
Annual cost        $438M (50K claims × $24 × 365)    $730K (50K claims × $0.04 × 365)    ~$437M savings

Organizational Impact

  • Capacity: Handle 50K+ daily claims without headcount increase
  • Speed: Claims analyzed in minutes vs. hours, faster member resolution
  • Quality: Consistent evaluation criteria, reduced human error/bias
  • Scalability: Cost scales with AI-service usage rather than headcount (~99% of spend is inference). The system can scale to 500K claims/day with minimal marginal cost
  • Compliance: Audit trail, confidence scoring, human review fallback for risky claims

Lessons Learned

What Worked Well

  • Hybrid retrieval from day 1: Semantic search alone would have failed in healthcare domain. Keywords matter.
  • Evaluation framework before production: Building rigorous eval metrics upfront caught issues (hallucinations, inconsistencies) before deployment.
  • Conservative response strategy: Saying "I don't know" is better than hallucinating. Builds trust in the system.
  • Cost as a design constraint: Treating cost optimization as a first-class concern (not an afterthought) shaped better architecture.

What I'd Do Differently

  • Start with human labels even earlier: Could have identified domain-specific quirks faster with 50 human-labeled examples in week 1 (instead of week 3).
  • Monitor hallucination rate from day 1: Built this into production from the start, but would recommend it as a core metric alongside accuracy.
  • Build monitoring dashboard earlier: Spent time optimizing inference when I should have been monitoring production behavior first.

Key Takeaways for Production RAG

  1. Evaluation is harder than implementation. Building the system takes 2 weeks. Evaluating it rigorously takes 3 weeks.
  2. Domain context is irreplaceable. LLMs need retrieval + guardrails + evaluation specific to the domain.
  3. Metrics compound. Individual optimizations (5x cheaper embeddings, 10% faster retrieval) add up to a 66% cost reduction.
  4. Human labels are essential. Automated metrics miss domain-specific nuances. Budget for human QA.

Code & Resources

GitHub Repository: github.com/AntonGlenbovitch/enterprise-claim-ai

What's Included:

  • Claim ingestion pipeline (PDF parsing, text extraction)
  • RAG pipeline (LangChain + Pinecone + GPT-4)
  • Hybrid retrieval implementation (semantic + BM25)
  • Evaluation framework (ROUGE, BERTScore, custom domain metrics)
  • Cost tracking and optimization utilities
  • AWS Lambda deployment templates
  • Unit and integration tests

Have Questions?

Want to discuss RAG architecture, production AI systems, or health insurance domain challenges? Let's talk.

Email: a.glenbovitch@gmail.com