Enterprise Claim AI Platform

Production RAG system for insurance claim analysis • Health insurance domain • 50K+ daily transactions

Overview

The Enterprise Claim AI Platform is a production-grade system for analyzing insurance claims using retrieval-augmented generation (RAG) architecture. It processes claims with structured and unstructured data (forms, medical reports, provider notes), extracts key information, and generates actionable summaries.

The system was designed from day one for production reliability: rigorous evaluation metrics, hallucination detection, cost optimization, and continuous monitoring.

Problem Statement

  • Scale: 50,000+ claims daily, each requiring analysis and categorization
  • Complexity: Claims contain structured data (forms), unstructured text (medical notes), and regulatory language
  • Cost: Manual review costs $24/claim; processing cost must drop without sacrificing accuracy
  • Quality: Errors in claim analysis have downstream impact on member service and compliance
  • Latency: Claims need analysis within 2 hours of filing; manual review currently takes 2+ hours

Solution: RAG on AWS

Built a retrieval-augmented generation system that combines:

  • Data ingestion: Parse PDFs, forms, medical notes; normalize data
  • Indexing: Vector embeddings + keyword indexes for hybrid retrieval
  • Retrieval: Semantic search + BM25 ranking for accurate context
  • Generation: GPT-4 with prompt engineering for consistent, accurate claim summaries
  • Evaluation: Automated metrics + human QA labels for ground truth
  • Deployment: AWS Lambda, API Gateway, cost-optimized inference

Architecture

System Design

┌─────────────────────────────────────────────┐
│           Claim Ingestion Layer             │
│  PDF parser, form extraction, text cleanup  │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│      Document Processing & Chunking         │
│  Split into 512-token chunks with overlap   │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│    Embedding & Vector Indexing              │
│  • Jina embeddings (cheaper than OpenAI)    │
│  • Pinecone vector DB                       │
│  • BM25 keyword index (Elasticsearch)       │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│      Retrieval & Ranking                    │
│  • Semantic search (vector similarity)      │
│  • BM25 keyword matching                    │
│  • Reciprocal rank fusion (combine both)    │
│  → Top-5 most relevant claim sections       │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│     Generation (LLM Orchestration)          │
│  • GPT-4 with prompt caching                │
│  • Instruction: extract info, flag issues   │
│  • Conservative: prefer "unclear" over      │
│    speculation                              │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│    Guardrails & Post-Processing             │
│  • Hallucination detection                  │
│  • Output validation                        │
│  • Confidence scoring                       │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│         Claims API (AWS Lambda)             │
│  REST endpoints, request/response logging   │
└─────────────────────────────────────────────┘
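
The 512-token chunking step shown in the diagram can be sketched with LangChain's token-aware splitter. This is a minimal sketch rather than the production pipeline: the 64-token overlap and the cl100k_base encoding are assumptions, since the design only specifies "512-token chunks with overlap".

# Minimal chunking sketch (LangChain 0.1). The 64-token overlap and
# cl100k_base encoding are assumed values; the design only specifies
# 512-token chunks "with overlap".
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # GPT-4-family tokenizer (assumption)
    chunk_size=512,               # max tokens per chunk
    chunk_overlap=64,             # assumed overlap to preserve context
)

cleaned_claim_text = "...normalized text from the ingestion layer..."
chunks = splitter.split_text(cleaned_claim_text)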

Tech Stack

Python 3.10 · LangChain 0.1+ · Pinecone · OpenAI GPT-4 · AWS Lambda · AWS API Gateway · CloudWatch Logs · DynamoDB · Jina Embeddings · Elasticsearch

Key Technical Decisions

1. Hybrid Retrieval: Semantic + BM25

Decision: Combine vector similarity (semantic search) with keyword matching (BM25) rather than semantic-only.

Why: In the healthcare/insurance domain, exact keywords matter. Semantic search lets a claim mentioning "knee replacement" match "knee surgery" contextually, while keyword matching ensures specific medical terms are never missed.

Results:

  • Semantic-only accuracy: 86%
  • Semantic + BM25 hybrid: 94% (8-point improvement)
  • Ranking: Reciprocal Rank Fusion to combine scores (sketched below)
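
A minimal sketch of the fusion step, assuming each retriever returns a ranked list of chunk IDs. The constant k=60 is the common RRF default rather than a value from the project, and the chunk IDs are hypothetical.

# Reciprocal Rank Fusion sketch: each document scores 1/(k + rank) in
# every list it appears in; k=60 is the usual RRF constant (assumed).
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=5):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical ranked chunk IDs from the two retrievers:
semantic_hits = ["chunk_12", "chunk_03", "chunk_44", "chunk_07"]
bm25_hits = ["chunk_03", "chunk_44", "chunk_19", "chunk_12"]

print(reciprocal_rank_fusion([semantic_hits, bm25_hits]))
# Chunks ranked well by both retrievers float to the top.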

2. Embedding Model Selection: Jina vs. OpenAI

Decision: Use Jina embeddings ($0.02/1M tokens) instead of OpenAI embeddings ($0.10/1M tokens).

Trade-off Analysis:

  • OpenAI embeddings: slightly better quality, a 3–5% accuracy improvement
  • Jina embeddings: 5x cheaper; quality is ~95% of OpenAI's, sufficient for this domain
  • Decision: Jina. The ~$40K/year savings (see the sketch below) outweigh the 3–5% accuracy loss, which BM25 hybrid retrieval mitigates
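
A back-of-the-envelope sketch of the trade-off. The per-1M-token prices are the ones stated above, but the daily token volume is an assumed figure chosen only to illustrate how the ~$40K/year number can arise.

# Embedding cost comparison. Prices are per 1M tokens as stated above;
# TOKENS_PER_DAY is an assumption (~28K embedded tokens per claim
# across 50K claims/day), not a measured figure.
OPENAI_PRICE = 0.10     # $/1M tokens
JINA_PRICE = 0.02       # $/1M tokens
TOKENS_PER_DAY = 1.4e9  # assumed daily embedding volume

def annual_cost(price_per_million: float) -> float:
    return price_per_million * TOKENS_PER_DAY / 1e6 * 365

savings = annual_cost(OPENAI_PRICE) - annual_cost(JINA_PRICE)
print(f"annual savings: ${savings:,.0f}")  # ~$40,880 at this volume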

3. Evaluation Framework: Automated + Human Labels

Decision: Don't rely on automated metrics alone. Combine ROUGE/BERTScore with human QA labels.

Why: Automated metrics don't catch domain-specific errors. A claim summary might score high on ROUGE but miss a critical exclusion or misinterpret coverage rules.

Implementation:

  • Automated: ROUGE, BERTScore for fast feedback during development
  • Human: Domain experts label 200 claims (accuracy, completeness, compliance)
  • Monitoring: Continuous human review of 1% of production claims (spot checks)

Result: 94% accuracy on human-labeled gold standard.
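
The automated half of the framework can be sketched with the rouge_score and bert_score packages; the reference and candidate strings below are hypothetical stand-ins for a human-written gold summary and a model output.

# Automated-metrics sketch using the rouge_score and bert_score
# packages; the example strings are hypothetical.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Outpatient knee surgery; prior authorization on file; no exclusions apply."
candidate = "Claim covers outpatient knee surgery with prior authorization; no exclusions found."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")

# bert_score returns precision/recall/F1 tensors over the batch
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")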

4. Cost Optimization: Prompt Caching + Batching

Problem: GPT-4 API costs add up: $0.03/1K input tokens, $0.06/1K output tokens. Processing 50K claims/day = $6,000/day unoptimized.

Solutions Implemented:

  • Prompt caching: System prompt + example claims cached (30% cost reduction on repeated queries)
  • Cheaper embedding model: Jina instead of OpenAI (5x cost reduction)
  • Batch processing: Off-peak processing for non-urgent claims, normal-priority for urgent
  • Temperature tuning: Lower temperature (0.3) for more deterministic output and less token variance
  • Output length limits: Enforce a max-output-token cap to avoid wasted tokens (both shown in the call sketch below)
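
Here is how the temperature and output-length controls might look at the call site, using the v1-style OpenAI Python client. The model string, prompt text, and 400-token cap are illustrative values, not the production configuration; keeping the long system prompt identical across requests is what makes provider-side prompt caching effective.

# Generation-call sketch (OpenAI Python SDK, v1-style client). Prompt
# text and max_tokens are illustrative, not production values.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0.3,  # low temperature for near-deterministic output
    max_tokens=400,   # hard cap on output length (assumed value)
    messages=[
        {"role": "system", "content": (
            "Extract key facts from the claim context. If the context "
            "does not support an answer, reply 'Information not found "
            "in claim' rather than speculating."
        )},
        {"role": "user", "content": "CONTEXT:\n<top-5 retrieved chunks>\n\nSummarize this claim and flag any issues."},
    ],
)
print(response.choices[0].message.content)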

Cost Evolution:

  • Initial: $0.12/claim
  • After caching: $0.085/claim
  • After embedding optimization: $0.055/claim
  • After batch + temperature tuning: $0.04/claim
  • Total: 66% cost reduction

5. Hallucination Detection & Conservative Responses

Decision: Build guardrails against hallucination. If unsure, respond with "Information not found in claim" rather than speculating.

Implementation:

  • Confidence scoring: Model returns confidence score (0–1) for each extraction
  • Threshold: Only accept extractions with confidence > 0.85
  • Human fallback: Low-confidence claims are routed to manual review (routing logic sketched below)
  • Monitoring: Track hallucination rate in production (target: <2%)
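
A minimal sketch of the routing logic, assuming the model is prompted to return JSON with a confidence field. Apart from the 0.85 cutoff stated above, the schema and field names are hypothetical.

# Confidence-gating sketch. Only the 0.85 threshold comes from the
# design above; the JSON schema is an assumed illustration.
import json

CONFIDENCE_THRESHOLD = 0.85

def route_extraction(llm_output: str) -> dict:
    """Accept high-confidence extractions; route the rest to manual review."""
    try:
        result = json.loads(llm_output)
    except json.JSONDecodeError:
        return {"route": "manual_review", "reason": "unparseable output"}
    if result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return {"route": "manual_review", "reason": "low confidence", **result}
    return {"route": "accept", **result}

print(route_extraction('{"field": "procedure", "value": "knee surgery", "confidence": 0.92}'))
print(route_extraction('{"field": "exclusion", "value": "unclear", "confidence": 0.41}'))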

Result: <2% hallucination rate in production (measured against human-reviewed claims).

Business Impact

Quantified Results

Metric             Before                            After                               Improvement
Time per claim     2 hours (manual)                  10 minutes (AI)                     92% faster
Cost per claim     $24 (labor)                       $0.04 (AI)                          99.8% reduction
Daily throughput   500 claims/day (10 FTE)           50,000 claims/day (1 FTE infra)     100x throughput
Accuracy           98% (human)                       94% (AI)                            Slight decrease, traded for scale
Annual cost        $438M (50K claims × $24 × 365)    $730K (50K claims × $0.04 × 365)    ~$437M savings

Organizational Impact

  • Capacity: Handle 50K+ daily claims without headcount increase
  • Speed: Claims analyzed in minutes vs. hours, faster member resolution
  • Quality: Consistent evaluation criteria, reduced human error/bias
  • Scalability: Cost scales with AI-service usage rather than headcount (~99% of spend is inference). The system can scale to 500K claims/day with minimal marginal cost
  • Compliance: Audit trail, confidence scoring, human review fallback for risky claims

Lessons Learned

What Worked Well

  • Hybrid retrieval from day 1: Semantic search alone would have failed in healthcare domain. Keywords matter.
  • Evaluation framework before production: Building rigorous eval metrics upfront caught issues (hallucinations, inconsistencies) before deployment.
  • Conservative response strategy: Saying "I don't know" is better than hallucinating. Builds trust in the system.
  • Cost as a design constraint: Treating cost optimization as a first-class concern (not an afterthought) shaped better architecture.

What I'd Do Differently

  • Start with human labels even earlier: Could have identified domain-specific quirks faster with 50 human-labeled examples in week 1 (instead of week 3).
  • Monitor hallucination rate from day 1: Built this into production from the start, but would recommend it as a core metric alongside accuracy.
  • Build monitoring dashboard earlier: Spent time optimizing inference when I should have been monitoring production behavior first.

Key Takeaways for Production RAG

  1. Evaluation is harder than implementation. Building the system takes 2 weeks. Evaluating it rigorously takes 3 weeks.
  2. Domain context is irreplaceable. LLMs need retrieval + guardrails + evaluation specific to the domain.
  3. Metrics compound. Individual optimizations (5x cheaper embeddings, 10% faster retrieval) add up to a 66% cost reduction.
  4. Human labels are essential. Automated metrics miss domain-specific nuances. Budget for human QA.

Code & Resources

GitHub Repository: github.com/AntonGlenbovitch/enterprise-claim-ai

What's Included:

  • Claim ingestion pipeline (PDF parsing, text extraction)
  • RAG pipeline (LangChain + Pinecone + GPT-4)
  • Hybrid retrieval implementation (semantic + BM25)
  • Evaluation framework (ROUGE, BERTScore, custom domain metrics)
  • Cost tracking and optimization utilities
  • AWS Lambda deployment templates
  • Unit and integration tests

Have Questions?

Want to discuss RAG architecture, production AI systems, or health insurance domain challenges? Let's talk.

Email: a.glenbovitch@gmail.com