RAGAS, DeepEval, LettuceDetect: Why Standard RAG Evaluation Frameworks Are Blind to Corpus-Level Failures
RAGAS, DeepEval, LettuceDetect measure fidelity to what was retrieved — not the reliability of your corpus. The blind spot in your RAG evaluation.
By 2026, a consensus has settled in AI engineering teams: any serious RAG pipeline deployment should include structured evaluation. Measurement frameworks such as RAGAS, DeepEval, TruLens, and LettuceDetect have proliferated, teams have adopted them, and faithfulness dashboards have become standard fixtures. Yet an analysis aggregating nine studies on enterprise RAG highlights a structural finding: according to a 2025 paper by researchers at Renmin University and Tencent, 67% of hallucination instances in conversational RAG systems are extractive (ragaboutit.com, June 10, 2026). In the majority of cases, then, the model is not fabricating: it is faithfully reproducing incorrect content from the retrieved corpus. The pipeline mechanics are sound. The corpus is not. This distinction points to a question few teams ask themselves: can the tools we use to evaluate our pipeline actually detect problems that originate in the corpus itself?
What RAGAS, DeepEval, and LettuceDetect Measure — and What They Assume
The leading RAG evaluation frameworks share a common architectural logic: they measure pipeline quality conditional on the retrieved documents. In other words, they assess whether the system worked well with what it was given — they do not question what it was given.
RAGAS, introduced in 2023 by Shahul Es et al. (arXiv:2309.15217), is commonly used through four core metrics: faithfulness (is each claim in the answer attributable to retrieved passages?), answer relevance (does the answer address the question?), context precision (are retrieved passages relevant?), and context recall (were the essential passages retrieved?). DeepEval, from Confident AI, extends this scope with consistency, toxicity, and bias metrics, but operates on the same logic: the reference is the injected context, not external ground truth.
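As a rough illustration of the faithfulness definition above, the sketch below computes the share of answer claims supported by at least one retrieved passage. The function names, the substring-based support check, and the sample passage are assumptions made for illustration; RAGAS itself relies on an LLM to extract and verify claims, which this toy check does not reproduce.

```python
def faithfulness(claims, passages, is_supported):
    """Fraction of answer claims supported by at least one retrieved passage."""
    if not claims:
        return 0.0
    supported = sum(
        1
        for claim in claims
        if any(is_supported(claim, passage) for passage in passages)
    )
    return supported / len(claims)


def substring_support(claim, passage):
    # Toy stand-in for the LLM judgment a real framework would use.
    return claim.lower() in passage.lower()


# Hypothetical retrieved context and answer claims.
passages = [
    "Remote work is capped at 2 days per week. Requests go through the manager."
]
claims = [
    "remote work is capped at 2 days per week",  # present in the passage
    "requests go through HR",                    # absent from the passage
]

score = faithfulness(claims, passages, substring_support)
print(score)  # 0.5
```

Note that nothing in this computation looks outside `passages`: the score depends only on the answer and the injected context.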
LettuceDetect takes a different approach: it uses a ModernBERT-based token classifier, in the spirit of Natural Language Inference (NLI), to check whether the claims in an answer are supported by the retrieved passages. This is a robust mechanism: it can detect, for instance, that an answer cites a percentage the source document does not contain. TruLens proposes the “RAG Triad” of groundedness, answer relevance, and context relevance. The posture is the same: the retrieved corpus is the reference.
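The percentage example above can be illustrated with a deliberately simple check. This is not LettuceDetect's API or model: it is a regex-based sketch, with a hypothetical function name and sample passage, that only flags figures cited in an answer but absent from every retrieved passage.

```python
import re

# Matches integers or decimals, optionally followed by a percent sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def unsupported_figures(answer, passages):
    """Return figures cited in the answer that appear in none of the passages."""
    known = set()
    for passage in passages:
        known.update(NUMBER.findall(passage))
    return [figure for figure in NUMBER.findall(answer) if figure not in known]


# Hypothetical retrieved passage.
passages = ["Employer pension contributions are 4.5% of gross salary."]

print(unsupported_figures("The employer contributes 4.5%.", passages))  # []
print(unsupported_figures("The employer contributes 6%.", passages))    # ['6%']
```

The second call flags `6%` because the figure is missing from the context. If the passage itself stated an outdated rate, an answer repeating it would pass this check unchanged.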
The shared assumption is invisible because it is reasonable: if a document is in your corpus, it was judged worthy of being there. What these frameworks cannot measure is whether that assumption holds.
The Common Gap: Pipeline Metrics Assume a Coherent Corpus
Consider a concrete scenario. A large financial organization deploys a RAG assistant over its HR policy repository — several hundred documents distributed across SharePoint, Confluence, and a legacy document management system. After three weeks of testing, its RAGAS faithfulness score is 0.97. Satisfied, the team moves to production.
What RAGAS measured: across 200 test questions, 97% of the claims in generated answers were attributable to at least one retrieved passage. That is correct. And it is incomplete.
What RAGAS did not measure: the coherence of the corpus itself. A diagnostic exercise conducted on a comparable enterprise document repository found that roughly one in eight anomalies involves two formally valid documents that contradict each other on a threshold, a date, or a rate: for example, an April 2025 policy note and an October 2025 update that both apply, with different values for the same rule. When the RAG agent retrieves the April version, it answers faithfully to that document. It scores 1.0 on that evaluation point. And it gives the wrong answer.
Sinequa's June 2026 survey of 740 executives points in the same direction: 38.4% of organizations identify “data that fails to update” as the primary cause of RAG deployment failure (Sinequa, Beyond the Hype: The Reality of Enterprise Agentic AI in 2026, June 2, 2026). That 38.4% will never surface in a RAGAS score, because RAGAS measures the pipeline, not the corpus.
Faithfulness at 97% on a Contradictory Corpus: How a Score Can Be Correct and Misleading at the Same Time
The tension is formally resolvable: a RAGAS score of 0.97 can be rigorously accurate and indicate that a system is correctly answering questions with outdated or contradictory information. This is not a flaw in RAGAS — it is the logical consequence of a faithfulness definition that cannot extend beyond the injected context.
Consider the most common scenario we observe in enterprise document corpora: two versions of the same internal regulation coexist, both indexed, with different update dates but no explicit invalidation of the earlier version. A hybrid retriever may return either one, depending on the query. RAGAS then asks: “Is the claim in the answer supported by what was retrieved?” The answer is yes in both cases. Score: 1.0 in both cases. Result for the end user: two potentially contradictory answers.
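The scenario can be reproduced in a few lines. The sketch below is a toy under stated assumptions: the document IDs, the allowance values, the verbatim "generator", and the sentence-level faithfulness check are all hypothetical and stand in for a real retriever, LLM, and evaluation framework.

```python
import re

# Two indexed versions of the same rule, neither marked as superseded
# (hypothetical IDs and values).
CORPUS = {
    "policy_note_2025_04": "The daily meal allowance is 18 EUR.",
    "policy_update_2025_10": "The daily meal allowance is 21 EUR.",
}


def answer_from(doc_id):
    # Stand-in for a generator that restates the retrieved passage verbatim.
    return CORPUS[doc_id]


def faithfulness(answer, context):
    # Toy check: share of answer sentences found verbatim in the context.
    claims = [s.strip() for s in re.split(r"(?<=\.)\s+", answer) if s.strip()]
    return sum(claim in context for claim in claims) / len(claims)


results = {}
for doc_id, passage in CORPUS.items():
    answer = answer_from(doc_id)
    results[doc_id] = (answer, faithfulness(answer, passage))

scores = {score for _, score in results.values()}
answers = {answer for answer, _ in results.values()}

print(scores)        # {1.0}
print(len(answers))  # 2
```

Both retrieval outcomes earn a perfect score, yet the two answers give different values for the same rule. The disagreement only becomes visible when the answers are compared with each other, which a per-answer faithfulness metric never does.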
LettuceDetect, with its NLI approach, is even more precise in handling this case: if the retrieved document contains the old value and the answer correctly cites that value, there is no detectable hallucination. The problem is not in the pipeline — it is in the corpus.
Gabriel Anhaia draws the same line in his April 2026 analysis: 70 to 80% of enterprise RAG deployments never reach stable production, and in 73% of cases the failure is retriever-side. But “retriever” here includes the quality of what there is to retrieve, not only the retrieval mechanics (dev.to/gabrielanhaia, April 26, 2026).
Three Classes of Corpus-Level Failures That Standard Evaluation Cannot See
Enterprise document corpus work consistently surfaces three families of problems that RAGAS, DeepEval, and LettuceDetect structurally cannot detect:
1. Cross-document contradictions. Two documents covering the same topic provide incompatible information. The retriever may return either one depending on the query vector. Both cases produce a perfect faithfulness score, but the final answers contradict each other. This failure class was examined in more depth in our R&D Note of May 27, 2026 on RAG failure modes — but its relevance to the evaluation layer remains underexplored.
2. Unmarked obsolescence. A document exists, is indexed, is retrievable — but it describes a state of the world that has since changed. The content is not wrong in its original context, but it is stale in the context of a question asked today. RAGAS cannot know this: it has no access to external ground truth. A faithfulness score of 1.0 coexists perfectly with an answer built on a policy revised twelve months ago.
3. Divergent duplicates. The same document exists in two slightly different versions in the corpus (one imported from SharePoint, one manually copied into Confluence), with a value that differs between the two versions. The retriever returns one or the other depending on how the query is phrased. The system therefore gives inconsistent answers to questions that should have a single answer, varying with the wording of the query rather than with the facts. No standard RAG evaluation metric measures the response variance that arises from divergent duplicates in the corpus.
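To make the three classes concrete, here is a minimal sketch of what corpus-level checks might look like. Everything in it is an assumption for illustration: the `Doc` schema (which presumes each document has already been reduced to a rule key and a value), the 365-day staleness window, the 0.9 similarity threshold, and the sample documents. It is not a description of any vendor's audit method, and pairwise `SequenceMatcher` comparison would not scale to a real corpus.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class Doc:
    doc_id: str
    rule: str            # hypothetical key naming the rule the document states
    value: str           # threshold, date, or rate given for that rule
    last_reviewed: date
    text: str


def contradictions(docs):
    """Class 1: pairs giving different values for the same rule."""
    return [(a.doc_id, b.doc_id) for a, b in combinations(docs, 2)
            if a.rule == b.rule and a.value != b.value]


def stale(docs, as_of, max_age_days=365):
    """Class 2: documents not reviewed within the allowed window."""
    return [d.doc_id for d in docs
            if (as_of - d.last_reviewed).days > max_age_days]


def divergent_duplicates(docs, threshold=0.9):
    """Class 3: near-identical texts that are not exactly identical."""
    pairs = []
    for a, b in combinations(docs, 2):
        ratio = SequenceMatcher(None, a.text, b.text).ratio()
        if threshold <= ratio < 1.0:
            pairs.append((a.doc_id, b.doc_id))
    return pairs


docs = [
    Doc("sp_travel", "meal_allowance", "18 EUR", date(2025, 9, 1),
        "Travel policy: the daily meal allowance is 18 EUR for domestic trips."),
    Doc("cf_travel", "meal_allowance", "21 EUR", date(2025, 10, 1),
        "Travel policy: the daily meal allowance is 21 EUR for domestic trips."),
    Doc("badge", "badge_renewal", "24 months", date(2023, 1, 15),
        "Access badges must be renewed every 24 months."),
]

print(contradictions(docs))               # [('sp_travel', 'cf_travel')]
print(stale(docs, as_of=date(2026, 6, 1)))  # ['badge']
print(divergent_duplicates(docs))         # [('sp_travel', 'cf_travel')]
```

None of these checks needs a question, an answer, or a retriever: they run on the corpus alone, which is why they sit outside the scope of pipeline metrics.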
These three classes correspond to three of the six audit axes that K-AI instruments before any AI deployment (see our 6-axis audit methodology, May 15, 2026).
Corpus Audit as Step Zero Before Evaluating Your RAG Pipeline
The practical conclusion is not that RAGAS, DeepEval, or LettuceDetect are flawed tools — they are well-designed tools that do what they claim. The conclusion is that their use presupposes a preliminary step that most teams have not yet formalized: verifying that the corpus they are applied to is itself coherent.
The correct sequence should be:
Step 0 — Corpus audit: mapping of cross-document contradictions, detection of divergent duplicates, identification of unmarked obsolete content, thematic coverage measurement, traceability verification.
Step 1 — Pipeline configuration: chunking, embedding, hybrid retrieval strategy, reranking.
Step 2 — Pipeline evaluation: RAGAS, DeepEval, LettuceDetect — measuring faithfulness, relevance, and consistency.
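One way to enforce this ordering is to make Step 0 a gate in front of the other two. The sketch below assumes hypothetical callables for each step and a simple `AuditReport` structure covering only three of the audit dimensions listed in Step 0. Treating any finding as blocking is a design choice made here for simplicity, not a rule taken from the frameworks.

```python
from dataclasses import dataclass, field


@dataclass
class AuditReport:
    contradictions: list = field(default_factory=list)
    divergent_duplicates: list = field(default_factory=list)
    unmarked_obsolete: list = field(default_factory=list)

    @property
    def blocking_findings(self):
        return (len(self.contradictions)
                + len(self.divergent_duplicates)
                + len(self.unmarked_obsolete))


def run_sequence(audit_corpus, configure_pipeline, evaluate_pipeline):
    """Run Step 0 first; run Steps 1 and 2 only if the audit is clean."""
    report = audit_corpus()                      # Step 0
    if report.blocking_findings:
        return {"status": "blocked_at_step_0",
                "findings": report.blocking_findings}
    pipeline = configure_pipeline()              # Step 1
    scores = evaluate_pipeline(pipeline)         # Step 2
    return {"status": "evaluated", "scores": scores}


# Hypothetical stubs standing in for real audit, build, and evaluation code.
def dirty_audit():
    return AuditReport(contradictions=[("note_2025_04", "update_2025_10")])


def clean_audit():
    return AuditReport()


def configure():
    return "pipeline"


def evaluate(pipeline):
    return {"faithfulness": 0.97}


blocked = run_sequence(dirty_audit, configure, evaluate)
passed = run_sequence(clean_audit, configure, evaluate)
print(blocked)  # {'status': 'blocked_at_step_0', 'findings': 1}
print(passed)   # {'status': 'evaluated', 'scores': {'faithfulness': 0.97}}
```

With the gate in place, a faithfulness score is only ever reported for a corpus that has passed the audit, so a 0.97 cannot be read without knowing what it was computed on.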
In practice, teams skip Step 0 because it is not tooled by the evaluation frameworks themselves. RAGAS will not tell you to audit before running it — and rightly so, that is not its role. But the absence of built-in tooling does not make the step optional.
The clearest signal of this gap comes from the market itself: one year after the publication of the foundational RAGOps paper (arXiv:2506.03401, Xu et al., CSIRO Data61 / TU Munich, 2025), which explicitly defines continuous corpus management as a structural component of RAGOps, no evaluation framework vendor has, to our knowledge, integrated corpus health metrics into its standard scope. This is not an oversight: it is a boundary of responsibility. Teams need to cross it themselves.
K-AI already works with CMA CGM, Veolia, PwC, BNP Paribas, TotalEnergies, and CEVA Logistics. Partners: AWS, Snowflake, Microsoft, Wavestone, Devoteam.
Frequently Asked Questions
How do you audit a RAG pipeline to detect incorrect or fabricated sources?
Auditing a RAG pipeline requires two distinct phases. Upstream, a corpus audit verifies the internal consistency of the corpus: cross-document contradictions, divergent duplicates, unmarked obsolete content, thematic coverage, traceability. Downstream, pipeline evaluation frameworks (RAGAS, DeepEval, LettuceDetect) measure the fidelity of answers to retrieved passages. Most teams only perform the downstream audit — which is necessary but insufficient. An answer can be faithful to an outdated document and achieve a perfect score while being factually wrong in the organization’s current context.
Why does my RAG system hallucinate despite using reliable internal documents?
When hallucinations persist in a well-configured RAG pipeline, the cause is rarely the model or the retrieval mechanics. It lies in the corpus: two versions of a document coexist with contradictory values, or a reference document was not updated when the policy it describes changed. The user asks a question, the retriever returns the most semantically similar document (which may be the outdated version), the LLM answers faithfully to that document — and RAGAS validates the response with a high score. The pipeline chain is sound. The corpus is not.
What is the difference between a high RAGAS score and a reliable RAG in production?
RAGAS measures the internal consistency of the pipeline: are the claims in the answer attributable to the retrieved context? A RAGAS faithfulness score of 0.97 means that 97% of claims are supported by the passages injected into context. But it says nothing about the external truth of the passages themselves. A reliable RAG in production requires two conditions: a pipeline that answers faithfully to what it retrieves (measured by RAGAS) and a corpus whose documents are consistent, current, and free of contradictions (measured by a corpus audit).
Is LettuceDetect more effective than RAGAS for detecting enterprise RAG hallucinations?
LettuceDetect and RAGAS measure slightly different things. LettuceDetect, based on ModernBERT and Natural Language Inference, precisely detects whether a claim is entailed by the source passages — it is particularly well-suited to detecting misattributed figures or dates. RAGAS takes a broader approach across the full answer. But both share the same structural limitation: they evaluate fidelity to the retrieved context. If the retrieved document itself contains an error or contradictory information, neither RAGAS nor LettuceDetect can detect it — that is the scope of the upstream corpus audit.
Should you audit your document corpus before connecting a RAG to your internal data?
Yes. A corpus audit is Step Zero before any serious enterprise RAG deployment, for three reasons. First, pipeline evaluation frameworks (RAGAS, DeepEval, LettuceDetect) assume a coherent corpus: they measure quality conditional on what was retrieved. Second, failures arising from contradictions, divergent duplicates, or obsolete content are not visible in standard pipeline metrics. Third, in regulated environments, a traceable corpus audit can contribute to the technical documentation required under EU AI Act Article 11 (Article 12 covers record-keeping).
Go Further
If you want to map the documentary failures in your corpus before your next RAG evaluation cycle, we conduct an initial diagnostic within 48 hours. Contact us at contact@k-ai.ai.
Sources Cited
- Shahul Es et al., RAGAS: Automated Evaluation of Retrieval Augmented Generation, arXiv:2309.15217, September 2023 — https://arxiv.org/abs/2309.15217
- Gabriel Anhaia, 70% of Enterprise RAG Deployments Fail Before Production, dev.to, April 26, 2026 — https://dev.to/gabrielanhaia/70-of-enterprise-rag-deployments-fail-before-production-heres-what-kills-them-26ml
- Sinequa, Beyond the Hype: The Reality of Enterprise Agentic AI in 2026, June 2, 2026 — https://www.sinequa.com/resources/blog/beyond-the-hype-the-reality-of-enterprise-agentic-ai-in-2026/
- Xu et al. (CSIRO Data61 / TU Munich), RAGOps: Operating and Managing Retrieval-Augmented Generation Pipelines, arXiv:2506.03401, 2025 — https://arxiv.org/abs/2506.03401
- ragaboutit.com (David Richards), 9 RAG Benchmarks Prove 67% Hallucination Still Ships, June 10, 2026 — https://ragaboutit.com/9-rag-benchmarks-prove-67-hallucination-still-ships/ — cites: Renmin University / Tencent, On the Hallucination in Conversational RAG, 2025
- Vectara, Introducing the Next Generation of Vectara’s Hallucination Leaderboard, November 2025 — https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard
Related Posts
- RAG Doesn’t Solve Hallucination — It Moves It: The Cross-Source Contradiction Nobody Talks About (May 27, 2026)
- Auditing an Enterprise Document Corpus for AI — K-AI’s 6-Axis Method (May 15, 2026)
- RAGOps: The Data Half Nobody Operates — SLIs, SLOs, and a Control Plane for Corpus Health (June 5, 2026)
