You think your RAG hallucinates because of the embedding model? Look at your corpus.
Pinecone has just admitted the model is no longer the bottleneck in enterprise RAG. Three numbers point to a different culprit: corpus rot.
On 4 May 2026, Pinecone — the company that made RAG global infrastructure — published an unusual post. The title: “Better Models Won’t Save Your Agent.” The sentence that stopped me: “the bottleneck is no longer the model, it’s getting agents grounded in the right knowledge” (Pinecone, 4 May 2026, covered by The New Stack). In two sentences, the vendor that has sold more vector retrieval than anyone else publicly states that the bottleneck is no longer where the industry has been looking for three years. It is a rare admission. And it calls for a follow-up no one is yet making: if the bottleneck is no longer the model, and not even the vector, then where has it gone? Let me say what I see every week in customer engagements, and what the numbers released this month validate without quite framing it that way.
On 4 May, Pinecone admitted something unusual
Pinecone did more than announce a product — Nexus, which they describe as a knowledge engine rather than a retrieval system (Pinecone). Pinecone admitted that the vector layer they built their business on no longer suffices the moment you shift from experimental chatbot to enterprise agent in production. The industry response came in cascade. On 12 May, Glean published its Agent Development Lifecycle, a seven-stage framework to govern agent fleets (MarTech Series, 12 May 2026). In its Pulse Q1 2026, VentureBeat documented a clear investment pivot: retrieval optimization became the top RAG investment line at 28.9%, ahead of evaluation and ahead of embedding, and intent to adopt hybrid retrieval tripled in a single quarter, from 10.3% to 33.3% (VentureBeat). Three vendors, three timeframes, one underlying motion: stop patching the model, rebuild the layer underneath.
But none of these players closes the loop. Pinecone shifts the problem to knowledge grounding and proposes a new compiled-context infrastructure. Glean shifts it to agent governance and proposes a lifecycle framework. VentureBeat shifts it to retrieval and watches the budgets move. Nowhere does anyone say what I have been telling the teams I work with for the past three years: your RAG is not hallucinating because you retrieve badly. It is hallucinating because you retrieve very well — from rotten content.
Three numbers no benchmark ever crosses together
Three public measurements. Taken separately, they say little. Taken together, they tell a story I have not seen written anywhere.
First number — 31%. On a naive enterprise RAG stack — standard pipeline with simple vector retrieval, noisy or partial context — an audit relayed by Anthropic documents 31% of responses containing claims unsupported by the corpus. The same study shows that more mature architectures cut that rate: by 43% with a Constitutional RAG, by 58% with agentic self-correction (generation RAG, 2026). The usual reading: “we need to stack more architectural layers.” The reading that interests me: “there is a 31% residual defect on a typical enterprise corpus before we even touch the model.”
Second number — above 10%. Vectara re-released its Hallucination Leaderboard in 2026, but on a dataset three times larger and significantly harder: 7,700 long articles, up to 32,000 tokens, drawn from law, medicine, finance, technology, education, sports, and news (Vectara, 2026). On this more demanding benchmark, the frontier models — GPT-5, Claude Sonnet 4.5, Grok-4, Gemini-3-Pro — all exceed 10% hallucination rates. Not 1% or 2%. Ten times that. And measured on a clean corpus. In customer engagements, corpora are anything but.
Third number — minus 30%. Chroma Research, in its Context Rot study, tested 18 frontier models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3) on long contexts. Conclusion: no model consumes its context uniformly. When the key information sits in the middle of the context rather than at the beginning or end, accuracy drops by more than 30%; and a model advertised with a 200,000-token window shows significant degradation as early as 50,000 tokens (Chroma Research). More unexpected still: models perform better on a shuffled haystack than on logically coherent documents — “semantically similar but irrelevant content actively misleads the model.”
The triptych is complete. 31% residual defect on a standard pipeline. Above 10% on frontier models facing long, clean enterprise documents. Minus 30% in accuracy when information lands poorly in the context. None of these numbers says the model is bad. All three say: the model faithfully executes what is presented to it — and what is presented is too noisy, too long, too structurally disorganized to be properly used. The culprit is not the embedding. It is not even the retrieval architecture. It sits upstream.
Chroma named context rot. There is a sibling no one has named yet: corpus rot
Chroma popularized a useful term: context rot. It is the degradation of reasoning when the context injected at inference time becomes too long, too redundant, too unbalanced. Glean picked it up. So did The New Stack and Atlan. It is an inference-side concept.
What is missing is the mirror concept on the source side. I propose a name: corpus rot.
Corpus rot is what happens inside an enterprise document repository when it stops being governed. Three diverging duplicates of the same HR policy, each cited by a different system. A 2019 procedure no one retired, indexed next to the 2026 version. A steering memo that contradicts the official standard without being marked as such. A salesperson’s PowerPoint that ends up, by accident, more semantically similar to a user query than the validated manual — because it reuses the keywords. These are not occasional accidents. On a single document repository, during a first diagnostic engagement with a K-AI client, we routinely detect several hundred such inconsistencies — internal contradictions, duplicates, unmarked obsolescence. And that is one repository, in an organisation that runs dozens.
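To make the diagnostic concrete, here is a minimal sketch of how diverging duplicates can be surfaced. It uses Python’s standard-library difflib as a crude stand-in for the embedding-based similarity a real audit would use; the thresholds and sample documents are illustrative assumptions, not K-AI’s actual method.

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_diverging_duplicates(docs, low=0.60, high=0.98):
    """Flag document pairs that are near-duplicates but not identical.

    docs: dict mapping doc id -> text. Thresholds are illustrative:
    pairs whose similarity lands between `low` and `high` look like two
    copies of the same source that have been edited apart.
    """
    flagged = []
    for (id_a, text_a), (id_b, text_b) in combinations(docs.items(), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if low <= ratio < high:
            flagged.append((id_a, id_b, round(ratio, 2)))
    return flagged

# Invented sample corpus: two HR policies that drifted apart, one unrelated doc.
corpus = {
    "hr_policy_2019": "Employees may work remotely two days per week with manager approval.",
    "hr_policy_2026": "Employees may work remotely three days per week with manager approval.",
    "expense_guide":  "Submit expense reports within 30 days of the purchase date.",
}
```

On this toy corpus, only the two HR policies are flagged: similar enough to be the same document, different enough to give two systems two different answers to the same question.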
Corpus rot has a cruel property: it is invisible at the moment the RAG pipeline is built. Everything works in the demo, on questions chosen by teams who know their data. It surfaces in production, when real users ask real questions about the organisation’s blind spots. The best AI teams I work with then blame the embedding, the re-ranker, the chunking, the prompt. They spin their wheels for six months. On 4 May, Pinecone politely told them it was not there. What remains is to say where.
The CTO–CDO–Head of KM triangle blames the wrong opponent
In a large organisation, when a RAG hallucinates in production, three roles are alerted. The CTO looks at the stack — embedding, vector, re-ranker, GraphRAG. The CDO looks at the data — pipelines, quality, governance. The Head of Knowledge Management looks at usage — search, adoption, taxonomy. Each sees a slice. No one owns the whole object.
The whole object is what we call a Document Knowledge Platform — the transposition of Data Catalog and Data Mesh patterns to the unstructured layer. A dedicated layer for enterprise document knowledge, treated the way we have been treating structured data for a decade: with a system of record, a semantic graph, contradiction detection, an observability discipline. As long as that layer does not exist, you are asking three roles to solve, each from their own function, a problem that exceeds their individual scope. The outcome is predictable: you patch the embedding, you patch the stack, you patch agent governance — as Glean proposes this week (MarTech Series) — and the hallucination rate does not come down.
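As a sketch of what the system-of-record piece of such a layer enforces (the record fields and resolution rule below are my own illustration, not a product spec):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class DocRecord:
    doc_id: str
    topic: str
    updated: date
    authoritative: bool  # set explicitly by the owning team, never inferred

def system_of_record(records, topic) -> Optional[DocRecord]:
    """Return the single authoritative document for a topic, or None.

    The rule: exactly one document per topic may be marked authoritative.
    Zero or several claimants is a corpus-rot signal to surface for human
    review, never something to resolve silently to 'most recent'.
    """
    claimants = [r for r in records if r.topic == topic and r.authoritative]
    return claimants[0] if len(claimants) == 1 else None

# Invented records: a governed topic and an ungoverned one.
records = [
    DocRecord("hr-policy-v3", "remote-work", date(2026, 2, 1), True),
    DocRecord("hr-policy-v1", "remote-work", date(2019, 6, 15), False),
    DocRecord("memo-q1", "travel", date(2026, 1, 10), False),
]
```

The design choice matters: falling back to the most recent document would hide the governance gap that the None return deliberately exposes.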
What the three numbers say, read together, is that no downstream layer recovers a corpus that rots upstream. Knowledge graphs help — Squirro makes that case usefully (Squirro) — but they do not invent a truth the source documents do not carry. Downstream hallucination detectors, such as Cleanlab’s TLM, identify that a response is false, but not why the corpus made it likely (Cleanlab). These tools are useful. None of them substitutes for governance of the source.
From Start Clean to Stay Clean — continuous corpus monitoring as a product discipline
The operational consequence is less glamorous than swapping an embedding, but it has a better return on investment. It comes in two phases.
Phase one — initial cleanup. Before any serious RAG deployment, audit the target corpus. Detect inter-document contradictions, diverging duplicates, obsolete content, and zones not covered by an authoritative source. Establish a system of record. On the perimeters we have audited, this initial cleanup makes it possible to remove or merge a meaningful share of the document volume — simply because no one had ever done it. It is work the organisation never owned because no team had the explicit mandate.
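In practice, contradiction detection at this stage relies on NLI or LLM-based comparison; as a toy illustration of the principle, the sketch below flags documents that disagree on the same numeric policy value. The topics, patterns, and sample documents are invented.

```python
import re
from collections import defaultdict

# Hypothetical topic -> pattern map; a real audit would use semantic
# comparison, not regexes. Each pattern extracts one numeric policy value.
PATTERNS = {
    "remote_days_per_week": re.compile(r"remote(?:ly)?\D*(\d+)\s*days"),
    "expense_deadline_days": re.compile(r"expense\D*(\d+)\s*days"),
}

def flag_contradictions(docs):
    """Flag topics where two documents state different numeric values.

    docs: dict mapping doc id -> text. Catches only numeric conflicts;
    a deliberately crude stand-in for semantic contradiction detection.
    """
    claims = defaultdict(dict)  # topic -> {doc_id: extracted value}
    for doc_id, text in docs.items():
        for topic, pattern in PATTERNS.items():
            match = pattern.search(text.lower())
            if match:
                claims[topic][doc_id] = int(match.group(1))
    # A topic is contradictory when documents disagree on its value.
    return {t: v for t, v in claims.items() if len(set(v.values())) > 1}

docs = {
    "policy_2019": "Staff may work remotely 2 days per week.",
    "policy_2026": "Staff may work remotely 3 days per week.",
    "finance_guide": "Expense reports are due within 30 days.",
}
```

Here only the remote-work topic is flagged: two documents assert two different values, and nothing in the repository says which one wins.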
Phase two — continuous monitoring. Once the corpus is clean, it rots again. Every day, someone produces a new version of a policy without archiving the previous one. Every month, a manual becomes obsolete. Every week, two teams document the same procedure differently. Without dedicated observability, document debt becomes invisible again within a few quarters. The discipline most large organisations are missing, for lack of a better term, is Stay Clean: continuous semantic monitoring of the corpus, detecting the emergence of contradictions, the obsolescence of sources, and freshness drift. It is what we have been doing for a decade on structured data pipelines — and what we have almost never done for documents.
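A minimal sketch of the freshness family of signals such monitoring tracks: mean document age per corpus segment, plus a count of documents past a staleness threshold. Segments, dates, and the 365-day threshold are assumptions for the example.

```python
from datetime import date

def freshness_report(docs, today, max_age_days=365):
    """Mean document age per corpus segment, plus stale-document counts.

    docs: list of (segment, last_updated) pairs. The 365-day threshold
    is illustrative, not a standard.
    """
    report = {}
    for segment, updated in docs:
        age = (today - updated).days
        seg = report.setdefault(segment, {"ages": [], "stale": 0})
        seg["ages"].append(age)
        if age > max_age_days:
            seg["stale"] += 1
    return {
        s: {"mean_age_days": sum(v["ages"]) // len(v["ages"]), "stale": v["stale"]}
        for s, v in report.items()
    }

# Invented sample: one HR document kept current, one never retired since 2019.
docs = [
    ("hr",      date(2026, 3, 1)),
    ("hr",      date(2019, 6, 15)),
    ("finance", date(2026, 4, 20)),
]
```

Run weekly, the hr segment immediately stands out: a mean age of over three years, driven by a single 2019 document no one retired.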
Pinecone admitted on 4 May that the model is no longer the bottleneck. On 13 May, I want to add the logical next step: it is not the vector either, nor even quite the retrieval. It is the material. As long as the corpus is not treated as a governed asset — with an owner, a catalog, a semantic graph, an observability discipline — you can upgrade your stack all you want. Your RAG will keep hallucinating. And you will believe, wrongly, that it is the embedding.
Frequently asked questions
Why does my RAG still hallucinate in production despite a better model and a re-ranker?
Because the residual defect measured on enterprise RAG pipelines is not located in the model or the retrieval layer. On a naive RAG, an Anthropic study measures 31% of responses containing unsupported claims; on long real-world documents, 2026 frontier models still exceed 10% hallucination rates (Vectara). Improving the stack — better embedding, re-ranker, GraphRAG — narrows the gap but does not address the upstream cause: a corpus carrying contradictions, diverging duplicates, and obsolete versions that retrieval will faithfully serve. We call this corpus rot.
What is the difference between context rot and corpus rot?
Context rot, formalised by Chroma Research, is an inference-time phenomenon: response quality degrades when the injected context becomes too long, redundant, or poorly positioned — accuracy drops by more than 30% when the key information sits in the middle of the context (Chroma). Corpus rot is the mirror phenomenon, located upstream, on the source side: an enterprise document repository that is not governed accumulates contradictions, duplicates, and unmarked obsolescence. Context rot degrades reading; corpus rot degrades the material itself. Both matter; in production, the latter dominates.
What metrics should I monitor to qualify a corpus as AI-ready?
Five families, observed the way we observe the quality of a structured-data pipeline. One, the rate of inter-document contradictions detected in the scope (policies, procedures, standards that contradict each other without explicit hierarchy). Two, the rate of diverging duplicates (competing versions of the same document with different content). Three, the rate of unmarked obsolescence (expired documents not retired nor flagged). Four, freshness by corpus segment (mean date of last update). Five, coverage of user intents (zones of demand without an authoritative source). These five metrics, tracked continuously, are the basis of a defensible AI Readiness Score for documents.
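The five families can be rolled up into a single score. The weighting below is an illustrative assumption; in a real engagement it would be tuned to the risk profile of the corpus.

```python
def ai_readiness_score(metrics, weights=None):
    """Combine the five corpus-health rates (each in [0, 1]) into a 0-100 score.

    Higher is better. The default weights are illustrative, not a standard;
    contradictions are weighted heaviest because they directly seed
    hallucinations that retrieval will faithfully serve.
    """
    weights = weights or {
        "contradiction_rate": 0.30,
        "diverging_duplicate_rate": 0.20,
        "unmarked_obsolescence_rate": 0.20,
        "stale_segment_rate": 0.15,
        "uncovered_intent_rate": 0.15,
    }
    penalty = sum(weights[k] * metrics[k] for k in weights)
    return round(100 * (1 - penalty), 1)

# Invented example: 8% contradictions, 12% diverging duplicates, and so on.
score = ai_readiness_score({
    "contradiction_rate": 0.08,
    "diverging_duplicate_rate": 0.12,
    "unmarked_obsolescence_rate": 0.10,
    "stale_segment_rate": 0.25,
    "uncovered_intent_rate": 0.15,
})
```

What matters is less the absolute number than its trend: tracked continuously, a falling score is the early signal that the corpus is rotting again.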
Why does Copilot return fewer results than SharePoint search on the same query?
It is a recurring complaint in Microsoft 365 communities — documented cases show 80 results on the SharePoint side versus a handful on the Copilot side for the same query. Multiple causes combine: index synchronisation does not cover all subfolders, certain file types are not recognised as knowledge sources, permission filters differ. But the deeper cause, in the corpora we audit, is simpler: Copilot only serves a subset of the corpus and has no way to signal that the documents it does not retrieve might be relevant. On an ungoverned repository, that subset can be radically poorer than the actual base.
Do knowledge graphs really solve RAG hallucinations?
Partially. An enterprise knowledge graph adds a deterministic layer on top of vector retrieval — it can require that an answer be consistent with a verified entity-and-relation schema. Squirro and others document the gains (Squirro). But a graph does not create a truth that does not exist in the source documents. If two documents say two contradictory things, the graph will, at best, reflect the conflict; at worst, reproduce the most represented version. A knowledge graph’s quality is capped by the quality of its extraction corpus. Hence the Start Clean, Stay Clean logic upstream.
Going further
If you recognise the situation I am describing — a production RAG pipeline that will not drop below an unacceptable error rate despite several optimisation cycles — the useful next step is not a new embedding. It is an audit of the corpus it consumes. We do this for large enterprises on scoped perimeters. Reach us at contact@k-ai.ai.
Sources cited
- Pinecone, Better Models Won’t Save Your Agent and Pinecone Nexus: The Knowledge Engine for Agents, 4 May 2026 — https://www.pinecone.io/blog/introducing-nexus-knowledge-engine/
- The New Stack, The company that made RAG mainstream is now betting against it, May 2026 — https://thenewstack.io/pinecone-nexus-rag-obsolete/
- Chroma Research, Context Rot: How Increasing Input Tokens Impacts LLM Performance, 2025-2026 — https://www.trychroma.com/research/context-rot
- Vectara, Introducing the Next Generation of Vectara’s Hallucination Leaderboard, 2026 — https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard
- generation RAG, The Ugly Truth About Enterprise RAG Anthropic Just Quantified, 2026 — https://ragaboutit.com/the-ugly-truth-about-enterprise-rag-anthropic-just-quantified/
- VentureBeat, The Retrieval Rebuild, Q1 2026 — https://venturebeat.com/data/the-retrieval-rebuild-why-hybrid-retrieval-intent-tripled-as-enterprise-rag-programs-hit-the-scale-wall
- MarTech Series, Glean Introduces the Enterprise Agent Development Lifecycle, 12 May 2026 — https://martechseries.com/predictive-ai/ai-platforms-machine-learning/glean-introduces-the-enterprise-agent-development-lifecycle-codifying-how-enterprises-build-govern-and-measure-ai-agents/
- Squirro, How Knowledge Graphs Bridge the Gap in Enterprise AI, 5 March 2026 — https://squirro.com/squirro-blog/how-do-knowledge-graphs-bridge-the-gap-in-enterprise-ai
- Cleanlab, Benchmarking Hallucination Detection in RAG, 2026 — https://cleanlab.ai/blog/rag-tlm-hallucination-benchmarking/
Related reading
- AI Act, D-82: why a “dirty” document corpus makes your high-risk AI indefensible — the legal angle on the same problem, ahead of the AI Act deadline.
- You don’t have 50 AI agents. You will — and they will all share the same corpus — the multiplier effect of document debt as agents proliferate.
K-AI already supports CMA CGM, Veolia, PwC, BNP Paribas, TotalEnergies, and CEVA Logistics. Partners: AWS, Snowflake, Microsoft, Wavestone, Devoteam.
