The most common disappointment in AI projects sounds like this: “The system gives wrong answers.” On closer inspection, the cause is almost never the model but the data source. The model hallucinated because the right information was not in the indexed data, or because several contradictory versions were present and none was flagged as authoritative.
Data is the invisible but decisive half of any serious AI application.
What makes a good knowledge source
“Good” is not a vague term in the data-source context. Six concrete properties distinguish productive knowledge sources from sources that cause more problems than they solve:
- Currency. When was the information last updated? If the document is four years old and the world has changed since, it is a risk, not an asset.
- Authoritativeness. Who approved the document? Is it the official version or a draft that was never withdrawn? A source without a clear approval trail produces contradictory answers.
- Unambiguous statements. Does the text say clearly what holds, or is it deliberately left open to allow room for interpretation? AI systems handle clear statements well and stumble on ambiguous ones.
- Source citations. What does the statement rest on? A standard, an internal decision, an in-house calculation? Without a source, nobody can check whether the information still holds.
- Structure. Is the document organised, with headings, lists, clearly separated topics? Long unstructured prose is harder to chunk correctly than well-structured text.
- Boundaries. What is in the document — and what is not? When a document seems to answer a question but only partially covers it, the gap is dangerous.
In most companies fewer than 30 per cent of existing knowledge sources meet these six criteria. The rest are still in use, but they become a source of hallucination once an AI system indexes them.
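To make the six criteria tangible, the following is a minimal Python sketch of how one reviewer's verdict per source could be recorded and how the share of fully compliant sources could be computed. The class name `CriteriaCheck`, its field names, and the example sources are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CriteriaCheck:
    """One reviewer's yes/no verdict per criterion for a single source."""
    current: bool
    authoritative: bool
    unambiguous: bool
    cites_sources: bool
    structured: bool
    bounded: bool


def failed_criteria(check: CriteriaCheck) -> list[str]:
    """Return the names of the criteria the source does not meet."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


def share_meeting_all(checks: list[CriteriaCheck]) -> float:
    """Fraction of sources meeting all six criteria (0.0 for an empty list)."""
    if not checks:
        return 0.0
    passed = sum(1 for c in checks if not failed_criteria(c))
    return passed / len(checks)


# Hypothetical example sources
handbook = CriteriaCheck(True, True, True, True, True, True)
old_draft = CriteriaCheck(
    current=False,
    authoritative=False,
    unambiguous=True,
    cites_sources=False,
    structured=True,
    bounded=True,
)

print(failed_criteria(old_draft))
print(share_meeting_all([handbook, old_draft]))
```

Recording the failed criteria by name, rather than a single score, keeps visible what kind of preparation a source would need.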
What an audit of your own sources looks like
Before a RAG system is set up, a source audit is worth doing. It need not be extensive, but it should be structured:
- Inventory: which sources are eligible — wiki, SharePoint, file server, email archive, ticket system, external documents?
- Assessment: every source is checked against the six criteria. The result is one of four categories: “index unrestricted”, “index with caveat”, “prepare then index”, or “do not index”.
- Preparation: the middle two categories get work. Out-of-date documents are updated or archived. Contradictory versions are consolidated to one authoritative form. Unstructured texts are organised.
- Classification: sources get metadata — date, approver, scope, sensitivity. These metadata are used in the RAG system to place answers in context and filter them.
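The assessment step can be sketched as a small triage function that maps the criteria a source fails to one of the four outcomes named above. Which failures count as blocking, and the `repairable` flag, are assumptions made for illustration; each company would set its own policy here.

```python
from enum import Enum


class Triage(Enum):
    INDEX_UNRESTRICTED = "index unrestricted"
    INDEX_WITH_CAVEAT = "index with caveat"
    PREPARE_THEN_INDEX = "prepare then index"
    DO_NOT_INDEX = "do not index"


# Assumption: out-of-date or unapproved sources must be fixed before indexing.
# This rule is a hypothetical policy choice, not taken from the article.
BLOCKING = {"current", "authoritative"}


def triage(failed: set[str], repairable: bool = True) -> Triage:
    """Map the set of failed criteria to one of the four audit outcomes."""
    if not failed:
        return Triage.INDEX_UNRESTRICTED
    if failed & BLOCKING:
        # Blocking failures need preparation; if that is not feasible, exclude.
        return Triage.PREPARE_THEN_INDEX if repairable else Triage.DO_NOT_INDEX
    # Only non-blocking criteria failed: index, but flag the weakness.
    return Triage.INDEX_WITH_CAVEAT


print(triage(set()).value)
print(triage({"structured"}).value)
print(triage({"current", "cites_sources"}).value)
print(triage({"authoritative"}, repairable=False).value)
```

An explicit rule like this makes the audit repeatable: two reviewers who agree on the failed criteria arrive at the same category.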
This step costs time — but it is the difference between a RAG system the company trusts and one that nobody uses after three weeks.
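How the metadata from the classification step (date, approver, scope, sensitivity) might be used at retrieval time can be sketched as follows. The `Chunk` type, the integer sensitivity scale, and the example data are hypothetical; a real RAG stack would apply the same idea through its own metadata filters.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    """A retrieved text chunk plus the audit metadata of its source."""
    text: str
    updated: date
    approver: str
    scope: str        # e.g. a department (hypothetical values)
    sensitivity: int  # assumed scale: 0 = public, higher = more restricted


def filter_chunks(chunks, *, scope, clearance, not_older_than):
    """Keep chunks that match the scope, fit the clearance, and are recent enough."""
    return [
        c for c in chunks
        if c.scope == scope
        and c.sensitivity <= clearance
        and c.updated >= not_older_than
    ]


def with_context(chunk):
    """Append approver and date so the reader can judge the answer's basis."""
    return f"{chunk.text} (approved by {chunk.approver}, updated {chunk.updated.isoformat()})"


chunks = [
    Chunk("Travel requests need manager sign-off.", date(2024, 3, 1), "HR lead", "hr", 0),
    Chunk("Travel requests need no sign-off.", date(2019, 5, 20), "unknown", "hr", 0),
    Chunk("Salary bands for 2024.", date(2024, 1, 15), "HR lead", "hr", 2),
]

usable = filter_chunks(chunks, scope="hr", clearance=1, not_older_than=date(2023, 1, 1))
for c in usable:
    print(with_context(c))
```

In this example the outdated, contradictory travel rule and the restricted salary document are both filtered out before the model sees them, which is the practical payoff of the classification step.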
Why mid-sized companies have much to gain
Mid-sized companies often have an underestimated advantage: they still have knowledge sources that, in larger groups, have long since disappeared into inaccessible specialist tools. Wikis, file shares, well-maintained Confluence spaces, well-run ticket systems. That is usable material once it has been cleanly prepared.
What they usually do not have are the master-data-management bureaucracies and data-protection architectures that, in larger groups, slow preparation down. A mid-sized firm can complete the audit in four to six weeks; a large group can need a year for the same work.
What changes after the audit
The most valuable result of an honest source audit is often not the AI system but the realisation of how much clarity the company has been missing. Which decisions actually applied? Which guidelines were obsolete? Which topics were documented so differently that several conflicting truths circulated internally?
This clarity would have emerged without AI too, but probably through a painful incident. An AI project brings it out more gently.
What remains in the end
An AI application with good data sources produces reliable answers that move the company forward. An AI application with unprepared sources produces a few convincingly worded hallucinations — and poisons trust in the whole technology for years.
Whoever wants to work seriously with AI does not start with the model, but with the sources. That sequence looks slower, but it is not.