Most AI projects we support start with the same observation from the client: “We have the knowledge, we just cannot find it again.” It is scattered across three years of wiki pages, ticket histories, Word documents on file shares, and emails. Search functions only half work: they return either too much or the wrong thing.
RAG, Retrieval-Augmented Generation, is the technique designed to close exactly this gap. But it does not work out of the box. It needs a deliberate architecture.
What RAG technically does
A RAG system consists of three layers:
- Retrieval: relevant text snippets are pulled from indexed sources for a posed question — usually via vector similarity, often combined with classical keyword search.
- Augmentation: the retrieved snippets are placed into the language model’s context — as an attachment to the actual question.
- Generation: the language model formulates an answer based on this context — ideally with source attribution to the snippets used.
The model “knows” nothing about the company — it only knows what is in the current context. What it finds there, it summarises, combines, formulates. What it does not find, it cannot answer — and exactly that is the desired property.
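The three layers can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production design: word-overlap cosine similarity stands in for real vector embeddings, and `SNIPPETS`, `retrieve`, `augment`, `answer` and the `llm` callable are hypothetical names introduced only for this example.

```python
import math
import re
from collections import Counter
from typing import Callable

# Hypothetical indexed snippets, keyed by a source id.
SNIPPETS = {
    "wiki-12": "VPN access is requested through the IT service portal.",
    "ticket-881": "Printer on floor 3 was fixed by reinstalling the driver.",
}

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over word counts (stand-in for embedding similarity).
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, snippets: dict[str, str], top_k: int = 2) -> list[tuple[str, str]]:
    """Retrieval: rank snippets by similarity and drop those with no overlap."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), sid, text) for sid, text in snippets.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(sid, text) for score, sid, text in scored[:top_k] if score > 0]

def augment(question: str, hits: list[tuple[str, str]]) -> str:
    """Augmentation: place the retrieved snippets into the model's context."""
    context = "\n".join(f"[{sid}] {text}" for sid, text in hits)
    return (
        "Answer only from the context below and cite the snippet ids.\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def answer(question: str, snippets: dict[str, str], llm: Callable[[str], str]) -> str:
    """Generation: hand the augmented prompt to a language model (passed in as `llm`)."""
    hits = retrieve(question, snippets)
    if not hits:
        # Nothing found means nothing to answer from, which is the desired behaviour.
        return "No matching source found."
    return llm(augment(question, hits))
```

Passing the model in as a plain callable keeps the sketch independent of any specific model provider; in a real setup it would wrap whichever model is deployed.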
Which sources are suitable
Three source categories are typical and accessible in mid-sized companies:
- Wikis and Confluence-like systems. Structured, often with clear markers of how current a page is, and easy to index. In our experience the most valuable source, provided it is maintained.
- Tickets and their resolutions. A gold mine for support and IT applications. If the same question has been asked and solved twelve times, a search over the RAG index surfaces that resolution.
- Documents on file shares. PDF, Word, sometimes Excel. These require considerably more preparation (structuring, version cleanup, scope assignment), but the truly binding decisions often sit here.
Not every source must land in every RAG index. A RAG system for IT service needs tickets and IT runbooks; a RAG system for accounting needs tax guidelines and contract templates. Splitting into several focused indices is usually better than one large universal index.
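One way to express the split into focused indices is a small routing table that decides, per document, which index it belongs to. The sketch below is an assumption for illustration only; `FocusedIndex`, `INDICES`, `route` and the source type labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FocusedIndex:
    name: str
    source_types: frozenset[str]  # which kinds of documents this index accepts

    def accepts(self, doc: dict) -> bool:
        return doc["source_type"] in self.source_types

# Hypothetical configuration mirroring the two examples in the text.
INDICES = [
    FocusedIndex("it-service", frozenset({"ticket", "runbook"})),
    FocusedIndex("accounting", frozenset({"tax_guideline", "contract_template"})),
]

def route(doc: dict, indices: list[FocusedIndex]) -> list[str]:
    """Return the names of all indices a document should be added to."""
    return [ix.name for ix in indices if ix.accepts(doc)]
```

A document whose type matches no index is simply not indexed, which keeps unrelated material out of every focused index by default.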
Permissions — the underrated sticking point
A RAG system must respect each person’s permissions. If the HR department has access to personnel files and IT does not, the RAG system must not build an answer from personnel files for someone in IT.
That sounds obvious. In practice it is one of the most common mistakes — because it is technically demanding:
- Every source must carry permission metadata.
- The vector index must evaluate those metadata at retrieval time.
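- The answer must not use information from sources the requester cannot access. In practice that means such chunks are filtered out before they reach the language model, since the model itself cannot be relied on to withhold what is already in its context.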
We have seen setups where, for convenience, this separation was omitted — with the argument “all data is in-house anyway”. That is legally risky and organisationally toxic. A RAG system must, from the start, live with the existing permission structure.
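A minimal sketch of evaluating permission metadata at retrieval time, assuming each chunk carries a set of groups allowed to read it. `Chunk`, `allowed_groups`, `retrieve_for_user` and the precomputed `score` are hypothetical names for illustration; a real vector index would typically apply the same filter inside the query.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]  # permission metadata carried over from the source
    score: float                    # similarity to the question, assumed precomputed

def permitted(chunk: Chunk, user_groups: frozenset[str]) -> bool:
    # Visible if the requester shares at least one group with the chunk.
    return bool(chunk.allowed_groups & user_groups)

def retrieve_for_user(candidates: list[Chunk], user_groups: frozenset[str], top_k: int = 3) -> list[Chunk]:
    """Filter by permission first, then rank, so hidden chunks never reach the prompt."""
    visible = [c for c in candidates if permitted(c, user_groups)]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]
```

Filtering before ranking matters: a personnel-file chunk may be the best match for a question, and it still must not appear in the context of someone in IT.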
What a productive architecture looks like
Simplified, a productive RAG setup consists of:
- Source adapters. One connector per source system (wiki, ticket system, file share) that fetches content, normalises it and enriches it with metadata.
- Preparation pipeline. Texts are split into suitable chunks, transformed into vectors, stored in the index. Updates run periodically (daily or on change).
- Retrieval layer. On a question, relevant chunks are returned with permission filtering. Often combined with additional reranking for better hits.
- Generation layer. The language model receives question and retrieved chunks, formulates the answer, returns sources.
- Audit layer. Every request, every retrieved source, every answer is logged — for quality control and compliance.
Such an architecture can run entirely on open-source building blocks in your own data centre. Cloud solutions are possible but require additional data-protection review.
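Of the layers listed above, the audit layer is the easiest to sketch. The example below writes one JSON line per request; the field names and the functions `audit_record` and `write_audit` are assumptions chosen for illustration, not a prescribed log format.

```python
import json
from datetime import datetime, timezone
from typing import TextIO

def audit_record(user_id: str, question: str, source_ids: list[str], answer: str) -> dict:
    """Build one audit entry: who asked what, which sources were used, what was answered."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "question": question,
        "source_ids": source_ids,
        "answer": answer,
    }

def write_audit(log: TextIO, record: dict) -> None:
    # One JSON object per line keeps the log appendable and easy to parse later.
    log.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because `write_audit` accepts any text stream, the same code can write to a file, a pipe to a log collector, or an in-memory buffer in tests.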
The most common pitfall
From the projects we have seen, the most common stumbling block is tuning the sources for retrieval. A RAG index is only as good as the strategy for chunking its content. Chunks that are too small carry too little context for a good answer. Chunks that are too large add noise and produce wrong hits.
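The trade-off becomes concrete with a simple word-based chunker that has two knobs: chunk size and overlap. This is one common approach, shown as a sketch; the function name `chunk_words` and the default values are assumptions, and splitting along headings or paragraphs is an equally valid alternative.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap softens the cut between chunks, so a sentence that straddles a boundary still appears whole in at least one of them.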
There is no universal answer. A productive chunking strategy is tested against real questions during the first four to six weeks and adjusted. That is manual work, but it pays off.
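Testing with real questions can be made repeatable with a small hit-rate check: for each question, does the source that should answer it appear among the retrieved chunks? The sketch below assumes a list of (question, expected source id) pairs collected by hand; `hit_rate` and `retrieve_ids` are hypothetical names.

```python
from typing import Callable

def hit_rate(cases: list[tuple[str, str]], retrieve_ids: Callable[[str], list[str]]) -> float:
    """Fraction of test questions whose expected source id is among the retrieved ids."""
    if not cases:
        return 0.0
    hits = sum(1 for question, expected in cases if expected in retrieve_ids(question))
    return hits / len(cases)
```

Running the same question set after each change to chunk size or overlap shows whether an adjustment actually helped, rather than relying on impressions.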
What becomes visible in the end
A well-implemented RAG system turns “we have the knowledge, we just cannot find it” into “we ask, and we get an answer with a source citation”. The effect is not spectacular, but it runs deep: staff work faster, new hires become productive sooner, and inconsistencies in the content become visible because the system surfaces them.
That is the productive side of AI in the mid-market. Not loud, but felt.