01 — The Problem
Knowledge trapped in documents nobody reads
Most organizations have policies, procedures, and handbooks that employees are expected to know — but practically never read. When someone needs an answer, they either send an email to HR, guess, or give up. The documents exist. The knowledge is there. It's just completely inaccessible in practice.
The average employee handbook runs 40–80 pages. Nobody is searching a PDF to find out how many sick days they get. The information gap isn't a training problem — it's an access problem. Employees need cited, instant answers from documents they can trust, without needing to know where to look or what to search for.
But here's the real problem that almost nobody talks about: the AI isn't what's broken. The data is. And that's exactly what this prototype was built to expose.
02 — Why AI
Why this needed RAG, not a search bar
A traditional keyword search finds documents that contain the words you typed. If you search "sick days," it returns every page that mentions the phrase. That's not useful when policy documents use phrases like "paid absence allowance" or "medical leave entitlement" — terminology employees don't know to look for.
Retrieval-Augmented Generation (RAG) works differently. It converts both the documents and the question into vector embeddings — numerical representations of meaning, not just words. The system finds chunks of text that are semantically similar to the question, even when the exact words don't match. Then it passes those relevant chunks to a language model that synthesizes a direct answer, with citations pointing back to the source.
For HR policy documents specifically, this matters because employees don't know the language of the policies. They know their situation. RAG bridges that gap — it understands what they mean, not just what they said.
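The retrieval step can be sketched in a few lines. This is a toy illustration, not the prototype's actual code: in the real pipeline the vectors come from an embedding model and live in a vector store, while here they are hand-made 3-dimensional stand-ins so the example runs standalone.

```python
import math

# Hand-made stand-in embeddings; a real pipeline would call an embedding
# model for both the chunks and the question.
chunks = {
    "Employees accrue a paid absence allowance of 10 days per year.": [0.9, 0.1, 0.2],
    "The office closes at 6 pm on Fridays.": [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.25]  # pretend embedding of "How many sick days do I get?"

def cosine(a, b):
    """Cosine similarity: how close two vectors point, regardless of length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank chunks by semantic similarity, not keyword overlap: the best match
# never contains the words "sick days".
best = max(chunks, key=lambda c: cosine(chunks[c], query_vec))
print(best)
```

The point of the sketch is the ranking criterion: the chunk about the "paid absence allowance" wins even though it shares no keywords with the question, which is exactly what a keyword search cannot do.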
03 — The Hard Part
Where most RAG projects fail before they start
The AI is not the problem. The data is. Most enterprise RAG demos use clean, pre-formatted documents. Real organizations hand you scanned PDFs, version-chaos Word docs, and files with zero metadata. This prototype was built to show exactly that — and what it takes to actually solve it.
To make this concrete, I built four intentionally difficult test documents — the kind you actually encounter inside organizations:
The difference in answer quality between document 3 and document 4 is not a model problem. It's entirely a data problem. That's the point this prototype proves.
04 — Architecture
How the pipeline works
The system has two distinct phases: ingestion and query. They're independent — you build the index once, then query it many times.
Phase 1 — Ingestion
Phase 2 — Query
Both phases run through N8N, which makes the workflow visual, debuggable, and easy to explain to a non-technical stakeholder. The ingestion workflow runs once per document batch. The query workflow runs on every user question.
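The two phases can be expressed as two plain functions. In the prototype each step is an N8N node rather than Python; the function names and the toy word-overlap scoring below are illustrative stand-ins (real retrieval uses vector similarity).

```python
# Built once by the ingestion phase, read many times by the query phase.
INDEX: list[dict] = []

def ingest(documents: dict[str, str]) -> None:
    """Phase 1 (runs once per document batch): split each document into
    chunks and store them with their source for later citation."""
    for source, text in documents.items():
        for chunk in text.split("\n\n"):  # chunking, heavily simplified
            INDEX.append({"source": source, "text": chunk})

def retrieve(question: str, top_k: int = 2) -> list[dict]:
    """Phase 2 (runs per question): score every chunk against the question
    and return the best matches as context for the language model.
    Toy scoring: shared-word count stands in for vector similarity."""
    q_words = set(question.lower().split())
    scored = sorted(
        INDEX,
        key=lambda c: -len(q_words & set(c["text"].lower().split())),
    )
    return scored[:top_k]
```

The separation matters operationally: `ingest` is the expensive, batch-time path, and `retrieve` is the cheap, per-question path, so the two can be scaled and debugged independently.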
05 — Chunking
Why recursive character splitting at 512 tokens
Chunking is one of the most consequential decisions in a RAG pipeline, and most tutorials gloss over it. The method you choose directly determines whether your retrieved context is coherent or garbage.
Why not fixed-size chunking? Fixed-size splits cut text at arbitrary character counts regardless of sentence or paragraph boundaries. A chunk might start mid-sentence and end mid-thought. The retrieved text makes no sense in isolation, and the model hallucinates to fill the gaps.
Why not semantic chunking? Semantic chunking uses an additional model pass to split at meaning boundaries. It produces better chunks, but it's slower, more expensive, and harder to debug. For a prototype, it's overkill and introduces a dependency that obscures the core architecture.
Why recursive character splitting? It splits on natural text boundaries — paragraphs first, then sentences, then words — only falling back to character-level splits when necessary. It respects how humans actually write. It's built into N8N natively. And at 512 tokens with 50-token overlap, it produces chunks that are large enough to have context, small enough to be specific, and overlapping enough to avoid answer fragments at boundaries.
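A minimal version of the recursive strategy looks like this. It measures chunks in characters rather than tokens for simplicity, and it omits the 50-token overlap the pipeline uses; the fallback order (paragraphs, then sentences, then words, then a hard cut) is the part being illustrated.

```python
def recursive_split(text: str, max_len: int = 512,
                    separators: tuple = ("\n\n", ". ", " ")) -> list[str]:
    """Split text on the coarsest separator that works, recursing with
    finer separators only for pieces that are still too long."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_len:
                    current = candidate  # keep packing into the same chunk
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Any piece still over the limit gets re-split at a finer level.
            return [c for chunk in chunks
                      for c in recursive_split(chunk, max_len, separators)]
    # Last resort: hard character-level cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Because paragraph breaks are tried first, a well-written policy section usually survives as one coherent chunk, and the sentence- and word-level fallbacks only fire on unusually long runs of text.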
06 — Data Preparation
Cleaning and tagging documents before ingestion
Before a single document enters the pipeline, it goes through three layers of preparation. Skip any one of them and the retrieval quality degrades significantly.
Cleaning removes everything that isn't content: watermarks, headers and footers, cover pages, page numbers, duplicate whitespace, and formatting artifacts from PDF extraction. The goal is pure, clean prose that a splitter can work with.
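A cleaning pass of this kind is mostly pattern removal. The patterns below are examples, not the prototype's actual rules; every document set needs patterns tuned to its own artifacts.

```python
import re

def clean(page_text: str) -> str:
    """Strip non-content artifacts from extracted page text.
    The specific patterns here are illustrative examples."""
    text = re.sub(r"^CONFIDENTIAL.*$", "", page_text, flags=re.MULTILINE)  # watermark lines
    text = re.sub(r"^Page \d+ of \d+$", "", text, flags=re.MULTILINE)      # page-number footers
    text = re.sub(r"[ \t]+", " ", text)                                    # collapse duplicate spaces
    text = re.sub(r"\n{3,}", "\n\n", text)                                 # collapse blank-line runs
    return text.strip()
```

The output should read as plain prose; anything the splitter would otherwise treat as content (a repeated footer, a watermark) ends up polluting chunks and, later, retrieved context.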
Metadata tagging is what makes cited answers possible. Every chunk gets tagged with its source document, section name, department, page number, and last-updated date. Without this, the model can return a correct answer with no verifiable source — which is worthless in an enterprise context.
Structure awareness means ensuring that headings stay attached to their content. A chunk that contains a policy rule but not the section heading it belongs to loses its organizational context. The preparation step explicitly connects heading hierarchy to body content before splitting.
"source": "Employee_Handbook_2026.pdf",
"section": "PTO Policy",
"department": "HR",
"page": 14,
"last_updated": "Jan 2026"
}
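That metadata travels with each chunk, and the citation string in the final answer is assembled from it. A minimal sketch (the chunk text and the exact citation format are illustrative, not the prototype's output):

```python
# A chunk pairs its text with the metadata attached during preparation.
chunk = {
    "text": "Full-time employees receive 10 sick days per calendar year.",
    "metadata": {
        "source": "Employee_Handbook_2026.pdf",
        "section": "PTO Policy",
        "department": "HR",
        "page": 14,
        "last_updated": "Jan 2026",
    },
}

def citation(chunk: dict) -> str:
    """Build a human-readable citation from chunk metadata."""
    m = chunk["metadata"]
    return f"(Source: {m['source']}, Section: {m['section']}, Page {m['page']})"

print(citation(chunk))
# → (Source: Employee_Handbook_2026.pdf, Section: PTO Policy, Page 14)
```

Without those fields, the generation step has nothing to cite, which is why metadata tagging happens before ingestion rather than being bolted on afterward.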
07 — Results
The same question. Two very different answers.
The clearest way to show what data preparation actually does is a direct comparison. Same question, same model, same pipeline — different source documents.
BEFORE — Unprepared Documents
Q: How many sick days do full-time employees get?
"Based on the documents provided, employees may be entitled to sick leave, however the exact number of days could not be confirmed from the available content."
AFTER — Prepared Documents
Q: How many sick days do full-time employees get?
"Full-time employees receive 10 sick days per calendar year. (Source: Employee Handbook 2026, Section 4.2 — Leave Policy, Page 14)"
08 — Tradeoffs
What I optimized for — and what I sacrificed
Every architecture decision is a tradeoff. Here's what I chose and why:
| Decision | Tradeoff |
|---|---|
| N8N for orchestration | Less flexibility than code, but faster to build, easier to document visually, and legible to non-engineers reviewing the architecture. |
| Recursive over semantic chunking | Lower recall ceiling on complex documents, but faster, cheaper, and sufficient for prototype scope. Semantic chunking is the right path for production. |
| Claude for generation | Higher cost per query than GPT-3.5, but significantly stronger at following citation instructions and grounding answers in provided context. |
| Supabase pgvector | Not as horizontally scalable as Pinecone for millions of vectors, but zero new infrastructure for an existing Supabase stack and fully sufficient at prototype scale. |
| Prototype scope (no auth) | No role-based access, no multi-user isolation, no audit trail — all required for production. Excluded intentionally to keep the focus on the data pipeline problem. |
09 — Next Version
What production would actually require
This prototype proves the concept. A production deployment for an enterprise would require significant additional engineering before it could be trusted with real data:
10 — Lessons
What building this taught me
The most important thing this prototype surfaced is something most AI content completely glosses over: the data problem is the hard problem. Before I built this, I assumed the interesting challenge was the model selection, the prompt engineering, the retrieval strategy. Those all matter. But the thing that determines whether a RAG system is actually useful in an enterprise context is whether anyone bothered to clean and structure the source documents before ingestion. Most don't. Most fail because of it.
The second thing I took away is about explainability. RAG is not a complex architecture once you've actually built one. The pipeline is logical, linear, and easy to trace. That matters for selling it internally, for debugging it, and for convincing a risk-averse organization that this isn't a black box. The complexity isn't in the plumbing — it's in the data quality, the prompting, and the guardrails. That's where the real work is, and it's also where the real value is.
The third insight is about governance. The questions that matter most in enterprise AI aren't "which model?" or "which vector database?" They're "who has access to what?", "what happens when the system returns something wrong?", and "how do we know it's actually working over time?" Those questions require an architect's mindset more than a data scientist's. That's the gap I'm trying to fill.