01 — The Problem
Knowledge trapped in documents nobody reads
Most organizations have policies, procedures, and handbooks that employees are expected to know — but practically never read. When someone needs an answer, they either send an email to HR, guess, or give up. The documents exist. The knowledge is there. It's just completely inaccessible in practice.
The average employee handbook runs 40–80 pages. Nobody is searching a PDF to find out how many sick days they get. The information gap isn't a training problem — it's an access problem. Employees need cited, instant answers from documents they can trust, without needing to know where to look or what to search for.
But here's the real problem that almost nobody talks about: the AI isn't what's broken. The data is. And that's exactly what this prototype was built to expose.
02 — Why AI
Why this needed RAG, not a search bar
A traditional keyword search finds documents that contain the words you typed. If you search "sick days," it returns every page that mentions the phrase. That's not useful when policy documents use phrases like "paid absence allowance" or "medical leave entitlement" — terminology employees don't know to look for.
Retrieval-Augmented Generation (RAG) works differently. It converts both the documents and the question into vector embeddings — numerical representations of meaning, not just words. The system finds chunks of text that are semantically similar to the question, even when the exact words don't match. Then it passes those relevant chunks to a language model that synthesizes a direct answer, with citations pointing back to the source.
For HR policy documents specifically, this matters because employees don't know the language of the policies. They know their situation. RAG bridges that gap — it understands what they mean, not just what they said.
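The retrieval step can be sketched in a few lines. This is a toy illustration, not the prototype's actual code: in the real pipeline the vectors come from an embedding model and live in a vector store, while here they are hand-made 3-dimensional stand-ins so the example runs standalone.

```python
import math

# Hand-made stand-in embeddings; a real pipeline would call an embedding
# model for both the chunks and the question.
chunks = {
    "Employees accrue a paid absence allowance of 10 days per year.": [0.9, 0.1, 0.2],
    "The office closes at 6 pm on Fridays.": [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.25]  # pretend embedding of "How many sick days do I get?"

def cosine(a, b):
    """Cosine similarity: how close two vectors point, regardless of length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank chunks by semantic similarity, not keyword overlap: the best match
# never contains the words "sick days".
best = max(chunks, key=lambda c: cosine(chunks[c], query_vec))
print(best)
```

The point of the sketch is the ranking criterion: the chunk about the "paid absence allowance" wins even though it shares no keywords with the question, which is exactly what a keyword search cannot do.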
03 — The Hard Part
Where most RAG projects fail before they start
The AI is not the problem. The data is. Most enterprise RAG demos use clean, pre-formatted documents. Real organizations hand you scanned PDFs, version-chaos Word docs, and files with zero metadata. This prototype was built to show exactly that — and what it takes to actually solve it.
To make this concrete, I built four intentionally difficult test documents — the kind you actually encounter inside organizations:
The difference in answer quality between document 3 and document 4 is not a model problem. It's entirely a data problem. That's the point this prototype proves.
04 — Architecture
How the pipeline works
The system has two distinct phases: ingestion and query. They're independent — you build the index once, then query it many times.
Phase 1 — Ingestion
Phase 2 — Query
Both phases run through N8N, which makes the workflow visual, debuggable, and easy to explain to a non-technical stakeholder. The ingestion workflow runs once per document batch. The query workflow runs on every user question.
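The two phases can be expressed as two plain functions. In the prototype each step is an N8N node rather than Python; the function names and the toy word-overlap scoring below are illustrative stand-ins (real retrieval uses vector similarity).

```python
# Built once by the ingestion phase, read many times by the query phase.
INDEX: list[dict] = []

def ingest(documents: dict[str, str]) -> None:
    """Phase 1 (runs once per document batch): split each document into
    chunks and store them with their source for later citation."""
    for source, text in documents.items():
        for chunk in text.split("\n\n"):  # chunking, heavily simplified
            INDEX.append({"source": source, "text": chunk})

def retrieve(question: str, top_k: int = 2) -> list[dict]:
    """Phase 2 (runs per question): score every chunk against the question
    and return the best matches as context for the language model.
    Toy scoring: shared-word count stands in for vector similarity."""
    q_words = set(question.lower().split())
    scored = sorted(
        INDEX,
        key=lambda c: -len(q_words & set(c["text"].lower().split())),
    )
    return scored[:top_k]
```

The separation matters operationally: `ingest` is the expensive, batch-time path, and `retrieve` is the cheap, per-question path, so the two can be scaled and debugged independently.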
05 — Chunking
Why recursive character splitting at 512 tokens
Chunking is one of the most consequential decisions in a RAG pipeline, and most tutorials gloss over it. The method you choose directly determines whether your retrieved context is coherent or garbage.
Why not fixed-size chunking? Fixed-size splits cut text at arbitrary character counts regardless of sentence or paragraph boundaries. A chunk might start mid-sentence and end mid-thought. The retrieved text makes no sense in isolation, and the model hallucinates to fill the gaps.
Why not semantic chunking? Semantic chunking uses an additional model pass to split at meaning boundaries. It produces better chunks, but it's slower, more expensive, and harder to debug. For a prototype, it's overkill and introduces a dependency that obscures the core architecture.
Why recursive character splitting? It splits on natural text boundaries — paragraphs first, then sentences, then words — only falling back to character-level splits when necessary. It respects how humans actually write. It's built into N8N natively. And at 512 tokens with 50-token overlap, it produces chunks that are large enough to have context, small enough to be specific, and overlapping enough to avoid answer fragments at boundaries.
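A minimal version of the recursive strategy looks like this. It measures chunks in characters rather than tokens for simplicity, and it omits the 50-token overlap the pipeline uses; the fallback order (paragraphs, then sentences, then words, then a hard cut) is the part being illustrated.

```python
def recursive_split(text: str, max_len: int = 512,
                    separators: tuple = ("\n\n", ". ", " ")) -> list[str]:
    """Split text on the coarsest separator that works, recursing with
    finer separators only for pieces that are still too long."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_len:
                    current = candidate  # keep packing into the same chunk
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Any piece still over the limit gets re-split at a finer level.
            return [c for chunk in chunks
                      for c in recursive_split(chunk, max_len, separators)]
    # Last resort: hard character-level cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Because paragraph breaks are tried first, a well-written policy section usually survives as one coherent chunk, and the sentence- and word-level fallbacks only fire on unusually long runs of text.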
06 — Data Preparation
Cleaning and tagging documents before ingestion
Before a single document enters the pipeline, it goes through three layers of preparation. Skip any one of them and the retrieval quality degrades significantly.
Cleaning removes everything that isn't content: watermarks, headers and footers, cover pages, page numbers, duplicate whitespace, and formatting artifacts from PDF extraction. The goal is pure, clean prose that a splitter can work with.
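A cleaning pass of this kind is mostly pattern removal. The patterns below are examples, not the prototype's actual rules; every document set needs patterns tuned to its own artifacts.

```python
import re

def clean(page_text: str) -> str:
    """Strip non-content artifacts from extracted page text.
    The specific patterns here are illustrative examples."""
    text = re.sub(r"^CONFIDENTIAL.*$", "", page_text, flags=re.MULTILINE)  # watermark lines
    text = re.sub(r"^Page \d+ of \d+$", "", text, flags=re.MULTILINE)      # page-number footers
    text = re.sub(r"[ \t]+", " ", text)                                    # collapse duplicate spaces
    text = re.sub(r"\n{3,}", "\n\n", text)                                 # collapse blank-line runs
    return text.strip()
```

The output should read as plain prose; anything the splitter would otherwise treat as content (a repeated footer, a watermark) ends up polluting chunks and, later, retrieved context.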
Metadata tagging is what makes cited answers possible. Every chunk gets tagged with its source document, section name, department, page number, and last-updated date. Without this, the model can return a correct answer with no verifiable source — which is worthless in an enterprise context.
Structure awareness means ensuring that headings stay attached to their content. A chunk that contains a policy rule but not the section heading it belongs to loses its organizational context. The preparation step explicitly connects heading hierarchy to body content before splitting.
"source": "Employee_Handbook_2026.pdf",
"section": "PTO Policy",
"department": "HR",
"page": 14,
"last_updated": "Jan 2026"
}
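That metadata travels with each chunk, and the citation string in the final answer is assembled from it. A minimal sketch (the chunk text and the exact citation format are illustrative, not the prototype's output):

```python
# A chunk pairs its text with the metadata attached during preparation.
chunk = {
    "text": "Full-time employees receive 10 sick days per calendar year.",
    "metadata": {
        "source": "Employee_Handbook_2026.pdf",
        "section": "PTO Policy",
        "department": "HR",
        "page": 14,
        "last_updated": "Jan 2026",
    },
}

def citation(chunk: dict) -> str:
    """Build a human-readable citation from chunk metadata."""
    m = chunk["metadata"]
    return f"(Source: {m['source']}, Section: {m['section']}, Page {m['page']})"

print(citation(chunk))
# → (Source: Employee_Handbook_2026.pdf, Section: PTO Policy, Page 14)
```

Without those fields, the generation step has nothing to cite, which is why metadata tagging happens before ingestion rather than being bolted on afterward.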
07 — Results
The same question. Two very different answers.
The clearest way to show what data preparation actually does is a direct comparison. Same question, same model, same pipeline — different source documents.
BEFORE — Unprepared Documents
Q: How many sick days do full-time employees get?
"Based on the documents provided, employees may be entitled to sick leave, however the exact number of days could not be confirmed from the available content."
AFTER — Prepared Documents
Q: How many sick days do full-time employees get?
"Full-time employees receive 10 sick days per calendar year. (Source: Employee Handbook 2026, Section 4.2 — Leave Policy, Page 14)"
08 — Tradeoffs
What I optimized for — and what I sacrificed
Every architecture decision is a tradeoff. Here's what I chose and why:
| Decision | Tradeoff |
|---|---|
| N8N for orchestration | Less flexibility than code, but faster to build, easier to document visually, and legible to non-engineers reviewing the architecture. |
| Recursive over semantic chunking | Lower recall ceiling on complex documents, but faster, cheaper, and sufficient for prototype scope. Semantic chunking is the right path for production. |
| Claude for generation | Higher cost per query than GPT-3.5, but significantly stronger at following citation instructions and grounding answers in provided context. |
| Supabase pgvector | Not as horizontally scalable as Pinecone for millions of vectors, but zero new infrastructure for an existing Supabase stack and fully sufficient at prototype scale. |
| Prototype scope (no auth) | No role-based access, no multi-user isolation, no audit trail — all required for production. Excluded intentionally to keep the focus on the data pipeline problem. |
09 — Next Version
What production would actually require
This prototype proves the concept. A production deployment for an enterprise would require significant additional engineering before it could be trusted with real data:
10 — Lessons
What building this taught me
The most important thing this prototype surfaced is something most AI content completely glosses over: the data problem is the hard problem. Before I built this, I assumed the interesting challenge was the model selection, the prompt engineering, the retrieval strategy. Those all matter. But the thing that determines whether a RAG system is actually useful in an enterprise context is whether anyone bothered to clean and structure the source documents before ingestion. Most don't. Most fail because of it.
The second thing I took away is about explainability. RAG is not a complex architecture once you've actually built one. The pipeline is logical, linear, and easy to trace. That matters for selling it internally, for debugging it, and for convincing a risk-averse organization that this isn't a black box. The complexity isn't in the plumbing — it's in the data quality, the prompting, and the guardrails. That's where the real work is, and it's also where the real value is.
The third insight is about governance. The questions that matter most in enterprise AI aren't "which model?" or "which vector database?" They're "who has access to what?", "what happens when the system returns something wrong?", and "how do we know it's actually working over time?" Those questions require an architect's mindset more than a data scientist's. That's the gap I'm trying to fill.