Local LLM Pipeline — Case Study

01 — The Idea

What does running AI without an API actually require?

Most AI development today abstracts the infrastructure away entirely. You call an API, get a response, and never think about what's happening underneath. That works fine for shipping product. But it means you're building on assumptions you haven't tested: that the API will always be available, that your data can leave your machine, that someone else's infrastructure decisions are acceptable for your use case.

I wanted to understand what happens when those assumptions go away. No external API. No data leaving the machine. Full control over the model, the runtime, and the pipeline. The Mac Mini wasn't the workaround. It was the constraint — and the point.

The secondary goal was directly relevant to enterprise AI work. Organizations with strict data governance requirements, regulated industries, or air-gapped environments can't just call OpenAI. Understanding how to build a capable local AI stack is the foundation for being credible in those conversations.

02 — The Constraint

16GB of Unified Memory shared across everything

The M4 Mac Mini's 16GB of Unified Memory is shared between the CPU, GPU, and Neural Engine. There is no separate GPU VRAM. Everything — the OS, the inference engine, the embedding model, the vector database, the orchestration layer — competes for the same pool.

This is fundamentally different from cloud-based AI development, where you provision compute independently and scale horizontally. On this machine, the architecture had to fit inside roughly 8.5GB of active RAM, leaving the rest free for macOS and context spikes. Exceeding that means hitting swap on the SSD, which collapses inference speed under sustained load.

That single constraint shaped every decision that followed.

Component	Technology	Memory Footprint
Inference Engine	Ollama	~100 MB idle
The LLM	Llama 3.1 8B (Q4_K_M)	~4.7 GB active
Embedding Model	nomic-embed-text	~500 MB active
Vector Database	ChromaDB (Docker)	~500 MB – 1 GB
Orchestrator	n8n (Docker)	~1 GB – 1.5 GB
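
As a quick sanity check, the rows above can be totaled against the roughly 8.5GB target. This is a minimal sketch: the figures are copied from the table, 1 GB is treated as 1,000 MB for simplicity, and the names `FOOTPRINT_MB` and `budget_range` are illustrative only.

```python
# Footprints in MB, copied from the table above as (low, high) estimates.
FOOTPRINT_MB = {
    "Ollama (idle)": (100, 100),
    "Llama 3.1 8B (Q4_K_M)": (4700, 4700),
    "nomic-embed-text": (500, 500),
    "ChromaDB (Docker)": (500, 1000),
    "n8n (Docker)": (1000, 1500),
}

ACTIVE_BUDGET_MB = 8500  # the "roughly 8.5GB" target from the text


def budget_range(footprints):
    """Return the (low, high) total footprint in MB."""
    low = sum(lo for lo, _ in footprints.values())
    high = sum(hi for _, hi in footprints.values())
    return low, high


low, high = budget_range(FOOTPRINT_MB)
print(f"Planned footprint: {low / 1000:.1f} to {high / 1000:.1f} GB")
print(
    "Headroom under budget: "
    f"{(ACTIVE_BUDGET_MB - high) / 1000:.1f} to {(ACTIVE_BUDGET_MB - low) / 1000:.1f} GB"
)
```

The listed components total about 6.8 to 7.8 GB, which leaves somewhere between 0.7 and 1.7 GB of the target for anything the table does not itemize, such as context growth during a query.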

03 — Architecture

Two phases, one machine

The pipeline runs in two independent phases: ingestion and query. They share the same infrastructure but operate separately. The knowledge base gets built once during ingestion, then queried as many times as needed.

Phase 1 — Ingestion

↑ Data Source (folder / feed)

↓

⚙ n8n Pipeline Trigger

↓

✂ Chunking (1,000 chars / 200 overlap)

↓

◈ nomic-embed-text (Ollama)

↓

⊞ ChromaDB — Persistent SSD
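
The chunking step above (1,000 characters with a 200-character overlap) can be sketched in a few lines. This is an illustration of fixed-size chunking in general, not the n8n node the pipeline actually uses; the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into fixed-size character chunks that overlap their neighbors."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # each chunk starts 800 characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk.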

Phase 2 — Query

? User Prompt

↓

◈ Embed Prompt (nomic)

↓

⊞ Vector Search — Top 3 Chunks

↓

⚙ Context Injection into Prompt (n8n)

↓

◈ Llama 3.1 8B (Ollama)

↓

✓ Grounded Answer

Both phases route through n8n, which keeps the workflow visual and debuggable without writing orchestration code. The ingestion workflow runs on demand when new data is added. The query workflow runs on every prompt.
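
For readers who want to see what the query phase amounts to outside a visual tool, the step that places retrieved chunks into the prompt might look like the sketch below. It assumes the chunks arrive already sorted by similarity; the prompt wording and the name `build_grounded_prompt` are hypothetical and not taken from the actual workflow.

```python
def build_grounded_prompt(question, chunks, top_k=3):
    """Assemble a prompt that asks the model to answer from retrieved context only."""
    selected = chunks[:top_k]  # keep the top-k most similar chunks
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "How much memory does the model use?",
    ["Chunk one.", "Chunk two.", "Chunk three.", "Chunk four."],
)
print(prompt)
```

Numbering the chunks makes it easier to trace which retrieved passage an answer drew on when debugging retrieval quality.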

04 — Key Decisions

Why each component was chosen and configured the way it was

Ollama runs natively, not inside Docker. On Apple Silicon, Ollama gets its acceleration from the GPU through Metal (not the Neural Engine), and that requires direct access to the host's GPU and Unified Memory. Docker on macOS runs containers inside a Linux virtual machine with no access to the Metal GPU, so a containerized Ollama falls back to CPU-only inference and performance drops significantly. Native macOS installation is not optional here; it's the only way to use the hardware correctly.

Llama 3.1 8B at Q4_K_M quantization. Quantization reduces model precision to shrink the memory footprint. Q4_K_M is a 4-bit quantization method that brings Llama 3.1 8B from roughly 16GB (at 16-bit precision) down to 4.7GB with minimal degradation in reasoning quality. It's the highest-capability model that fits inside the memory budget alongside every other component. Going to a smaller model (3B, 1B) would have freed memory but significantly reduced the quality of synthesis. Q4_K_M is the right balance point for this hardware.
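
The size figures follow from simple arithmetic: parameter count times bits per weight. The sketch below assumes a round 8 billion parameters and 16 bits per weight for the unquantized model, and counts 1 GB as 1e9 bytes; both function names are illustrative.

```python
def model_size_gb(params_billion, bits_per_weight):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def implied_bits_per_weight(size_gb, params_billion):
    """Average bits per weight implied by a given model size."""
    return size_gb * 8 / params_billion


# Unquantized weights, assuming 16 bits each.
print(f"16-bit: {model_size_gb(8, 16):.1f} GB")

# Average bits per weight implied by the 4.7GB Q4_K_M figure in the text.
print(f"Q4_K_M: {implied_bits_per_weight(4.7, 8):.1f} bits per weight")
```

The quantized file works out to a little more than 4 bits per weight on average, which is consistent with a 4-bit scheme that spends some extra bits on scaling information and on the more sensitive tensors.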

ChromaDB and n8n run in Docker with a hard 3GB cap. Docker's memory limit setting creates a strict sandbox. If ChromaDB or n8n spike unexpectedly, they cannot consume memory that Ollama needs for inference. Without this cap, a ChromaDB indexing operation on a large document batch could push total memory usage past the threshold and force the system into swap. The cap makes the memory budget enforceable, not just planned.

Context windows stay between 4,000 and 8,000 tokens. Llama 3.1 technically supports up to 128k context. On 16GB hardware that window is not usable in practice: the model's key-value cache grows linearly with context length, so filling it would consume the memory the rest of the stack needs and slow token generation sharply. The practical ceiling for this setup is around 8k tokens, enough for meaningful multi-turn queries and several retrieved chunks without collapsing throughput.
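
To put rough numbers on that ceiling, the key-value cache can be estimated from the model's attention shape. The sketch below uses the published Llama 3.1 8B configuration (32 layers, 8 key-value heads, head dimension 128) and assumes a 16-bit cache; the runtime's actual allocation may differ, and `kv_cache_bytes` is an illustrative name.

```python
# Llama 3.1 8B attention shape; the cache precision is an assumption.
N_LAYERS = 32
N_KV_HEADS = 8       # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumes a 16-bit KV cache


def kv_cache_bytes(context_tokens):
    """Bytes needed to hold keys and values for every layer at this context length."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens


for tokens in (4096, 8192, 131072):
    print(f"{tokens:>7} tokens: {kv_cache_bytes(tokens) / 2**30:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token: about 1 GiB at 8k tokens and about 16 GiB at the full 128k window, which is the machine's entire memory before the model weights are even loaded.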

n8n connects to Ollama via host.docker.internal, not localhost. When n8n runs inside Docker, "localhost" resolves to the container itself, not the host machine where Ollama is running. The correct base URL for Ollama from inside Docker is http://host.docker.internal:11434. This is a non-obvious configuration issue that breaks the entire pipeline silently if missed.
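
The same rule applies to any script or service that runs in a container and needs to reach Ollama on the host. As a small illustration, the helper below rewrites a loopback URL to the host alias; `url_from_container` is a hypothetical name, and the sketch ignores credentials embedded in the URL.

```python
from urllib.parse import urlsplit, urlunsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def url_from_container(url, host_alias="host.docker.internal"):
    """Rewrite a loopback URL so it reaches the host machine from inside a container."""
    parts = urlsplit(url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return url  # already points somewhere other than the container itself
    netloc = host_alias if parts.port is None else f"{host_alias}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(url_from_container("http://localhost:11434"))
```

A quick way to confirm the connection from inside the container is to request the rewritten base URL plus `/api/tags`, which lists the models Ollama has installed.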

05 — Tradeoffs

What I optimized for and what I gave up

Decision	Tradeoff
Q4_K_M quantization	Slight quality degradation vs 16-bit weights, but the only way to fit a capable model in the memory budget. At 16-bit precision, Llama 3.1 8B would consume the entire 16GB alone.
Fixed-size chunking at 1,000 characters	Simpler than semantic chunking and sufficient for well-structured documents. On messy or unstructured data, retrieval quality degrades — semantic chunking would improve this at the cost of an additional model pass.
ChromaDB over a production-scale vector DB	Easier to run locally via Docker and sufficient for personal-scale knowledge bases. A production deployment serving many users would require a more scalable solution like Qdrant or Weaviate.
n8n for orchestration	Visual and fast to iterate on, but less flexible than code-based orchestration. For complex multi-step pipelines with conditional logic, a Python-based framework would offer more control.
No authentication layer	This is a personal, single-user setup. A multi-user deployment would require role-based access, user isolation, and audit logging before any sensitive data could be ingested.

06 — Lessons

What building this actually clarified

The most significant shift was understanding the difference between building on top of AI and building with it. Calling an API means you never see the infrastructure decisions. You don't encounter memory pressure, quantization tradeoffs, context window limits as a hardware reality, or the configuration gap between a containerized service and a native one. Those constraints exist in cloud deployments too — they're just invisible because someone else managed them.

Working through them directly changes how you evaluate enterprise AI deployments. When an organization says their AI system is "running locally" or "air-gapped," you now have a concrete understanding of what that actually means to implement, what it costs in terms of hardware and configuration complexity, and what the failure modes look like. That's not knowledge you pick up from documentation.

The second takeaway is about quantization. Most discussions of local LLMs treat it as a footnote. In practice, the quantization decision determines whether a given model can run at all on given hardware. Q4_K_M offers a genuinely useful quality-to-size ratio, and it holds up well for RAG use cases where the model is primarily synthesizing retrieved context rather than generating from pure parametric memory.