I spend most of my working day talking to LLMs. Architecture reviews, writing infrastructure runbooks, drafting proposals, working through tricky Terraform. For the professional and personal stuff where I’d rather not have my prompts sent to a third-party API, I run models locally.
The stack is simple: Ollama to manage and serve models, Open WebUI as the browser-based interface, both running in Docker on my home lab.
Why Local?
Three reasons:
- Privacy: Conversations about client infrastructure, internal architecture, or anything commercially sensitive don’t leave my machine
- Cost: No API bills. I run local models as much as I can, and use cloud APIs only when the task genuinely requires it
- Offline availability: Flights, trains, remote sites — local models work without a connection
The tradeoff is capability. Local models on reasonable consumer hardware lag behind the frontier models. But for a large fraction of daily tasks, they’re good enough.
Hardware
Ollama runs on my Proxmox host (the gaming laptop described in my homelab post). CPU-only inference is usable for smaller models — Llama 3.2 3B runs at a comfortable speed, and Mistral 7B is acceptable for non-latency-sensitive tasks.
If you have a machine with a CUDA GPU, inference speed improves dramatically. On a laptop without GPU passthrough to the VM, I’m CPU-bound.
Setup
Ollama
```bash
# On the host or in a Docker container
curl -fsSL https://ollama.com/install.sh | sh

# Pull models
ollama pull llama3.2:3b
ollama pull mistral:7b
ollama pull nomic-embed-text  # for embeddings / RAG
```
Ollama runs as a systemd service and listens on port 11434. It handles model storage, versioning, and GPU/CPU dispatch automatically.
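Ollama’s HTTP API streams newline-delimited JSON, one object per token chunk, with a final `"done": true` object. A small sketch of reassembling that stream into text — the parsing function and sample lines below are illustrative, not from this post, but the NDJSON shape matches what `POST /api/generate` on port 11434 returns:

```python
import json

def parse_ollama_stream(lines):
    """Concatenate the 'response' fields from Ollama's streaming NDJSON
    output (/api/generate emits one JSON object per line) until 'done'."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Illustrative stream, shaped like Ollama's NDJSON output:
sample = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world", "done": false}',
    '{"response": "!", "done": true}',
]
print(parse_ollama_stream(sample))  # Hello, world!
```

In practice the lines would come from an HTTP request to `http://localhost:11434/api/generate` with a body like `{"model": "llama3.2:3b", "prompt": "..."}`.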
Open WebUI via Docker
```yaml
# docker-compose.yml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped

volumes:
  open-webui:
```
Open WebUI gives you a ChatGPT-like interface with conversation history, model switching, system prompt configuration, and RAG (retrieval-augmented generation) if you want to chat with documents.
Access it at http://<your-lab-ip>:3000 from any device on your local network (or via Tailscale from anywhere).
Models I Use
| Model | Use case | Speed (CPU) |
|---|---|---|
| llama3.2:3b | Quick Q&A, drafting | Fast |
| mistral:7b | Code review, longer reasoning | Moderate |
| deepseek-coder:6.7b | Infrastructure code, Terraform | Moderate |
| nomic-embed-text | Embeddings for document search | Fast |
For anything requiring frontier-level reasoning — complex architectural tradeoffs, novel problem-solving — I still use Claude or GPT. Local models are a complement, not a replacement.
What Works Well
- Drafting runbooks and documentation from bullet points
- Explaining infrastructure concepts for client-facing materials
- Code review of Terraform/Ansible before I commit
- Generating boilerplate that I then edit
- Querying uploaded PDFs (architecture docs, vendor whitepapers)
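Document search in these setups boils down to ranking chunks by cosine similarity between embedding vectors. A minimal sketch of that ranking step — `cosine`, `top_k`, the document IDs, and the toy 3-dimensional vectors are all illustrative; in practice the vectors come from nomic-embed-text via Ollama, and real embeddings have hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, corpus, k=2):
    """Rank documents by cosine similarity to the query embedding.
    corpus maps doc_id -> embedding vector."""
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, corpus[d]), reverse=True)
    return ranked[:k]

# Toy "embeddings" standing in for real model output:
corpus = {
    "runbook-dns": [0.9, 0.1, 0.0],
    "runbook-backup": [0.0, 1.0, 0.1],
    "whitepaper": [0.1, 0.0, 1.0],
}
print(top_k([1.0, 0.0, 0.0], corpus, k=1))  # ['runbook-dns']
```

Open WebUI does this for you when you upload a PDF; the sketch just shows what the retrieval step is computing.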
What Doesn’t
- Long-context tasks with 100K+ tokens — 7B models struggle
- Nuanced reasoning about unfamiliar domains
- Anything where you genuinely need the best available model
Accessing It Remotely
Via Tailscale, Open WebUI is reachable from my phone and laptop from anywhere. I have it bookmarked as a PWA on mobile. It’s not as fast as cloud APIs over mobile data, but for async drafting tasks it works.
Next Steps
I’m experimenting with a local RAG pipeline using nomic-embed-text + a ChromaDB container to query my own notes and runbook library. More on that when it’s working reliably.
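The first step of any such pipeline is splitting notes into passages before embedding them. A minimal character-window chunker with overlap — the function and the window sizes are an illustrative sketch, not the pipeline described above:

```python
def chunk(text, size=400, overlap=80):
    """Split text into overlapping character windows for embedding.
    Overlap keeps sentences that straddle a boundary retrievable.
    Sizes are illustrative; tune them per embedding model."""
    step = size - overlap
    # max(..., 1) ensures short texts still yield one chunk
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

print(len(chunk("x" * 1000)))  # 3 windows: [0:400], [320:720], [640:1000]
```

Each chunk would then be embedded with nomic-embed-text and stored alongside its source ID for retrieval.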
If you have a spare machine with 16GB+ RAM, running local AI is worth the two-hour setup. The privacy and offline availability alone justify it.