
Private AI at Home with Ollama + Open WebUI

Running large language models locally means your prompts never leave your machine. Here's how I set up Ollama and Open WebUI on my home lab, what models I use daily, and where the limits are.

May 18, 2025
Ollama · Open WebUI · Docker · Proxmox VE · Llama 3 · Mistral

I spend most of my working day talking to LLMs. Architecture reviews, writing infrastructure runbooks, drafting proposals, working through tricky Terraform. For the professional and personal stuff where I’d rather not have my prompts sent to a third-party API, I run models locally.

The stack is simple: Ollama to manage and serve models, Open WebUI as the browser-based interface, both running in Docker on my home lab.

Why Local?

Three reasons:

  1. Privacy: Conversations about client infrastructure, internal architecture, or anything commercially sensitive don’t leave my machine
  2. Cost: No API bills. I run local models as much as I can, and use cloud APIs only when the task genuinely requires it
  3. Offline availability: Flights, trains, remote sites — local models work without a connection

The tradeoff is capability. Local models that fit on reasonable consumer hardware lag behind the frontier models, but for a large fraction of daily tasks they're good enough.

Hardware

Ollama runs on my Proxmox host (the gaming laptop described in my homelab post). CPU-only inference is usable for smaller models: Llama 3.2 3B runs at a comfortable speed, and Mistral 7B is acceptable for tasks that aren't latency-sensitive.

If you have a machine with a CUDA GPU, inference speed improves dramatically. On a laptop without GPU passthrough to the VM, I’m CPU-bound.
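If you want to put a number on "comfortable speed", Ollama can print timing statistics for a single run. A quick sketch (the model name and prompt are just examples; `--verbose` makes `ollama run` report load time, prompt eval rate, and generation rate in tokens per second):

```shell
# One-off prompt with timing statistics printed after the response
ollama run llama3.2:3b --verbose "Summarise what a reverse proxy does."
```

The eval rate line is the one to watch — anything in the double digits of tokens/s feels fine for interactive chat on CPU.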

Setup

Ollama

# On the host or in a Docker container
curl -fsSL https://ollama.com/install.sh | sh

# Pull models
ollama pull llama3.2:3b
ollama pull mistral:7b
ollama pull nomic-embed-text  # for embeddings / RAG

Ollama runs as a systemd service and listens on port 11434. It handles model storage, versioning, and GPU/CPU dispatch automatically.
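Out of the box the systemd service binds to 127.0.0.1, so containers and other machines can't reach it. If Open WebUI (or anything else on the network) needs to talk to Ollama, the documented fix is a systemd override that sets `OLLAMA_HOST` — a sketch:

```shell
# Create a systemd drop-in so the Ollama service listens on all interfaces
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"

# Reload unit files and restart the service to pick up the override
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

If you do this, make sure port 11434 is only reachable from your LAN or tailnet — Ollama has no authentication of its own.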

Open WebUI via Docker

# docker-compose.yml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped

volumes:
  open-webui:

Open WebUI gives you a ChatGPT-like interface with conversation history, model switching, system prompt configuration, and RAG (retrieval-augmented generation) if you want to chat with documents.

Access it at http://<your-lab-ip>:3000 from any device on your local network (or via Tailscale from anywhere).
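If something misbehaves, it's worth smoke-testing the Ollama API directly before blaming the UI. A minimal check against Ollama's standard `/api/generate` endpoint (assuming `llama3.2:3b` is already pulled and `jq` is installed):

```shell
# Ask Ollama for a single non-streamed completion and print just the response text
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Say hello in five words.", "stream": false}' \
  | jq -r '.response'
```

If this works but Open WebUI can't see any models, the problem is almost always the `OLLAMA_BASE_URL` value or container networking, not Ollama itself.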

Models I Use

Model                 Use case                         Speed (CPU)
llama3.2:3b           Quick Q&A, drafting              Fast
mistral:7b            Code review, longer reasoning    Moderate
deepseek-coder:6.7b   Infrastructure code, Terraform   Moderate
nomic-embed-text      Embeddings for document search   Fast
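Model files add up quickly — a quantized 7B model is roughly 4 GB on disk — so a little housekeeping with Ollama's built-in commands helps:

```shell
ollama list                      # installed models and their sizes
ollama show mistral:7b           # parameters, template, and quantization details
ollama rm deepseek-coder:6.7b    # remove a model you no longer use
```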

For anything requiring frontier-level reasoning — complex architectural tradeoffs, novel problem-solving — I still use Claude or GPT. Local models are a complement, not a replacement.

What Works Well

  • Drafting runbooks and documentation from bullet points
  • Explaining infrastructure concepts for client-facing materials
  • Code review of Terraform/Ansible before I commit
  • Generating boilerplate that I then edit
  • Querying uploaded PDFs (architecture docs, vendor whitepapers)

What Doesn’t

  • Long-context tasks with 100K+ tokens — 7B models struggle
  • Nuanced reasoning about unfamiliar domains
  • Anything where you genuinely need the best available model

Accessing It Remotely

Via Tailscale, Open WebUI is reachable from my phone and laptop anywhere. I have it bookmarked as a PWA on mobile. It's not as fast as cloud APIs over mobile data, but for async drafting tasks it works.

Next Steps

I’m experimenting with a local RAG pipeline using nomic-embed-text + a ChromaDB container to query my own notes and runbook library. More on that when it’s working reliably.
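The embedding half of that pipeline is already exposed by Ollama. A minimal sketch of embedding one note chunk (the `/api/embeddings` endpoint returns an `embedding` array; the prompt text is a made-up example, and storing the vector in ChromaDB is a separate step):

```shell
# Generate an embedding vector for a chunk of text with nomic-embed-text,
# then print its dimensionality as a sanity check
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "Restart the reverse proxy after cert renewal."}' \
  | jq '.embedding | length'
```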

If you have a spare machine with 16GB+ RAM, running local AI is worth the two-hour setup. The privacy and offline availability alone justify it.
