A Hacker News thread titled "Who is using Ollama day-to-day?" drew detailed responses from developers, researchers, and self-hosters describing exactly how they run local LLMs. The picture that emerges is narrower and more practical than the broad "local AI" narrative. Most serious users are there for one of two reasons: data they can't send to OpenAI or Anthropic, or LLM call volumes that would make cloud pricing untenable.

Ollama wraps llama.cpp into a server with an OpenAI-compatible REST API, making it a drop-in for tools already built around GPT-4. That compatibility is the single most-cited adoption driver in the thread. Developers running LangChain pipelines, Continue.dev, or Aider pointed out they switched to Ollama without touching their integration code — just swapped the endpoint and API key.
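That endpoint swap is straightforward to sketch. Assuming a default Ollama install serving on port 11434, a request against its OpenAI-compatible chat completions route only needs a changed base URL and a placeholder key; the `build_chat_request` helper and the `llama3` model tag below are illustrative, not from the thread:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port

def build_chat_request(model: str, prompt: str, url: str = OLLAMA_URL):
    """Build an OpenAI-style chat completion request aimed at a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key, but OpenAI-style clients expect one to be set.
            "Authorization": "Bearer ollama",
        },
    )

req = build_chat_request("llama3", "Summarize this paragraph: ...")
# With a local Ollama server running, send it and read the usual response shape:
# body = json.load(urllib.request.urlopen(req))
# print(body["choices"][0]["message"]["content"])
```

The same swap works through the official OpenAI SDKs by pointing their `base_url` at `http://localhost:11434/v1`, which is why LangChain and Aider setups needed no integration changes.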

The models getting the most daily use are quantized 4-bit GGUF variants of Meta's Llama 3, Mistral 7B, and Google's Gemma 2. Users on Apple Silicon M-series chips or NVIDIA GPUs with 16 to 32GB of VRAM describe this as the practical operating range. Quality drops compared to GPT-4o are real and acknowledged, but for <a href="/news/2026-03-15-godex-building-a-free-ai-coding-agent-with-mcp-servers-and-local-llms-via-ollama">offline coding assistance</a>, document summarization via RAG, and private chat through Open WebUI, commenters say the trade-off holds.

The thread also surfaces a longer tail of uses: <a href="/news/2026-03-15-promptcmd-llm-prompts-as-cli-commands">scripted batch document processing</a>, home assistant integrations, and NPC dialogue generation in indie game development. Ollama's Docker support and cross-platform builds across macOS, Linux, and Windows make it easy to slot into homelab setups and self-hosted pipelines.
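A batch-processing pipeline of the kind commenters describe can be sketched against Ollama's native `/api/generate` endpoint. This is a minimal illustration, not a script from the thread; `summarize_texts`, the prompt wording, and the `mistral` model tag are assumptions, and the HTTP call is injectable so the loop can be exercised without a running server:

```python
import json
import urllib.request

def _post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def summarize_texts(texts, model="mistral", post=_post_json,
                    url="http://localhost:11434/api/generate"):
    """Summarize each text with a local model, one request per document."""
    summaries = []
    for text in texts:
        body = post(url, {
            "model": model,
            "prompt": f"Summarize in two sentences:\n\n{text}",
            "stream": False,  # return one JSON object instead of a token stream
        })
        summaries.append(body["response"])
    return summaries
```

Because every call stays on localhost, the per-document cost is zero, which is exactly the volume argument the thread's batch users make.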

The friction points are consistent and specific. GPU VRAM is the hard ceiling: it caps which model sizes are usable on most hardware. Multi-GPU support and model sharding are still incomplete, so users who want to run larger models are either stuck or splitting workloads manually. Response latency on anything above a 13B-parameter model is a real problem for interactive applications, not just a benchmark footnote.
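The VRAM ceiling follows from simple arithmetic. As a rule of thumb, the weights alone need roughly parameters times bits-per-weight divided by 8 bytes, before counting KV cache and runtime overhead; the helper below is an illustrative back-of-the-envelope calculation, not an Ollama API:

```python
def approx_weight_gb(params_billions: float, bits: int = 4) -> float:
    """Rough lower bound on memory for model weights alone:
    parameters x bits-per-weight / 8 bytes per weight.
    KV cache, activations, and runtime overhead add more on top."""
    return params_billions * bits / 8

# A 7B model at 4-bit quantization needs about 3.5 GB just for weights...
print(approx_weight_gb(7, 4))   # 3.5
# ...while a 70B model at 4-bit needs ~35 GB, beyond any single consumer GPU.
print(approx_weight_gb(70, 4))  # 35.0
```

That gap between what fits on a 16 to 32 GB card and what the largest open models require is why incomplete sharding support bites so hard.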

The thread serves as a useful snapshot of where local LLM tooling actually stands in early 2026: capable enough for production automation workflows, still constrained enough that hardware specs determine what you can run.