Nicolas Mowen, writing under the handle crzynik on the Home Assistant community forums, has published a comprehensive guide documenting his multi-year effort to replace a Google Home setup with a fully local voice assistant. The October 2025 post covers the complete stack: Home Assistant Assist as the orchestration layer, llama.cpp as the preferred model runner (chosen over Ollama for performance), Wyoming ONNX ASR with Nvidia Parakeet V2 for speech-to-text (roughly 0.3-second CPU inference via OpenVINO), and Kokoro TTS for voice output. On the hardware side, Mowen tested GPU configurations ranging from an RTX 3050 (8GB, adequate for 4B-parameter models) up to an RTX 3090 or RX 7900XTX (24GB); the 24GB cards can run 20B-to-30B mixture-of-experts models with 1-to-2-second response times after prompt caching, hosted on Beelink MiniPCs with USB4 eGPU enclosures. For comparison, <a href="/news/2026-03-14-runanywhere-launches-rcli-on-device-voice-ai-with-proprietary-metalrt-inference">competing platforms like RunAnywhere RCLI achieve sub-200ms latency on Apple Silicon</a>.
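In a stack like this, the LLM layer typically talks to llama.cpp through the OpenAI-compatible HTTP API that its `llama-server` binary exposes at `/v1/chat/completions`. A minimal sketch of how a pipeline might package an ASR transcript as a chat request — the URL, system prompt, and parameter values here are illustrative assumptions, not Mowen's actual configuration:

```python
import json

# Assumed local endpoint: llama-server's OpenAI-compatible API
# (default port 8080). Adjust host/port to your deployment.
LLAMA_URL = "http://localhost:8080/v1/chat/completions"

def build_assist_request(system_prompt: str, transcript: str,
                         temperature: float = 0.2) -> dict:
    """Package an ASR transcript as a chat-completion payload."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript},
        ],
        "temperature": temperature,  # low temperature: predictable device control
        "stream": True,              # stream tokens so TTS can start speaking early
    }

payload = build_assist_request(
    "You are a voice assistant. Answer in short, TTS-friendly sentences.",
    "turn off the kitchen lights",
)
body = json.dumps(payload).encode()
# POST `body` to LLAMA_URL (urllib/requests) once a server is running.
```

Streaming matters here because time-to-first-token, not total generation time, is what the user perceives as latency once TTS can begin mid-response.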

The guide places unusual emphasis on prompt engineering. A companion GitHub Gist provides a detailed system prompt encoding TTS-safe response formatting rules, a decision hierarchy for handling unclear or garbled input, and phonetic transcription error correction — illustrating that making open-weight LLMs like Qwen3-30B-A3B or GLM-4.7 Flash behave reliably as voice interfaces requires substantial tuning beyond model selection and quantization. Mowen benchmarks models across practical criteria including multi-device tool calls, contextual awareness, misheard command parsing, and resistance to false activations. He finds that higher-precision GGUF quantizations sourced from HuggingFace via Unsloth perform notably better than the lower-precision defaults bundled with Ollama.
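The "TTS-safe formatting" problem is concrete: speech synthesizers like Kokoro will happily read asterisks, bullet markers, and unit symbols aloud. Mowen's Gist asks the model to avoid these in the first place; a defensive post-processing pass is a common complementary approach. The substitution rules below are assumptions for illustration, not the Gist's actual rules:

```python
import re

def tts_safe(text: str) -> str:
    """Strip characters that TTS engines tend to verbalize badly.
    Hypothetical sketch; Mowen's system prompt handles this model-side."""
    text = re.sub(r"[*_`#]+", "", text)                  # markdown emphasis/headers
    text = re.sub(r"^\s*[-•]\s*", "", text, flags=re.M)  # bullet markers
    text = text.replace("°C", " degrees Celsius")        # expand units for speech
    text = re.sub(r"\s+", " ", text).strip()             # collapse whitespace
    return text

print(tts_safe("**Kitchen** lights are now _off_. Temp: 21°C"))
```

Handling this in the system prompt rather than in post-processing has the advantage that the model also adapts sentence structure (short declaratives, no lists), which no regex pass can recover after the fact.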

Hacker News commentary surfaces two friction points that the guide itself does not fully resolve. Wake word detection was identified by commenter hamdingers as achieving less than 50 percent of the reliability of commercial Echo devices — a UX-level failure that makes the overall experience feel broken regardless of how well the rest of the system performs. On the TTS side, Lily Clifford, co-founder of Rime AI, attributed the unnatural prosody of tools like Kokoro and Piper to their read-speech training data, recommending Coqui XTTS-v2 as the best self-hosted alternative for more natural English output. <a href="/news/2026-03-14-opentoys-open-source-ai-toy-platform-esp32-voice-cloning">Other open-source projects like OpenToys</a> emphasize TTS quality through voice cloning and alternative voices, acknowledging the same limitations Clifford identified. A third commenter proposed sidestepping wake word detection entirely by routing voice commands through analog landline phones via a Grandstream HT801 ATA adapter and Home Assistant's VoIP integration.

The local voice assistant stack is genuinely competitive with commercial products for device control and general Q&A — the hard work is now mostly in the setup, not the fundamentals. Two gaps remain: wake word detection still lags behind Alexa, and TTS prosody sounds robotic in most configurations. Both are known, tractable problems. For technically inclined users with mid-range GPU hardware, the question is no longer whether a local voice assistant can work, but how much configuration time they are willing to spend.