Hugging Face has published official documentation laying out a path to running a full coding agent entirely on local hardware, combining llama.cpp — the C/C++ inference engine — with Pi, a coding agent now integrated into Hugging Face's tooling. The guide walks developers through configuring a hardware profile on HF Hub, browsing compatible GGUF-quantized models from the catalogue, spinning up an OpenAI-compatible API server via llama.cpp, and pointing Pi at that local endpoint. The result is a Claude Code-style agentic coding experience with no API costs and no data leaving the machine.
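The server half of that workflow can be sketched in two commands; the model repo, quantization tag, and port below are illustrative placeholders, and exact flags can vary across llama.cpp versions:

```shell
# Launch llama.cpp's OpenAI-compatible server (llama-server ships with llama.cpp).
# The -hf flag pulls a GGUF model directly from the Hugging Face Hub;
# the repo and quant tag here are placeholders, not a specific recommendation.
llama-server -hf unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M --port 8080

# In a second terminal, confirm the endpoint speaks the OpenAI chat API:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```

This is the endpoint Pi is then pointed at; the launch commands Hugging Face generates from a model page follow the same shape with the hardware-appropriate quantization filled in.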

Pi is installed as an npm package and configured through a JSON file specifying the local server's base URL and model ID. Hugging Face's Local Apps feature ties the experience together: users register their hardware specs and get tailored model recommendations plus ready-to-run launch commands directly from model pages, cutting down the usual friction around quantization formats and server flags.
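A config along these lines is what the paragraph describes; the key names below are illustrative of the pattern (base URL plus model ID), not Pi's documented schema, so check Pi's own docs for the exact fields:

```json
{
  "baseUrl": "http://localhost:8080/v1",
  "model": "qwen2.5-coder-7b-instruct",
  "apiKey": "not-needed-locally"
}
```

The base URL targets the llama.cpp server's OpenAI-compatible `/v1` prefix; the API key is typically a dummy value, since the local server does no authentication by default.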

For those who want to go even leaner, the documentation also covers llama-agent, a fork by Gary149 that embeds the entire agent loop directly into the llama.cpp binary. With no Node.js process and no HTTP round-trips between agent and model, llama-agent eliminates external dependencies entirely — a single compiled binary handles model inference, tool calls, subagents, and an optional HTTP API server mode.
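In single-binary mode the invocation collapses to something like the following; the flag names here are hypothetical, borrowed from llama.cpp conventions, so treat this as a sketch and consult the llama-agent README for the real interface:

```shell
# One process, no Node.js, no HTTP hop between agent and model:
# the agent loop and inference run in the same compiled binary.
# Flags are illustrative, not llama-agent's documented options.
./llama-agent -m qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  --prompt "Fix the failing test in src/parser.cpp"
```

The trade-off is flexibility for footprint: the llama.cpp-plus-Pi setup swaps models and agents independently, while llama-agent bakes both into one artifact.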