A coding agent that runs entirely on a MacBook, at 72 tokens a second

Kyle Howells has published a working recipe for running a coding agent with no cloud at all: Gemma 4 26B-A4B, a roughly 16GB Q4 GGUF, under llama.cpp with Metal, on a 64GB Apple M1 Max, driving the open terminal agent Pi.

The number that matters is the speed-up from Gemma 4's new multi-token-prediction (MTP) draft model. Howells clocked the base model at 58.2 generation tokens a second; bolting on the MTP draft head for speculative decoding lifted it to 69.2, and sweeping the spec-draft-n-max setting reached 72.2, about 24% faster than the base model. He notes the gain matters most when an agent is firing off many small tool calls rather than writing one long answer.
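
The arithmetic behind those figures is easy to check. The sketch below takes the three reported generation speeds and works out the relative gain of each MTP configuration over the base model; the variable and function names are illustrative and are not taken from Howells's write-up.

```python
# Generation speeds in tokens per second, as reported in the article.
BASE_TPS = 58.2        # base model, no draft head
MTP_TPS = 69.2         # with the MTP draft head
MTP_TUNED_TPS = 72.2   # after sweeping spec-draft-n-max


def speedup_percent(baseline: float, measured: float) -> float:
    """Return the percentage gain of `measured` over `baseline`."""
    return (measured / baseline - 1.0) * 100.0


mtp_gain = speedup_percent(BASE_TPS, MTP_TPS)
tuned_gain = speedup_percent(BASE_TPS, MTP_TUNED_TPS)

print(f"MTP draft head: +{mtp_gain:.1f}% over base")
print(f"MTP, tuned:     +{tuned_gain:.1f}% over base")
```

The draft head alone accounts for a gain of roughly 19%, and tuning the draft length adds about five more percentage points.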

With the frontier labs tightening access, and Fable and Mythos newly export-controlled, a 26B model that holds near 70 tokens a second on a laptop is a useful reminder that "good enough, and entirely yours" is now a real option for everyday coding.