Gemma 4 just made local agentic coding actually work. Daniel Vaughan spent a day testing Google's latest model on two machines and found that the 86.4% function-calling score on tau2-bench isn't marketing fluff. It's the difference between a model that can't call tools and one that can. Previous Gemma generations scored 6.6% on the same benchmark. That's basically broken.
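What "calling tools" actually means here: the agent harness hands the model a function schema and expects a structured call back, not prose. Below is a minimal sketch of that loop against a local OpenAI-compatible endpoint; the URL, model tag, and `read_file` tool are illustrative placeholders, not details from Vaughan's write-up.

```python
from openai import OpenAI

# Local OpenAI-compatible server (e.g. Ollama's default port); the URL
# and model tag are placeholders, not from the article.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool for illustration
        "description": "Return the contents of a file in the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gemma-4",  # placeholder tag
    messages=[{"role": "user", "content": "What does setup.py import?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    # A tool-capable model emits a well-formed structured call: the right
    # function with the right arguments. This is what tau2-bench scores.
    call = msg.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    # Prose instead of a call: the failure mode of the 6.6% generations.
    print(msg.content)
```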

Speed and quality pulled in opposite directions. Vaughan's M4 Pro MacBook Pro generated tokens 5.1x faster than the Dell GB10, because it ran the 26B Mixture of Experts variant, which activates only 3.8 billion parameters per token, while the GB10 ran the dense variant with all 31 billion active. But the Mac's speed advantage got eaten by retries. The MoE variant left dead code in files, took five attempts to write a test suite, and needed ten tool calls for what the GB10 did in three. The slower GB10 produced working code on the first try. For comparison, GPT-5.4 finished in 65 seconds with the cleanest output of all. A model that gets it right the first time beats one that's fast but needs multiple attempts.
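The speed gap falls straight out of the MoE mechanics: a router scores the experts for each token and only the top few actually run, so compute scales with active parameters rather than total ones. Here's a toy sketch of top-k routing; the layer sizes and k=2 are illustrative, not Gemma 4's real configuration.

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Toy top-k Mixture of Experts layer.

    Only k experts run per token, so per-token compute tracks active
    parameters, not total parameters. The article's 26B-total /
    3.8B-active split is this idea at model scale.
    """
    logits = x @ router_w                 # router score for each expert
    top = np.argsort(logits)[-k:]         # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the chosen experts
    # Only the selected experts' weight matrices are touched this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 64, 8
x = rng.standard_normal(d)
router_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

out = moe_layer(x, router_w, experts)     # 2 of 8 experts did the work
print(out.shape)
```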

Setup was the real pain point. Ollama's Flash Attention implementation freezes on Apple Silicon with prompts longer than 500 tokens, and Codex CLI's system prompt alone is roughly 27,000 tokens, so every session tripped the bug immediately. Vaughan had to switch to llama.cpp with six carefully configured flags just to avoid out-of-memory crashes. The GB10 path wasn't any smoother: vLLM failed on a PyTorch version incompatibility with Blackwell's ARM architecture, and Ollama v0.20.5 ended up being the only thing that worked on NVIDIA's side, even as newer headless tools begin to challenge Ollama's dominance in this space.
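The general shape of the llama.cpp workaround: start `llama-server` with explicit context and batch limits so the ~27,000-token system prompt fits without ballooning the KV cache. The flags below are real llama.cpp options, but the values, and which six flags Vaughan actually set, are assumptions for illustration.

```python
import subprocess

# Launch llama.cpp's server with explicit memory limits. All flags exist
# in llama.cpp; the specific values here are guesses, not Vaughan's.
cmd = [
    "llama-server",
    "-m", "gemma-4.gguf",      # placeholder model path
    "--ctx-size", "32768",     # cap the KV cache: room for the ~27k-token
                               # Codex system prompt, but no more
    "--n-gpu-layers", "99",    # offload all layers to Metal
    "--batch-size", "512",     # smaller batches shrink scratch buffers
    "--ubatch-size", "128",
    "--parallel", "1",         # one server slot, so context isn't multiplied
    "--mlock",                 # pin weights in RAM to avoid mid-run paging
]
subprocess.run(cmd, check=True)
```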

Gemma 4 is good enough for local agentic coding now. But pick the bigger model if you have the memory. The 26B MoE variant wins on raw speed, yet the 31B Dense variant's first-attempt accuracy matters more than tokens per second when your agent is making tool calls and writing real code.
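Whether you "have the memory" is quick arithmetic: weights at the quant's bits per parameter, plus KV cache for the context you actually run. A rough sketch follows; the 4.5 bits/weight assumes a Q4_K_M-style quant, and the KV-cache figure is a placeholder since the article doesn't give Gemma 4's attention geometry. Note that MoE saves compute, not memory: all 26 billion parameters still have to be resident.

```python
def est_memory_gb(params_b, bits_per_weight=4.5,
                  ctx_tokens=32_768, kv_bytes_per_token=160_000):
    """Back-of-envelope memory for a quantized local model.

    bits_per_weight ~4.5 matches a Q4_K_M-style quant; kv_bytes_per_token
    is a placeholder that depends on layer count, KV heads, and head dim,
    none of which the article specifies.
    """
    weights = params_b * 1e9 * bits_per_weight / 8   # bytes for weights
    kv_cache = ctx_tokens * kv_bytes_per_token       # bytes for KV cache
    return (weights + kv_cache) / 1e9

# MoE is faster per token, but every parameter still sits in memory,
# so the two variants land in the same ballpark.
for name, p in [("31B Dense", 31), ("26B MoE", 26)]:
    print(f"{name}: ~{est_memory_gb(p):.0f} GB")
```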