A new integration guide from Unsloth has surfaced a significant performance bug in Claude Code: an attribution header the tool injects on every API request breaks llama-server's context caching, slowing inference by roughly 90% when using a local model backend. The fix is one line: setting CLAUDE_CODE_ATTRIBUTION_HEADER=0 disables the header entirely and restores normal speeds. But developers running self-hosted setups had no obvious way to know it was needed.
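In practice the workaround is just an environment variable set before launching the tool. A minimal sketch (the variable name comes from the guide; `claude` is the Claude Code CLI entry point):

```shell
# Disable the injected attribution header so llama-server's
# context (KV) cache stays valid across consecutive requests.
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
claude
```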

The problem compounds in agentic workflows. Claude Code makes many sequential tool calls during a coding session, and each one forces a full KV cache rebuild on the local server. Across dozens of turns, the latency hit stacks up fast.

The guide itself is a walkthrough for running two open-source models locally — Alibaba's Qwen3.5-35B-A3B and THUDM's GLM-4.7-Flash — via llama.cpp's OpenAI-compatible server endpoint, routing Claude Code's agent calls through them instead of Anthropic's API. Unsloth rates both as strong performers for agentic coding as of early 2026. The setup targets consumer hardware, with the recommended configuration fitting within 24GB of RAM or unified memory, putting it within reach of RTX 4090 owners and Apple Silicon users.
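The rerouting step can be sketched roughly as follows, assuming the local server exposes an endpoint Claude Code can talk to; the host, port, and exact override mechanism are assumptions here, and the guide has the authoritative details:

```shell
# Point Claude Code at a locally running llama-server instead of
# Anthropic's hosted API (address/port are placeholders).
export ANTHROPIC_BASE_URL="http://127.0.0.1:8080"
# Also disable the cache-busting attribution header (see above).
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
claude
```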

Unsloth provides optimized UD-Q4_K_XL quantizations for both models on Hugging Face. The guide covers building llama.cpp from source with CUDA or Metal support, pulling the quantized weights, and configuring the server with sampling parameters suited to agentic use — temperature 0.6, top-p 0.95, top-k 20 for Qwen3.5, with KV cache quantized to q8_0 to keep VRAM pressure down.
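A build-and-serve sequence along those lines might look like this. The GGUF filename is illustrative (download the actual UD-Q4_K_XL file from Unsloth's Hugging Face page), and the port and GPU-offload count are assumptions; the sampling and KV cache values match those quoted above:

```shell
# Build llama.cpp with CUDA support (use -DGGML_METAL=ON on Apple Silicon).
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve the quantized model with the guide's sampling parameters
# and a q8_0-quantized KV cache to reduce VRAM pressure.
./build/bin/llama-server \
  -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -ngl 99 --port 8080
```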

The company also notes that Claude Code has changed substantially since January 2026, requiring several manual configuration toggles that earlier versions didn't need — a reminder of how quickly the tooling landscape around local LLM development is shifting.

The full documentation, including hardware fallback options and model-specific configuration notes, is available at unsloth.ai/docs/basics/claude-code.