Someone got a 3.1GB language model running entirely in the browser. The Prompt-to-Excalidraw demo takes Google's Gemma 3n E2B and runs it locally via WebGPU, turning text descriptions into diagrams at over 30 tokens per second. No server. No API key. Just Chrome and about 3GB of RAM.
How the model fits in browser memory is the interesting part. The TurboQuant algorithm compresses the KV cache by 2.4× using polar decomposition and quantized Johnson-Lindenstrauss (QJL) transforms: it rotates the cached key/value vectors into a better-conditioned basis before quantizing them, preserving geometric relationships that naive per-coordinate quantization would distort.
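The rotate-then-quantize idea can be sketched in a few lines. This is an illustrative toy, not the TurboQuant implementation: it uses a fixed 2D rotation standing in for the orthogonal transforms the article describes, and a symmetric int4-style quantizer with a shared scale.

```typescript
// Toy rotate-then-quantize round trip. The function names and the 2x2
// rotation are illustrative assumptions, not TurboQuant's actual code.

// Apply a fixed 2D rotation (an orthogonal transform, like the
// JL-style rotations mentioned above) to an (x, y) pair.
function rotate(v: [number, number], theta: number): [number, number] {
  const c = Math.cos(theta);
  const s = Math.sin(theta);
  return [c * v[0] - s * v[1], s * v[0] + c * v[1]];
}

// Symmetric int4-style quantization: snap a float to one of 16 levels.
function quantize(x: number, scale: number): number {
  return Math.max(-8, Math.min(7, Math.round(x / scale)));
}

function dequantize(q: number, scale: number): number {
  return q * scale;
}

// Round trip: rotate, quantize, dequantize, rotate back.
function roundTrip(
  v: [number, number],
  theta: number,
  scale: number
): [number, number] {
  const r = rotate(v, theta);
  const deq: [number, number] = [
    dequantize(quantize(r[0], scale), scale),
    dequantize(quantize(r[1], scale), scale),
  ];
  return rotate(deq, -theta);
}

// A vector with one large and one small coordinate survives the
// round trip with small error despite the coarse 16-level grid.
const recovered = roundTrip([1.0, 0.1], Math.PI / 4, 0.125);
```

Because the rotation is orthogonal, it doesn't amplify quantization error on the way back; the win is that rotating first evens out per-coordinate magnitudes, so a single shared scale wastes fewer quantization levels on outlier channels.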
The demo also has the model output compact code (~50 tokens) instead of raw Excalidraw JSON (~5,000 tokens). That's a 100× reduction in what the model needs to generate, and generation time scales with output length: at 30 tokens per second, 50 tokens take under two seconds, while 5,000 would take nearly three minutes.
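The mechanism looks something like the sketch below. The command grammar is invented for illustration (the article doesn't specify the demo's actual format): the model emits terse drawing commands, and cheap client-side code expands them into Excalidraw-shaped objects.

```typescript
// Hypothetical mini-DSL expander. The model generates short lines like
// "rect 10 20 100 50"; the client expands each into a full element
// object, so the verbose JSON never has to be token-by-token generated.

interface Shape {
  type: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

function expand(dsl: string): Shape[] {
  return dsl
    .trim()
    .split("\n")
    .map((line) => {
      const [type, x, y, w, h] = line.trim().split(/\s+/);
      return {
        type,
        x: Number(x),
        y: Number(y),
        width: Number(w),
        height: Number(h),
      };
    });
}

// ~40 characters of model output stand in for much larger JSON.
const shapes = expand("rect 10 20 100 50\nellipse 150 20 80 80");
```

The design point: expansion runs at native speed on the client, while every token the model generates costs inference time, so any structure that can be reconstructed deterministically should be moved out of the model's output.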
Most browser-based LLM demos have struggled to hit conversational speeds. WebLLM and similar projects typically manage 5-15 tokens/second on smaller models. This demo's combination of aggressive quantization and smart output formatting pushes past that barrier.
The trade-offs are real. Desktop Chrome 134+ only, because the demo relies on WebGPU subgroups, a feature Safari and Firefox (and therefore every iOS browser) don't support yet. Mobile browsers won't work regardless, since they cap per-tab memory well below the ~3GB this needs. And yes, you're downloading a 3.1GB model. Hacker News commenters flagged that as a real problem, with suggestions for specialized CDNs to handle multi-gigabyte in-browser model downloads.
The TurboQuant implementation runs entirely in WGSL compute shaders. That's the real signal. Not just that a model runs in a browser, but that compression math can make it fast enough to be useful without dedicated ML hardware or server infrastructure.