You can now run a 3.1GB language model in your browser. No server. No API key. No waiting. A developer named Chong built a demo that runs Google's Gemma 4 E2B model using WebGPU to generate Excalidraw diagrams from text prompts. Type something like "OAuth 2.0 authorization code flow as a sequence diagram" and it draws it, locally, at over 30 tokens per second.

The trick is compression. Browser GPUs don't have much memory. Chong's TurboQuant algorithm implements polar decomposition and QJL rotation in WGSL compute shaders, compressing the model's KV cache by about 2.4x. That's what makes longer conversations possible without blowing past your GPU memory limit.

A second optimization helps too. Raw Excalidraw JSON needs around 5,000 tokens. The model generates compact code instead, averaging about 50 tokens. Less output, faster generation, less memory pressure.

The catch: desktop Chrome 134+, WebGPU support, and roughly 3GB of free RAM. Safari and iOS are out because they lack the required WebGPU subgroup features, and mobile browsers cap memory well below what this needs.

Chong released the TurboQuant algorithm as the turboquant-wasm npm package. Running a 3-billion-parameter model at 30 tokens per second in a browser tab would have sounded absurd two years ago. Now the question is what else fits.
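TurboQuant's actual polar decomposition and QJL rotation are not reproduced here, but the underlying idea of KV-cache compression can be sketched with a toy per-group absmax quantizer. This is a minimal TypeScript illustration with made-up parameter names, not the demo's algorithm, and its arithmetic compression ratio differs from the reported ~2.4x:

```typescript
// Toy KV-cache compression sketch: per-group absmax quantization.
// NOT TurboQuant's polar/QJL scheme — just the simplest round-trip
// quantizer, to show where the memory savings come from.

const GROUP = 32; // elements sharing one scale factor
const BITS = 4;   // bits per quantized element

function quantize(values: Float32Array): { codes: Int8Array; scales: Float32Array } {
  const nGroups = Math.ceil(values.length / GROUP);
  const codes = new Int8Array(values.length); // 4-bit codes, stored in int8 for simplicity
  const scales = new Float32Array(nGroups);
  const qmax = (1 << (BITS - 1)) - 1; // 7 for signed 4-bit
  for (let g = 0; g < nGroups; g++) {
    const start = g * GROUP;
    const end = Math.min(start + GROUP, values.length);
    let absmax = 0;
    for (let i = start; i < end; i++) absmax = Math.max(absmax, Math.abs(values[i]));
    const scale = absmax / qmax || 1;
    scales[g] = scale;
    for (let i = start; i < end; i++) {
      codes[i] = Math.max(-qmax, Math.min(qmax, Math.round(values[i] / scale)));
    }
  }
  return { codes, scales };
}

function dequantize(codes: Int8Array, scales: Float32Array): Float32Array {
  const out = new Float32Array(codes.length);
  for (let i = 0; i < codes.length; i++) {
    out[i] = codes[i] * scales[Math.floor(i / GROUP)];
  }
  return out;
}

// Versus an fp16 cache: 16 bits/element become 4 bits plus the amortized
// per-group scale (16 bits / 32 elements = 0.5 bits).
const bitsPerElem = BITS + 16 / GROUP; // 4.5 bits
const ratio = 16 / bitsPerElem;        // ~3.6x for this toy scheme
```

The real scheme trades raw ratio for accuracy: rotating the cache before quantizing (as QJL-style methods do) spreads outliers across dimensions so fewer bits are wasted on a handful of extreme values.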
Gemma 4 E2B runs entirely in your browser, draws Excalidraw locally
A browser-based demo runs Google's Gemma 4 E2B language model entirely client-side, using WebGPU to generate Excalidraw diagrams from text prompts. The LLM outputs compact code (~50 tokens) instead of raw Excalidraw JSON (~5,000 tokens), and the TurboQuant algorithm (polar decomposition + QJL rotation) compresses the KV cache by ~2.4x so longer conversations fit in GPU memory. Requires desktop Chrome 134+ with WebGPU support and ~3GB of free RAM.
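The compact-code trick is worth a sketch: the model emits a short command string, and a deterministic local expander produces the verbose element objects. The mini-DSL below is hypothetical (the demo's actual format is not documented here), and the element shape is a simplified stand-in for Excalidraw's real JSON:

```typescript
// Why ~50 tokens of compact code can replace ~5,000 tokens of JSON:
// the model writes terse commands; a local expander does the verbose part.
// DSL and element shape are illustrative, not the demo's real format.

type Element = {
  type: "rectangle" | "arrow";
  id: string;
  x: number;
  y: number;
  width: number;
  height: number;
};

// e.g. "rect a 0 0 120 60 | rect b 200 0 120 60 | arrow a b"
function expand(src: string): Element[] {
  const els: Element[] = [];
  const byId = new Map<string, Element>();
  for (const cmd of src.split("|").map(s => s.trim()).filter(Boolean)) {
    const parts = cmd.split(/\s+/);
    if (parts[0] === "rect") {
      const [, id, x, y, w, h] = parts;
      const el: Element = { type: "rectangle", id, x: +x, y: +y, width: +w, height: +h };
      byId.set(id, el);
      els.push(el);
    } else if (parts[0] === "arrow") {
      // Arrow from the right edge of box a to the left edge of box b.
      const a = byId.get(parts[1])!;
      const b = byId.get(parts[2])!;
      els.push({
        type: "arrow",
        id: `${parts[1]}->${parts[2]}`,
        x: a.x + a.width,
        y: a.y + a.height / 2,
        width: b.x - (a.x + a.width),
        height: (b.y + b.height / 2) - (a.y + a.height / 2),
      });
    }
  }
  return els;
}

const compact = "rect a 0 0 120 60 | rect b 200 0 120 60 | arrow a b";
const elements = expand(compact); // 3 verbose elements from one short line
```

Beyond token count, the expander guarantees structurally valid output: the model can't emit malformed JSON if it never emits JSON at all.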