TeamChong shipped a demo that runs Gemma 4 E2B entirely inside desktop Chrome. No server calls. No API keys. You type a prompt describing a diagram, and the model generates an Excalidraw drawing right there in your browser. It hits over 30 tokens per second on modern hardware, fast enough to feel responsive.

The trick is TurboQuant, a KV cache compression algorithm from Google Research's upcoming ICLR 2026 paper. TeamChong reimplemented it in WGSL compute shaders to run on your GPU through WebGPU. Compression sits at roughly 2.4x, so longer outputs fit in GPU memory without choking. There's also an output shortcut: the model generates about 50 tokens of compact code instead of roughly 5,000 tokens of raw Excalidraw JSON.

Don't get too excited if you're not on desktop Chrome 134 or later. The demo needs WebGPU subgroups, which Safari, iOS, and mobile browsers don't support yet. You'll also need around 3GB of RAM free.

But a 3 billion parameter model running locally in a browser tab, producing useful output at conversational speeds? That's a real milestone for local-first AI. TeamChong also maintains a WASM+SIMD version of TurboQuant for CPU-side vector search via npm.
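The write-up doesn't spell out TurboQuant's internals, but the general shape of KV cache quantization is well known: store attention keys and values as low-bit integers with a shared scale per group instead of fp16. Here's a minimal sketch in plain TypeScript, assuming symmetric 4-bit quantization with a 16-bit scale per group of 32 — these parameters are illustrative, not the paper's (this toy config works out to about 3.6x, not the demo's reported 2.4x, which presumably reflects a different bit allocation and bookkeeping overhead):

```typescript
// Sketch of KV-cache quantization: fp16 values -> 4-bit ints plus one
// fp16 scale per group. NOT the actual TurboQuant algorithm; group size
// and bit width are illustrative assumptions.
const GROUP = 32;                       // values sharing one scale
const BITS = 4;                         // quantized payload width
const QMAX = (1 << (BITS - 1)) - 1;     // 7 for symmetric int4

function quantize(group: number[]): { q: number[]; scale: number } {
  const amax = Math.max(...group.map(Math.abs)) || 1;
  const scale = amax / QMAX;
  return { q: group.map(v => Math.round(v / scale)), scale };
}

function dequantize(q: number[], scale: number): number[] {
  return q.map(v => v * scale);
}

// Effective bits per value: 4-bit payload plus a 16-bit scale amortized
// over the group. Compression relative to fp16 storage:
const bitsPerValue = BITS + 16 / GROUP; // 4.5 bits
const ratio = 16 / bitsPerValue;        // ~3.6x for this toy config

// Round-trip check: error is bounded by half a quantization step.
const kv = Array.from({ length: GROUP }, (_, i) => Math.sin(i) * 0.8);
const { q, scale } = quantize(kv);
const back = dequantize(q, scale);
const maxErr = Math.max(...kv.map((v, i) => Math.abs(v - back[i])));
console.log(ratio.toFixed(2), maxErr <= scale / 2 + 1e-9);
```

The point of the sketch is the trade: each key/value loses a little precision, but the cache shrinks by the ratio above, which is what lets longer generations stay resident in GPU memory.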
3B params, zero servers: Gemma 4 runs in Chrome at 30 tok/s
A browser-based demo running Gemma 4 (a 3.1GB quantized model) entirely in Chrome, using WebGPU to generate Excalidraw diagrams from text prompts. The TurboQuant algorithm compresses the KV cache 2.4×, letting longer prompts and outputs fit in GPU memory; the demo achieves 30+ tokens/second on desktop Chrome 134+.
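The 50-tokens-versus-5,000 shortcut works because Excalidraw's JSON is mostly boilerplate around a few meaningful fields, so the model can emit a terse command list that a deterministic expander inflates client-side. A hedged sketch of the idea — the demo's actual compact format isn't documented here, so the `rect`/`text` commands and the default fields below are invented for illustration:

```typescript
// Expand a terse, invented diagram DSL into Excalidraw-style JSON
// elements. This only illustrates why ~50 generated tokens can stand in
// for thousands of tokens of raw JSON; the real format is undocumented.
type Element = Record<string, unknown>;

const DEFAULTS = {
  strokeColor: "#1e1e1e", backgroundColor: "transparent",
  fillStyle: "solid", strokeWidth: 2, roughness: 1, opacity: 100,
  angle: 0, version: 1, isDeleted: false,
};

function expand(src: string): Element[] {
  return src.trim().split("\n").map((line, i) => {
    const [cmd, ...args] = line.trim().split(/\s+/);
    if (cmd === "rect") {
      const [x, y, w, h] = args.slice(0, 4).map(Number);
      return { id: `el${i}`, type: "rectangle", x, y,
               width: w, height: h, ...DEFAULTS };
    }
    if (cmd === "text") {
      const [x, y] = args.slice(0, 2).map(Number);
      return { id: `el${i}`, type: "text", x, y,
               text: args.slice(2).join(" "), fontSize: 20, ...DEFAULTS };
    }
    throw new Error(`unknown command: ${cmd}`);
  });
}

const compact = `rect 10 10 160 60
text 40 30 Browser
rect 10 140 160 60
text 55 160 GPU`;

const elements = expand(compact);
const json = JSON.stringify({ type: "excalidraw", elements }, null, 2);
// The compact form is a handful of tokens; the expanded JSON is an
// order of magnitude larger before any styling or arrow bindings.
console.log(compact.length, json.length);
```

Generating the short form and expanding it deterministically moves the verbose part of the output out of the token budget entirely, which matters as much for speed as the KV cache compression does.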