Agam Brahma at Abacus Noir figured out how to run GPU inference from WebAssembly without copying memory on Apple Silicon. The trick exploits Apple's unified memory architecture, where the CPU and GPU share the same physical RAM. No copying between the VM sandbox and the accelerator. His benchmarks running Llama 3.2 1B Instruct show ~9 ms per-token latency on a 2021 M1 MacBook Pro, with essentially zero memory overhead, versus the ~16.78 MB a copy path wastes duplicating a 16 MB region.

The implementation chains three pieces together. First, mmap allocates page-aligned memory, which both Metal buffers and Wasm linear memory require. Second, Metal's zero-copy buffer API (newBufferWithBytesNoCopy) wraps that existing pointer, so the GPU reads the same bytes. Third, Wasmtime's MemoryCreator trait lets the host supply its own allocator for a module's linear memory. The result: a Wasm module fills a matrix in its linear memory, the GPU computes on it in place, and the module sees the results through the same pointer. Brahma verified correctness with a 128×128 matrix multiply: zero errors across all 16,384 output elements.
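The first link in that chain can be sketched in a few lines. This is a minimal illustration, not Brahma's code: it allocates a page-aligned region with a raw mmap call (constants are the Linux x86_64 values; macOS uses MAP_ANON = 0x1000), the same kind of region that Metal's newBufferWithBytesNoCopy and a Wasmtime MemoryCreator implementation would then wrap without copying.

```rust
use std::ptr;

// Raw libc bindings so the sketch needs no external crates.
extern "C" {
    fn mmap(addr: *mut u8, len: usize, prot: i32, flags: i32, fd: i32, off: i64) -> *mut u8;
    fn munmap(addr: *mut u8, len: usize) -> i32;
}

const PROT_READ: i32 = 1;
const PROT_WRITE: i32 = 2;
const MAP_PRIVATE: i32 = 0x02;
const MAP_ANONYMOUS: i32 = 0x20; // Linux value; macOS uses 0x1000 (MAP_ANON)
const PAGE: usize = 4096; // illustrative; query sysconf(_SC_PAGESIZE) in real code

/// Allocate `len` bytes, rounded up to a whole number of pages.
fn page_aligned_alloc(len: usize) -> *mut u8 {
    let rounded = (len + PAGE - 1) & !(PAGE - 1);
    unsafe {
        let p = mmap(
            ptr::null_mut(),
            rounded,
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS,
            -1,
            0,
        );
        assert!(p as isize != -1, "mmap failed");
        p
    }
}

fn main() {
    let len = 16 * 1024 * 1024; // a 16 MiB region, as in the benchmark
    let p = page_aligned_alloc(len);
    // mmap returns page-aligned memory, the precondition for wrapping the
    // same pointer in a no-copy Metal buffer and in Wasm linear memory.
    assert_eq!(p as usize % PAGE, 0);
    unsafe {
        // The Wasm module would write here; the GPU would read the same bytes.
        *p = 42;
        assert_eq!(*p, 42);
        munmap(p, len);
    }
}
```

On macOS, the pointer (and length) handed to newBufferWithBytesNoCopy must be page-aligned exactly as above; the MemoryCreator implementation hands the same region to Wasmtime as the module's linear memory, which is what makes all three views alias one allocation.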

Transformer inference builds up a KV cache across conversation turns. Kill the process and that cache is gone. Brahma's approach lets him serialize the KV cache to safetensors format (1.1 ms for 24 tokens, 1.58 MB) and restore it later. Restore takes 1.4 ms versus 67.7 ms to re-prefill from scratch. At 4,096 tokens, restore would be roughly 100x faster than recomputation.
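The safetensors layout that makes this cheap is simple: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape, and byte offsets, then the raw tensor bytes. A minimal hand-rolled writer (a sketch, not the safetensors crate or Driftwood's code; the tensor name and shape are illustrative, and the reference implementation additionally pads the header for 8-byte data alignment, omitted here) looks like this:

```rust
/// Serialize one F32 tensor in safetensors layout:
/// [u64 header length LE][JSON header][raw little-endian tensor bytes].
fn to_safetensors(name: &str, shape: &[usize], data: &[f32]) -> Vec<u8> {
    let shape_json = shape
        .iter()
        .map(|d| d.to_string())
        .collect::<Vec<_>>()
        .join(",");
    let nbytes = data.len() * 4;
    let header = format!(
        "{{\"{name}\":{{\"dtype\":\"F32\",\"shape\":[{shape_json}],\"data_offsets\":[0,{nbytes}]}}}}"
    );
    let mut out = Vec::with_capacity(8 + header.len() + nbytes);
    out.extend_from_slice(&(header.len() as u64).to_le_bytes());
    out.extend_from_slice(header.as_bytes());
    for v in data {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

fn main() {
    // Illustrative: one layer's key cache for 24 tokens with a 64-dim head.
    let data = vec![0.0f32; 24 * 64];
    let bytes = to_safetensors("layer0.k", &[24, 64], &data);
    println!("serialized {} bytes", bytes.len());
}
```

Restoring is the inverse: read the header length, parse the JSON, and memcpy the tensor bytes back into place, which is why restore cost scales with bytes on disk rather than with the quadratic attention work that re-prefilling repeats.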

That's the difference between stateful agents that can pause and resume, and ones that can't.

Driftwood, Brahma's experimental runtime, is built around this capability. An actor can checkpoint its KV cache, move to a different machine, and pick up exactly where it left off. Hacker News commenters noted this works in Wasmtime rather than browsers, and asked what it buys over native host-side code. Fair points. But for anyone building sandboxed AI workloads on Apple Silicon, eliminating the copy tax is a real win.