Agam Brahma has demonstrated that WebAssembly modules running on Apple Silicon can share memory directly with the GPU, with no copies required. The proof of concept works end to end: a Wasm module fills a matrix, the GPU computes on it in place, and the results appear back in the module's linear memory through the same pointer. Tested with Llama 3.2 1B inference on an M1 MacBook Pro, the Wasm-to-GPU dispatch overhead was negligible.

Three components make this work. mmap provides page-aligned memory on ARM64 macOS. Metal's bytesNoCopy API wraps that existing pointer as a GPU buffer without copying. Wasmtime's MemoryCreator trait lets you control how linear memory is allocated. Chain them together and both the Wasm runtime and the GPU operate on identical physical bytes. Brahma verified pointer identity and confirmed zero hidden copies: the RSS delta for a 16MB region was 0.03MB on the zero-copy path versus 16.78MB on the explicit-copy path.

The real benefit is KV cache serialization for transformer inference. Because the cache lives in controlled, GPU-accessible memory, Brahma can dump it to safetensors format and restore it later. Restoring a 24-token cache took 1.4ms compared to 67.7ms for re-prefilling from scratch, a roughly 48x speedup that grows with context length; at 4,096 tokens, restore would be roughly 100x faster than recomputation. Brahma is building a project called Driftwood that uses this approach for stateful AI inference actors.

One caveat: the zero-copy path only works because Apple Silicon shares physical memory between CPU and GPU. On discrete-GPU systems with PCIe buses, you're still stuck copying.