Running different neural networks for different pixels on a GPU used to require slow workarounds. NVIDIA's Cooperative Vectors extension changes that by shifting the GPU programming interface from matrix-matrix to vector-matrix operations, as Luca Quartesan of Evolve Solutions details in a technical writeup.

Techniques such as Neural Materials and Neural Texture Compression need to evaluate a different network per pixel. The older Cooperative Matrix standard couldn't express that: developers had to bucket pixels by material and launch a separate dispatch per bucket, an approach Quartesan calls 'cumbersome and quite involved.'

Cooperative Vectors works by gathering the individual vector-matrix multiply requests from the threads of a wave into a single collective matrix-matrix operation that the hardware can accelerate. NVIDIA runs these on Tensor Cores, Intel on its Xe Matrix Extensions (XMX), and AMD on Wave Matrix Multiply Accumulate (WMMA) instructions. For now, though, the Vulkan implementation is NVIDIA-only; DirectX is standardizing its equivalent by rebranding WaveMatrix to 'Linear Algebra,' which suggests broader cross-platform support is coming.

The extension supports both inference and training pipelines. Neural Radiance Caching illustrates the training use case: small networks trained at runtime from inputs such as surface normals and view directions. The API exposes two matrix layouts: MulOptimal for inference, and OuterProductOptimal for training and gradient accumulation.

Long vectors are stored directly in VGPRs rather than in special registers, even when hardware acceleration kicks in. Quartesan notes that long vectors don't require uniform control flow to function, but uniform paths unlock driver optimizations, and Shader Execution Reordering can help regroup divergent threads onto uniform paths. What previously required workarounds in compute shaders can now be a simple branch in shader code.
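The core trick, combining each thread's vector-matrix multiply into one wave-wide matrix-matrix multiply, can be illustrated in a few lines. This is a minimal NumPy sketch of the idea, not actual shader code; the wave size, layer widths, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
WAVE = 32        # assumed threads per wave
IN, OUT = 16, 8  # assumed MLP layer widths

# Each thread holds one input vector (e.g. per-pixel features).
inputs = rng.standard_normal((WAVE, IN)).astype(np.float32)
# Shared layer weights (uniform across the wave).
weights = rng.standard_normal((IN, OUT)).astype(np.float32)

# Naive view: every thread issues its own vector-matrix multiply.
per_thread = np.stack([inputs[t] @ weights for t in range(WAVE)])

# Cooperative-Vectors view: the wave's vectors are stacked into a
# matrix, so the whole wave runs one matrix-matrix multiply that
# maps onto the tensor/matrix units.
combined = inputs @ weights

# Both formulations produce the same per-thread results.
assert np.allclose(per_thread, combined, atol=1e-5)
```

When control flow diverges, so that threads select different weight matrices, this stacking is no longer a single matmul; that is the divergence the extension lets the driver handle rather than the developer.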
NVIDIA's Cooperative Vectors Finally Enable Per-Pixel Neural Networks
Technical deep-dive into Cooperative Vectors, a GPU hardware extension for accelerating neural network inference and training in rendering engines. Enables vector-matrix operations that support divergent workloads across different pixels, solving challenges with Neural Materials, Neural Radiance Caching (NRC), and Neural Texture Compression. Covers the shift from Cooperative Matrix to Cooperative Vector, implementation details across Vulkan and DirectX, matrix layouts (MulOptimal and OuterProductOptimal), and inference/training pipelines for MLPs. Discusses hardware from NVIDIA (Tensor Cores), Intel (XMX), and AMD (WMMA), and notes DirectX rebranding to 'Linear Algebra' while Vulkan's Cooperative Vector remains NVIDIA-specific.
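The MulOptimal/OuterProductOptimal split mirrors the two core operations of an MLP layer: inference is a vector-matrix multiply, while training accumulates weight gradients as outer products of the layer input and the back-propagated output gradient. A minimal NumPy sketch of that math (the dimensions and names are illustrative assumptions, not the extension's API):

```python
import numpy as np

rng = np.random.default_rng(1)
IN, OUT = 16, 8  # assumed layer widths
x = rng.standard_normal(IN).astype(np.float32)        # layer input
w = rng.standard_normal((IN, OUT)).astype(np.float32) # layer weights

# Forward pass: vector-matrix multiply (the MulOptimal-style op).
y = x @ w

# Backward pass: given the gradient dL/dy, the weight gradient is
# the outer product of input and output gradient (the
# OuterProductOptimal-style accumulation).
dy = rng.standard_normal(OUT).astype(np.float32)
dw = np.outer(x, dy)
assert dw.shape == w.shape

# Finite-difference check on one weight: perturbing w[0, 0] changes
# the loss L = (x @ w) . dy by x[0] * dy[0] per unit, matching dw.
eps = np.float32(1e-3)
loss = lambda W: float((x @ W) @ dy)
w2 = w.copy()
w2[0, 0] += eps
fd = (loss(w2) - loss(w)) / eps
assert abs(fd - dw[0, 0]) < 1e-2
```

Because the loss is linear in the weights here, the finite-difference estimate agrees with the outer-product gradient up to float32 rounding.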