Running different neural networks for different pixels on a GPU used to require slow workarounds. NVIDIA's Cooperative Vectors extension changes that by shifting the GPU programming interface from matrix-matrix to vector-matrix operations, as Luca Quartesan of Evolve Solutions details in a technical writeup.

Techniques such as Neural Materials and Neural Texture Compression need to evaluate a different network per pixel. The older Cooperative Matrix standard couldn't express that: developers had to bucket pixels by material and launch a separate dispatch per bucket, an approach Quartesan calls 'cumbersome and quite involved.'

Cooperative Vectors works by gathering the individual vector-matrix multiply requests from the threads of a wave into a single collective matrix-matrix operation that the hardware can accelerate. NVIDIA runs these on Tensor Cores, Intel on its Xe Matrix Extensions (XMX), and AMD on Wave Matrix Multiply Accumulate (WMMA) instructions. For now, though, the Vulkan implementation is NVIDIA-only; DirectX is standardizing its equivalent by rebranding WaveMatrix to 'Linear Algebra,' which suggests broader cross-platform support is coming.

The extension supports both inference and training pipelines. Neural Radiance Caching illustrates the training use case: small networks trained at runtime from inputs such as surface normals and view directions. The API exposes two matrix layouts: MulOptimal for inference, and OuterProductOptimal for training and gradient accumulation.

Long vectors are stored directly in VGPRs rather than in special registers, even when hardware acceleration kicks in. Quartesan notes that long vectors don't require uniform control flow to function, but uniform paths unlock driver optimizations, and Shader Execution Reordering can help regroup divergent threads onto uniform paths. What previously required workarounds in compute shaders can now be a simple branch in shader code.
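The core trick, combining each thread's vector-matrix multiply into one wave-wide matrix-matrix multiply, can be illustrated in a few lines. This is a minimal NumPy sketch of the idea, not actual shader code; the wave size, layer widths, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
WAVE = 32        # assumed threads per wave
IN, OUT = 16, 8  # assumed MLP layer widths

# Each thread holds one input vector (e.g. per-pixel features).
inputs = rng.standard_normal((WAVE, IN)).astype(np.float32)
# Shared layer weights (uniform across the wave).
weights = rng.standard_normal((IN, OUT)).astype(np.float32)

# Naive view: every thread issues its own vector-matrix multiply.
per_thread = np.stack([inputs[t] @ weights for t in range(WAVE)])

# Cooperative-Vectors view: the wave's vectors are stacked into a
# matrix, so the whole wave runs one matrix-matrix multiply that
# maps onto the tensor/matrix units.
combined = inputs @ weights

# Both formulations produce the same per-thread results.
assert np.allclose(per_thread, combined, atol=1e-5)
```

When control flow diverges, so that threads select different weight matrices, this stacking is no longer a single matmul; that is the divergence the extension lets the driver handle rather than the developer.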
NVIDIA's Cooperative Vectors Finally Enable Per-Pixel Neural Networks
Technical deep-dive into Cooperative Vectors, a GPU hardware extension for accelerating neural network inference and training in rendering engines. Enables vector-matrix operations that support divergent workloads across different pixels, solving challenges with Neural Materials, Neural Radiance Caching (NRC), and Neural Texture Compression. Covers the shift from Cooperative Matrix to Cooperative Vector, implementation details across Vulkan and DirectX, matrix layouts (MulOptimal and OuterProductOptimal), and inference/training pipelines for MLPs. Discusses hardware from NVIDIA (Tensor Cores), Intel (XMX), and AMD (WMMA), and notes DirectX rebranding to 'Linear Algebra' while Vulkan's Cooperative Vector remains NVIDIA-specific.
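The MulOptimal/OuterProductOptimal split mirrors the two core operations of an MLP layer: inference is a vector-matrix multiply, while training accumulates weight gradients as outer products of the layer input and the back-propagated output gradient. A minimal NumPy sketch of that math (the dimensions and names are illustrative assumptions, not the extension's API):

```python
import numpy as np

rng = np.random.default_rng(1)
IN, OUT = 16, 8  # assumed layer widths
x = rng.standard_normal(IN).astype(np.float32)        # layer input
w = rng.standard_normal((IN, OUT)).astype(np.float32) # layer weights

# Forward pass: vector-matrix multiply (the MulOptimal-style op).
y = x @ w

# Backward pass: given the gradient dL/dy, the weight gradient is
# the outer product of input and output gradient (the
# OuterProductOptimal-style accumulation).
dy = rng.standard_normal(OUT).astype(np.float32)
dw = np.outer(x, dy)
assert dw.shape == w.shape

# Finite-difference check on one weight: perturbing w[0, 0] changes
# the loss L = (x @ w) . dy by x[0] * dy[0] per unit, matching dw.
eps = np.float32(1e-3)
loss = lambda W: float((x @ W) @ dy)
w2 = w.copy()
w2[0, 0] += eps
fd = (loss(w2) - loss(w)) / eps
assert abs(fd - dw[0, 0]) < 1e-2
```

Because the loss is linear in the weights here, the finite-difference estimate agrees with the outer-product gradient up to float32 rounding.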