Getting a GPU-accelerated Kubernetes cluster working is one thing. Getting it working the same way twice — across different clouds, hardware generations, and deployment environments — is another problem entirely. It's the kind of thing that turns into tribal knowledge, sprawling runbooks, and a resident expert who's memorized which NCCL tuning flag to flip for which environment.

NVIDIA's answer is AI Cluster Runtime (AICR), an open-source project that replaces that accumulated institutional knowledge with published, version-locked configuration recipes. Released in alpha by engineers Mark Chmarny and Nathan Taber, AICR delivers its recipes as composable YAML overlays — validated against NVIDIA's standards and queryable via a REST API or the `aicr` CLI.
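To make the queryability concrete, here is a minimal sketch of how a recipe lookup along those dimensions might be composed. The endpoint path and parameter names are illustrative assumptions, not AICR's published API surface:

```python
from urllib.parse import urlencode

def build_recipe_query(base_url: str, environment: str, intent: str, hardware: str) -> str:
    """Compose a recipe-lookup URL from environment/intent/hardware selectors.

    The `/recipes` path and these parameter names are hypothetical,
    chosen only to illustrate the selector dimensions.
    """
    params = urlencode({
        "environment": environment,  # e.g. "eks"
        "intent": intent,            # "training" or "inference"
        "hardware": hardware,        # e.g. "blackwell"
    })
    return f"{base_url}/recipes?{params}"

print(build_recipe_query("https://aicr.example.internal", "eks", "training", "blackwell"))
# → https://aicr.example.internal/recipes?environment=eks&intent=training&hardware=blackwell
```

The same three selectors map naturally onto CLI arguments or a REST query string; the point is that a recipe is addressed by where it runs, what it runs, and what it runs on.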

The composition model is where the project gets interesting. Recipes are built from four layers: a base layer for universal defaults, an environment layer for cloud-specific dependencies (AWS EBS CSI drivers and EFA plugins for EKS, for instance), an intent layer that tunes NCCL differently for training versus inference, and a hardware layer that pins driver versions and unlocks accelerator-specific features like CDI and GDRCopy for H100 and Blackwell GPUs. A fully specialized recipe for Blackwell on EKS targeting training on Kubeflow spans 268 configuration values across 16 components. Switching from training to inference intent alone swaps five components and changes 41 configuration values.
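The layering described above boils down to a deep merge in which later, more specific layers override earlier ones. A generic Python sketch of that pattern, with invented layer contents (the keys below are assumptions, not real recipe fields, and this is not AICR's actual merge implementation):

```python
from copy import deepcopy

def merge(base: dict, overlay: dict) -> dict:
    """Recursively merge `overlay` onto `base`; overlay values win on conflict."""
    out = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)  # descend into nested sections
        else:
            out[key] = value                   # overlay replaces scalar/list values
    return out

# Hypothetical layer contents, mirroring the four layers in the article.
base        = {"nccl": {"debug": "WARN"}, "driver": {"channel": "stable"}}
environment = {"storage": {"csi": "ebs"}, "network": {"plugin": "efa"}}  # EKS-specific
intent      = {"nccl": {"algo": "Ring"}}                                 # training tuning
hardware    = {"driver": {"version": "pinned"}, "features": {"gdrcopy": True, "cdi": True}}

recipe = base
for layer in (environment, intent, hardware):  # most-specific layer applied last
    recipe = merge(recipe, layer)

print(recipe["nccl"])   # base debug setting survives; intent adds the algo
print(recipe["features"])
```

Swapping a single layer (say, the intent layer from training to inference) re-runs the same merge and yields a different final recipe, which is how one intent change can ripple out to dozens of configuration values.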

AICR also ships a snapshot tool that captures live cluster state — OS, kernel, GPU hardware, driver, Kubernetes version, installed operators — and stores it as a ConfigMap baseline. That baseline drives pre-deployment readiness checks and post-deployment conformance validation against CNCF standards. Deployment integration covers ArgoCD, OCI bundle distribution, and fully air-gapped environments. Inference workloads target NVIDIA Dynamo; training workloads hook into Kubeflow Trainer.
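A readiness check against such a baseline is, at its core, a field-by-field comparison between the captured snapshot and what the recipe requires. A minimal sketch of that idea, with assumed field names (AICR's actual checks and schema may differ):

```python
def readiness_check(baseline: dict, required: dict) -> list[str]:
    """Compare a captured cluster baseline against a recipe's requirements.

    Returns a human-readable list of mismatches; empty means ready.
    Field names here are illustrative assumptions.
    """
    problems = []
    for key, want in required.items():
        have = baseline.get(key)
        if have != want:
            problems.append(f"{key}: have {have!r}, want {want!r}")
    return problems

# A snapshot of the kind the article says is stored as a ConfigMap baseline
# (fields assumed for illustration).
baseline = {
    "kubernetes": "1.29",
    "gpu": "H100",
    "driver": "550.54",
    "kernel": "6.5.0",
}
required = {"kubernetes": "1.29", "driver": "550.90"}

for problem in readiness_check(baseline, required):
    print("not ready:", problem)
# → not ready: driver: have '550.54', want '550.90'
```

Post-deployment conformance validation is the same comparison run in the other direction: the live cluster is re-snapshotted and diffed against the baseline to catch drift.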

The project is designed to be contributed to rather than forked. CSPs, OEMs, and platform teams can add environment-specific overlay extensions, and organizations can maintain private configuration overlays alongside public recipes without branching the core project. Every release ships with SLSA Level 3 provenance, signed SBOMs, and image attestations.