SemiAnalysis just published a brutal five-month study comparing AMD's MI300X to Nvidia's H100 and H200 for AI training workloads. AMD's GPU looks great on paper, with specs that should beat Nvidia at a lower price. In practice, the MI300X delivers worse performance per dollar because AMD's software is a mess. Bugs make out-of-the-box training "impossible," according to the report by Dylan Patel, Daniel Nishball, and Reyk Knuhtsen.
The researchers gave AMD every chance. They shared benchmark code with both companies, identified AMD software bugs, and worked with AMD engineers to fix them. Even after all that effort, AMD's public stable release still falls short. The team had to use custom builds hand-crafted by AMD's principal engineers just to get reasonable numbers. Nvidia's H100 and H200, by contrast, worked reliably on standard public releases. On reliability, AMD's ROCm platform can't touch CUDA, and that matters more than any spec sheet.
AMD knows it has a software problem. The company reorganized in 2023, promoting Andrej Zdravkovic to SVP of Software Engineering and consolidating AI efforts under Vamsi Boppana's new AI Group. It's hiring aggressively. But fixing a culture that treated software as an afterthought takes time. For customers who bought MI300X based on specs, that's cold comfort. The CUDA moat isn't about features. It's about reliability earned through years of investment. The hardware is ready. The software isn't.