SemiAnalysis just published a brutal five-month benchmarking study of AMD's MI300X against Nvidia's H100 and H200, and the results aren't pretty for AMD. As the detailed report explains, the MI300X's real-world AI training performance falls well short of what Nvidia's chips deliver. The problem is software. According to the report authored by Dylan Patel, Daniel Nishball, and Reyk Knuhtsen, AMD's ROCm stack is riddled with bugs that made out-of-the-box training "impossible."

Nvidia's "CUDA moat" remains very real. Its integrated ecosystem, including optimized libraries like NCCL and mature networking support via InfiniBand, delivers a stable experience that AMD's ROCm platform cannot yet match. SemiAnalysis notes that however fast AMD works to close the gap, Nvidia's engineers keep extending their lead with new features and performance updates. This software immaturity wipes out AMD's hardware cost advantage, making the total cost of ownership for training workloads less favorable than it appears on paper.

AMD isn't sitting still. The company restructured in 2023, pulling AI hardware and software into one group led by Senior Vice President Vamsi Boppana, who reports to Forrest Norrod. The idea was to get the ROCm team and hardware architects working together instead of in isolation, since that disconnect contributed to the MI300X's rocky launch. AMD is also leaning on partnerships with Hugging Face and the PyTorch Foundation to mature its stack faster than internal hiring alone would allow. But that dependence on external collaborators carries risk: components critical to enterprise stability sit partially outside AMD's direct control.

The report's recommendation is blunt. AMD needs to rethink how it builds software, throw more people at QA, and expand internal testing. The engineers are capable. The hardware has potential. But until the software catches up, the MI300X will keep underperforming its on-paper specs.