Marco Inacio got AMD's Strix Halo running local LLM inference with ROCm 7.2, and the results are promising if you're willing to tinker. On Ubuntu 24.04, he configured the machine's 128GB of unified memory to be shared between CPU and GPU and ran the Qwen3-30B-A3B model through llama.cpp inside a Podman container. The whole thing feeds into Opencode, a terminal-based AI coding agent, giving him a complete on-device setup for AI-assisted programming. "There were some rough edges, but I think it was quite worth it," Inacio wrote.
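The plumbing is conventional once the container is up: llama.cpp's llama-server exposes an OpenAI-compatible HTTP API, which is exactly the kind of endpoint a coding tool like Opencode can be pointed at. Here's a minimal sketch of talking to such a server from Python; the port, host, and model name are placeholder assumptions, not details from Inacio's post.

```python
# Smoke test against a local llama.cpp server, assuming it was started with
# something like `llama-server -m <model>.gguf --port 8080` inside the ROCm container.
# Host, port, and model name below are placeholders, not values from the post.
import json
import urllib.request

payload = {
    "model": "qwen3-30b-a3b",  # llama-server mostly ignores this; kept for OpenAI-shaped clients
    "messages": [{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])
```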

Those rough edges are real. The GPU wasn't even detected until Inacio updated his BIOS. From there, he had to manually set reserved video memory to just 512MB, configure the Graphics Translation Table for memory sharing, and add kernel parameters like ttm.pages_limit=32768000 to GRUB. Leave too much memory for the GPU and the Linux kernel gets unstable. Leave too little and you waste the hardware's main advantage. It's a balancing act that CUDA users never have to think about.
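The number in that kernel parameter is easier to reason about once you do the page math: ttm.pages_limit caps how many 4 KiB system-RAM pages the amdgpu driver's TTM allocator may hand to the GPU. A rough arithmetic sketch, treating the 128GB as 128 GiB (an assumption, not a figure from the post), shows why 32768000 stops just short of the whole pool.

```python
# Back-of-envelope check on the GRUB value: how much of the unified memory
# the GPU is allowed to map, and how much is left for the kernel and CPU side.
PAGE_SIZE = 4096              # bytes per page on x86-64
pages_limit = 32_768_000      # value passed via ttm.pages_limit

gpu_cap_gib = pages_limit * PAGE_SIZE / 2**30
total_gib = 128               # assumed: the full unified memory pool, read as GiB

print(f"GPU-mappable memory: {gpu_cap_gib:.0f} GiB of {total_gib} GiB")
print(f"Left for kernel and CPU-side processes: {total_gib - gpu_cap_gib:.0f} GiB")
# -> GPU-mappable memory: 125 GiB of 128 GiB
# -> Left for kernel and CPU-side processes: 3 GiB
```

That few-GiB margin is the balancing act in miniature: raise the limit and the kernel starves; lower it and large models no longer fit on the GPU side.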

The Hacker News community flagged that Inacio's post misses Strix Halo's biggest selling point for AI workloads: quad-channel RAM. LLM inference is memory-bound, and that quad-channel config is what makes Strix Halo competitive for local inference in the first place. Commenters also pointed to AMD's official Lemonade SDK, which includes gfx1151-specific builds of vLLM, llama.cpp, and a port of Apple's MLX framework. The MLX Engine reportedly delivers 83% better performance than Vulkan on Strix Halo, running as a pure C++ binary with cold start measured in seconds rather than minutes.
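The memory-bound argument is worth making concrete. During decoding, every generated token has to stream the active weights out of RAM, so tokens per second is roughly capped at bandwidth divided by bytes read per token. The sketch below uses assumed numbers for illustration (256 GB/s for a 256-bit LPDDR5X-8000 part like Strix Halo, ~3B active parameters for an A3B-style MoE at a Q4-class quant, and it ignores KV-cache traffic); none of these figures come from Inacio's post or the HN thread.

```python
# Rough decode-speed ceiling for memory-bound LLM inference:
# tokens/s <= memory bandwidth / bytes of weights read per token.

def decode_ceiling_tps(bandwidth_gb_s: float, active_params_b: float, bits_per_weight: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

quad_channel = 256.0   # assumed GB/s for a 256-bit LPDDR5X-8000 setup like Strix Halo
dual_channel = 100.0   # assumed GB/s for a typical dual-channel desktop DDR5 setup
active_params = 3.0    # ~3B active parameters per token for an A3B-style MoE
q4_bits = 4.5          # ~bits per weight for a Q4_K-class GGUF quant

for name, bw in [("quad-channel", quad_channel), ("dual-channel", dual_channel)]:
    print(f"{name}: ~{decode_ceiling_tps(bw, active_params, q4_bits):.0f} tokens/s ceiling")
```

Whatever the exact figures, the ratio is the point: the wider memory bus raises the theoretical ceiling by more than 2x, which matters far more for token generation than raw compute does.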

For anyone building AI agents that need to run locally, Strix Halo with ROCm is viable now but still demands patience. The tooling exists, the memory architecture genuinely helps with inference, and CUDA's advantage is shrinking. But you'll be reading GitHub issues, not polished documentation. Just don't expect plug-and-play.