Val just wanted to eject a stubborn external drive. Instead, he spent the day running six major LLMs through a real coding test. Writing on CodeJam, he had GPT-5.4, Opus 4.6, GLM-5.1, Kimi K2.5, MiMo V2 Pro, and MiniMax M2.7 each build a native macOS app for unmounting drives. All six produced reasonable plans and compiled successfully through the OpenCode platform, an AI coding agent framework. But compiling and working are different things. Most apps crashed at runtime, and Val had to step in to fix them. If you can give an LLM a tool to test its own output end to end, you save yourself a lot of painful iteration. Native app development doesn't have that loop yet.

The real differences showed up in code quality. GPT-5.4 took first overall, with Opus 4.6 right behind it. Both are expensive frontier models, so that tracks. The more interesting finding was GLM-5.1, from Tsinghua University spinoff Zhipu AI. It consistently ranked fourth overall but scored near the top for code cleanliness. If you don't mind steering the technical approach yourself and want cleaner output that needs less refinement, GLM-5.1 looks like the practical pick. And it costs a fraction of what GPT-5.4 or Opus charge.

The weirdest part of the test was the peer review experiment. Val had each model rank and score the others' code for cleanliness and technical approach. The models didn't hype themselves: almost every model rated its own output at or below its average score from the others. GPT-5.4 was the only one to rank itself first, and the consensus agreed. The cross-evaluation rankings lined up closely with Val's own assessment, which is either a sign of genuine code quality differences or a sign that all these models share similar aesthetic preferences about what good code looks like. Probably both.
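The humility check described above is easy to replicate: build a matrix of who-scored-whom, then compare each model's self-score to the mean of what its peers gave it. A minimal sketch, using made-up scores purely for illustration (these are not Val's actual numbers, and the model labels are just placeholders):

```python
# Hypothetical cross-evaluation scores: rows are the evaluator,
# columns are the model being evaluated. Illustrative data only.
scores = {
    "gpt":  {"gpt": 8.0, "opus": 8.5, "glm": 7.5},
    "opus": {"gpt": 9.0, "opus": 7.5, "glm": 8.0},
    "glm":  {"gpt": 8.5, "opus": 8.0, "glm": 7.0},
}

def self_vs_peers(scores):
    """For each model, return (self_score, mean score from the other models)."""
    out = {}
    for model in scores:
        self_score = scores[model][model]
        peer = [scores[evaluator][model] for evaluator in scores if evaluator != model]
        out[model] = (self_score, sum(peer) / len(peer))
    return out

for model, (own, peers) in self_vs_peers(scores).items():
    verdict = "humble" if own <= peers else "confident"
    print(f"{model}: self={own:.1f} peers={peers:.1f} ({verdict})")
```

With this toy data every model rates itself at or below its peer average, the pattern Val observed in the real test.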

Kimi K2.5 and MiniMax M2.7 sat at the bottom consistently. Kimi comes from Moonshot AI, the Yang Zhilin-founded startup that hit unicorn status in under eight months. MiniMax is backed by Tencent and miHoYo. Both have serious resources behind them. In this particular test, they didn't compete with the top four. Val's conclusion: comparing models is peak procrastination. Any of the top models will do fine, and GLM and MiMo are now playing in that league at lower cost. Pick one and get back to work.