IBM just released Granite 4.1, a family of open-source models (3B, 8B, 30B) under Apache 2.0 with a 512K context window. The 8B model is doing something that shouldn't be possible: it matches or beats the previous Granite 4.0-H-Small, a 32B Mixture-of-Experts model with 9B active parameters, across every benchmark IBM ran. BFCL V3 tool calling: 68.3 versus 64.7. GSM8K math: 92.5. ArenaHard chat quality: 69.0. A smaller, simpler, dense architecture is winning consistently.

IBM trained on 15 trillion tokens across five phases, shifting from broad web content (59% CommonCrawl early) to heavy math (35%) and code (30%) later. Data quality drove the results. They built an aggressive filtering pipeline using an LLM-as-Judge that scored every assistant response across six dimensions before any fine-tuning sample touched the model. Hallucinations and wrong computations got automatic rejection. What survived: 4.1 million curated samples.

The four-stage reinforcement learning pipeline is where IBM got unusually honest about training failures. Stage one trained jointly across nine domains. Stage two, RLHF on chat, boosted AlpacaEval by 18.9 points, but it tanked math scores: GSM8K and DeepMind-Math both regressed. Stage three was a quick identity calibration. Stage four was dedicated math RL to recover the damage. It worked; GSM8K surpassed baseline by 3.8 points. You don't often see companies admit what broke mid-training, and that kind of honesty helps anyone building their own models.

Early community feedback says the 8B runs well on commodity hardware, with a clinical tone suited to data-processing agents. Some users still prefer Qwen3.6 35B A3B for heavier work. The broader signal: both IBM and Mistral have shifted away from MoE toward dense models, even as frontier systems double down on MoE. If you need predictable latency and cost for enterprise deployment, dense architectures have a real argument right now.
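A few rough sketches to make that training recipe more concrete. First, the phased data mixture. Only the figures quoted above (59% CommonCrawl early, 35% math and 30% code late) come from IBM's description; the two-phase split, every other weight, and the sampling helper below are placeholders.

```python
import random

# Illustrative phase mixtures. Only 0.59 CommonCrawl (early) and
# 0.35 math / 0.30 code (late) are from IBM's description; the other
# weights are placeholders, and the real schedule has five phases.
PHASE_MIXTURES = [
    {"commoncrawl": 0.59, "code": 0.10, "math": 0.05, "other": 0.26},  # early: broad web
    {"commoncrawl": 0.15, "code": 0.30, "math": 0.35, "other": 0.20},  # late: math/code heavy
]

def sample_source(phase: int, rng: random.Random) -> str:
    """Pick which data source the next training document comes from."""
    weights = PHASE_MIXTURES[phase]
    sources, probs = zip(*weights.items())
    return rng.choices(sources, weights=probs, k=1)[0]

rng = random.Random(0)
print([sample_source(1, rng) for _ in range(5)])  # draws skew toward math/code
```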
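Next, the LLM-as-Judge gate. IBM doesn't name the six dimensions, so the ones below are guesses, as are the 1-to-5 scale and the `judge` callable. The point is the shape of the filter: score every response on every dimension and drop the sample if any dimension fails, with hallucinations and wrong computations being the automatic rejections mentioned above.

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder dimension names; the article only says there were six.
DIMENSIONS = [
    "helpfulness",
    "instruction_following",
    "factuality",               # hallucination check
    "computation_correctness",  # math/arithmetic check
    "safety",
    "style",
]
PASS_THRESHOLD = 4  # assumed 1-5 judge scale

@dataclass
class Sample:
    prompt: str
    response: str

def filter_sft_samples(
    samples: list[Sample],
    judge: Callable[[str, str, str], int],
) -> list[Sample]:
    """Keep only samples that clear every judged dimension.

    `judge(prompt, response, dimension)` stands in for a scoring call to a
    judge model; failing factuality or computation_correctness is what the
    write-up describes as automatic rejection.
    """
    kept = []
    for s in samples:
        scores = {d: judge(s.prompt, s.response, d) for d in DIMENSIONS}
        if all(score >= PASS_THRESHOLD for score in scores.values()):
            kept.append(s)
    return kept

# Dummy judge that fails the computation check for an obviously wrong sum.
dummy = lambda p, r, d: 1 if (d == "computation_correctness" and "2 + 2 = 5" in r) else 5
print(len(filter_sft_samples([Sample("What is 2+2?", "2 + 2 = 5")], dummy)))  # 0
```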
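Finally, the post-training sequence. Stage names and the stub functions are assumptions; only the ordering and the outcomes noted in the comments come from IBM's write-up. The structural takeaway is running the stages sequentially and evaluating after each one, which is how the chat-vs-math trade-off surfaced.

```python
RL_STAGES = [
    {"name": "joint_rl",             "focus": "nine domains trained together"},
    {"name": "rlhf_chat",            "focus": "chat preference data"},       # +18.9 AlpacaEval, math regressed
    {"name": "identity_calibration", "focus": "short identity pass"},
    {"name": "math_rl",              "focus": "dedicated math recovery"},    # GSM8K ends +3.8 over baseline
]

def train_stage(model, stage):
    """Stand-in for an actual RL trainer for one stage (e.g. a PPO/GRPO loop)."""
    print(f"training: {stage['name']} ({stage['focus']})")
    return model

def evaluate(model, benchmarks):
    """Stand-in for an eval harness; per-stage evals are what catch regressions."""
    return {b: None for b in benchmarks}

def run_post_training(model):
    for stage in RL_STAGES:
        model = train_stage(model, stage)
        print(evaluate(model, ["GSM8K", "AlpacaEval", "IFEval"]))
    return model
```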
Granite 4.1: IBM's 8B Model Matching 32B MoE
IBM released Granite 4.1, a family of open-source language models (3B, 8B, 30B) under the Apache 2.0 license. The 8B dense model matches or beats the previous 32B MoE Granite 4.0-H-Small across benchmarks including tool calling (BFCL V3), math (GSM8K), and instruction following (IFEval). Key features: a 512K context window, 15T tokens of training data, and aggressive data filtering. IBM was also unusually candid about training failures in its four-stage RL pipeline.
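If you want to try the 8B model yourself, the usual Hugging Face transformers flow should apply. The repo id below is a guess based on IBM's `ibm-granite/...` naming convention, so check the actual model card; in bf16 an 8B model is roughly 16 GB of weights, which is where the "runs well on commodity hardware" feedback comes from.

```python
# Minimal local-inference sketch with Hugging Face transformers.
# NOTE: the repo id is hypothetical; look up the real one on the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.1-8b-instruct"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~16 GB of weights for an 8B model
    device_map="auto",
)

messages = [{"role": "user", "content": "Extract the dates and totals from this invoice text: ..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```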