DeepSeek-R1's reported struggles with Monte Carlo Tree Search were widely read as a verdict: MCTS doesn't work for language model reasoning. Independent researcher Ayush Tambde disagrees, and his new study identifies the culprit — DeepSeek used standard UCT, which ignores action-level priors, while Tambde's setup uses pUCT. That single implementation difference, he argues, explains the gap.
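The difference comes down to one term in the node-selection rule: UCT's exploration bonus depends only on visit counts, while pUCT scales that bonus by a policy prior P(a|s). A minimal sketch of the two rules (constants and function names are illustrative, not from the paper):

```python
import math

def uct_score(q, n_parent, n_child, c=1.4):
    """Standard UCT: the exploration bonus depends only on visit counts,
    so the search treats all untried actions as equally promising."""
    if n_child == 0:
        return float("inf")  # unvisited actions are expanded first
    return q + c * math.sqrt(math.log(n_parent) / n_child)

def puct_score(q, prior, n_parent, n_child, c=1.5):
    """pUCT (AlphaZero-style): the bonus is scaled by a policy prior
    P(a|s), steering exploration toward actions the model's own
    distribution already considers likely."""
    return q + c * prior * math.sqrt(n_parent) / (1 + n_child)
```

With the prior in the bonus, low-probability branches get almost no exploration budget, which is the steering a language model's logits can supply and plain UCT throws away.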

Tambde applies MCTS over reasoning steps — not individual tokens — to Qwen-2.5-1.5B-Instruct, distilling the stronger search policy back into the model via an online PPO loop in a setup borrowed from the AlphaZero paradigm. On the Countdown combinatorial arithmetic task, the MCTS-distilled model achieves 11.3% mean@16 without any search harness at inference time, against 8.4% for CISPO (a clipped importance-sampling PPO variant) and 7.7% for best-of-N sampling — all up from a pre-RL baseline of 3.1%.
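Read here, mean@16 is the success rate over 16 independent samples per problem, averaged across problems. A minimal sketch of that interpretation (the paper may compute it differently):

```python
def mean_at_k(successes, k=16):
    """mean@k: fraction of correct completions out of k samples per
    problem, averaged over all problems. `successes` holds one count
    of correct completions per problem."""
    return sum(s / k for s in successes) / len(successes)
```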

The pUCT vs. UCT distinction is the paper's central argument. Drawing on an analysis by researcher Finbarr Timbers, Tambde attributes DeepSeek-R1's "limited success" with MCTS to a missing ingredient: sequence-level log-probability priors computed over full reasoning steps. In his setup, parallel MCTS workers share a per-sample search tree and use virtual losses (temporary penalties on nodes already being explored) to steer concurrent workers toward distinct trajectories. A learned MLP value head guides the search. The infrastructure runs on 8xH100 GPUs coordinated via Rust, Redis, and gRPC.
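One plausible reading of "sequence-level log-probability priors computed over full reasoning steps": sum each candidate step's token log-probs to get log P(step | context), then softmax across candidates to form a prior distribution for pUCT. A sketch under that assumption (function and variable names are hypothetical, not the paper's API):

```python
import math

def step_priors(step_token_logprobs):
    """Convert candidate reasoning steps into pUCT priors.
    `step_token_logprobs` maps each candidate step to the list of
    log-probs of its tokens; summing gives the sequence-level
    log-probability, and a (numerically stabilized) softmax over
    candidates yields a proper prior distribution."""
    seq_logps = {s: sum(lps) for s, lps in step_token_logprobs.items()}
    m = max(seq_logps.values())  # subtract the max before exponentiating
    exps = {s: math.exp(lp - m) for s, lp in seq_logps.items()}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}
```

Because whole steps rather than single tokens are the search actions, the branching factor stays small enough for the tree to be worth building.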

A clarification from the Hacker News discussion matters here: MCTS runs only during training. The distilled model's inference cost is identical to a standard RL-trained model, so the compute overhead stays in training and doesn't follow the model to production. Tambde picked Countdown over the more common <a href="/news/2026-03-14-multimind-ai-local-first-multi-llm-debate-synthesis">GSM8K benchmark</a> because combinatorial problems reward parallel adaptive search; on arithmetic word problems, GRPO already keeps pace with tree methods.
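Countdown asks for an arithmetic expression over a given set of numbers that reaches a target, each number used at most once with the four basic operations. A minimal verifier sketch of that rule (the paper's exact reward function is not specified here, so this is an assumption):

```python
import ast

def check_countdown(expr, numbers, target):
    """Verify a proposed Countdown solution: `expr` must parse to an
    expression built only from numeric literals and +, -, *, /, draw
    each literal from `numbers` at most once, and evaluate to `target`.
    A minimal checker sketch, not the paper's reward implementation."""
    tree = ast.parse(expr, mode="eval")
    allowed = {ast.Expression, ast.BinOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div}
    if any(type(n) not in allowed for n in ast.walk(tree)):
        return False  # reject names, calls, unary ops, etc.
    used = [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]
    pool = list(numbers)
    for u in used:
        if u not in pool:
            return False  # literal not available (or reused)
        pool.remove(u)
    return abs(eval(compile(tree, "<expr>", "eval")) - target) < 1e-9
```

The search-friendliness is visible in the verifier: many operator orderings and number subsets can reach the target, so adaptive exploration of partial expressions pays off in a way it does not on single-answer word problems.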

Earlier attempts to bring AlphaZero-style training to language models — including work on code generation and formal mathematical reasoning — stalled on value estimation problems and branching costs at the token level. <a href="/news/2026-03-14-percepta-ai-transformers-logarithmic-attention-inference">Step-level search</a> sidesteps both. Tambde's 11.3% result is an explicit first baseline; low absolute scores reflect the 1.5B parameter ceiling, not the method's limits, and he plans tests at larger scale with more compute. If pUCT gains hold there, GRPO's current dominance in reasoning RL has a credible challenger.