The technique, called Simple Self-Distillation (SSD), is exactly what it sounds like. You sample code solutions from a model at specific temperatures, then fine-tune the model on those same samples. No verifier, no teacher model, no reinforcement learning. Just the model training on its own outputs and getting better.
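The loop can be sketched end to end on a toy model. Here the "model" is just a categorical distribution over a five-token vocabulary and "fine-tuning" means re-estimating it from its own samples; the helper names (`sample_tokens`, `fit_to_samples`) and all numbers are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tokens(logits, temperature, n):
    """Sample n tokens from softmax(logits / temperature)."""
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(logits), size=n, p=p)

def fit_to_samples(samples, vocab_size, eps=1e-3):
    """'Fine-tune': re-estimate logits from the model's own samples."""
    counts = np.bincount(samples, minlength=vocab_size) + eps
    return np.log(counts / counts.sum())

logits = np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # initial toy model
for _ in range(3):                               # a few SSD rounds
    samples = sample_tokens(logits, temperature=0.7, n=10_000)
    logits = fit_to_samples(samples, vocab_size=len(logits))

p_final = np.exp(logits - logits.max())
p_final /= p_final.sum()
print(p_final)
```

Because the samples are drawn below temperature 1, each round of re-fitting concentrates probability mass on the high-likelihood tokens, which is the self-sharpening effect the method relies on. A real run would swap the categorical model for an LLM and the re-fit for a standard SFT pass over the sampled solutions.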

The results are hard to ignore. Qwen3-30B-Instruct went from 42.4% to 55.3% pass@1 on LiveCodeBench v6. That's a nearly 13-point jump from what amounts to self-study. The gains showed up most on harder problems, and the method worked across both Qwen and Llama architectures at 4B, 8B, and 30B scales. Meta's Navdeep Jaitly and Richard He Bai teamed up with Alibaba's Ruixiang Zhang on this one, which is notable given both companies compete on open-weight models.

Why does this work? The researchers point to something they call a "precision-exploration conflict" in how models decode tokens. When a model writes code, it needs to be precise about syntax but creative about problem-solving. These goals work against each other. SSD reshapes token distributions so the model suppresses low-probability noise where precision matters (like getting syntax right) while keeping useful diversity where exploration helps (like trying different solution paths). The model learns to recover from its own mistakes rather than just memorizing expert solutions.
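The reshaping effect is easy to see numerically. Training on self-samples drawn at temperature T < 1 pulls the model toward the tempered distribution p^(1/T)/Z: the low-probability "noise" tail collapses while the ranking among plausible candidates, the diversity useful for exploration, survives. The distribution below is made up for illustration, not taken from the paper.

```python
import numpy as np

def temper(p, T):
    """Return the temperature-T reshaping of distribution p."""
    q = p ** (1.0 / T)
    return q / q.sum()

# Hypothetical next-token distribution: two plausible continuations
# followed by a tail of low-probability noise tokens.
p = np.array([0.55, 0.30, 0.05, 0.04, 0.03, 0.02, 0.01])
q = temper(p, T=0.7)

tail_before = p[2:].sum()  # noise mass before reshaping
tail_after = q[2:].sum()   # noise mass after reshaping
print(f"noise mass: {tail_before:.3f} -> {tail_after:.3f}")
print(f"top-2 ratio: {p[0] / p[1]:.2f} -> {q[0] / q[1]:.2f}")
```

The tail mass drops by more than half while the second-best candidate still holds a meaningful share, which matches the intuition above: syntax-breaking noise gets suppressed, but the model keeps real alternatives to explore.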

This matters for anyone running local coding models. SSD offers a post-training path that doesn't require massive compute or external supervision. Earlier work like Self-Distillation Fine-Tuning (SDFT) from MIT and ETH Zurich explored similar territory. The fact that SSD works across model families suggests the precision-exploration conflict is a fundamental property of decoder-only transformers, not a quirk of one architecture.