A new paper at ICLR 2026 makes a provocative claim: large language models don't learn by accumulating knowledge. They learn by forgetting it.

"Learning is Forgetting: LLM Training as Lossy Compression" — from a cross-disciplinary team including Henry Conklin, Tom Hosking, Tan Yi-Chern, Jonathan D. Cohen, Sarah-Jane Leslie, Thomas L. Griffiths, Max Bartolo, and Seraphina Goldfarb-Tarrant — reframes what pre-training actually does. Rather than building a comprehensive store of information, models selectively compress their training data, discarding everything that doesn't serve next-token prediction. The forgetting isn't a flaw in the process. It's the mechanism.

The team grounds this argument in Information Bottleneck theory, a framework with roots in both information theory and cognitive science. Their empirical work spans multiple open-weight model families and shows that pre-training consistently pushes models toward the Information Bottleneck bound — the theoretical limit on how much a model can compress its inputs while still retaining the information the training objective needs. Different model families land at different points along that curve, likely reflecting differences in training data composition and recipes, but the same compression dynamic shows up across all of them.
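For readers unfamiliar with the framework: the classic Information Bottleneck objective (due to Tishby and colleagues) formalizes this trade-off. The paper may use a variant, but the canonical form seeks a representation $Z$ of the input $X$ that is as compressed as possible while preserving information about the prediction target $Y$:

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

Here $I(\cdot\,;\cdot)$ denotes mutual information and $\beta > 0$ sets the exchange rate between compression (shrinking $I(X;Z)$, i.e. forgetting about the input) and prediction (keeping $I(Z;Y)$ high). The "bound" the models approach is the frontier of optimal solutions to this trade-off.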

The more immediately useful finding is that compression quality is predictive. How well a model has compressed its training signal — and what structure it preserved in doing so — turns out to forecast performance across a wide range of downstream benchmarks. That amounts to a potential new diagnostic: a way to assess a model's likely capabilities from its internal representational structure, without running task-specific probes or fine-tuning experiments.
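The paper's actual diagnostic isn't reproduced here, but the intuition can be sketched with a standard compression proxy: the average number of bits a model spends per token of held-out text. The function and toy probability values below are purely illustrative, not the authors' metric.

```python
import math

def bits_per_token(token_probs):
    """Average negative log2-probability the model assigns to each
    observed token. Lower values mean the model compresses the text
    more tightly -- i.e. it sits closer to the compression bound."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities two models assign to the same text.
model_a = [0.5, 0.25, 0.4, 0.3]    # tighter compressor
model_b = [0.1, 0.05, 0.2, 0.1]    # looser compressor

print(f"model A: {bits_per_token(model_a):.2f} bits/token")
print(f"model B: {bits_per_token(model_b):.2f} bits/token")
```

On this toy input, model A spends fewer bits per token than model B; the paper's claim, loosely, is that a model's position on this kind of compression axis (together with what structure its representations preserve) forecasts downstream benchmark performance.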

For anyone trying to understand why frontier models trained on comparable data volumes can diverge so sharply in performance, the framework offers a candidate explanation. The gap may lie not in how much a model has seen, but in how efficiently it compressed what it saw.

The paper was submitted January 26 and last revised March 8, 2026. It is available on OpenReview under CC BY 4.0.