A new paper from Cornell University researchers Nathan Godey and Yoav Artzi, published on arXiv on March 10, 2026, identifies a fundamental training flaw present in every mainstream large language model. The paper, titled "Lost in Backpropagation: The LM Head is a Gradient Bottleneck" (arXiv:2603.10145), focuses on the LM head — the final linear projection layer that maps a model's internal hidden representations of dimension D to logits over a vocabulary of size V, where D is much smaller than V. While this dimensional mismatch has long been recognized as the "softmax bottleneck" limiting a model's expressive capacity, Godey and Artzi show it is simultaneously an optimization bottleneck: backpropagating V-dimensional gradients through a rank-D linear layer causes unavoidable lossy compression, destroying 95 to 99 percent of the gradient norm before it reaches the rest of the network.

The authors support their claims with both theoretical analysis and controlled empirical experiments. Their theoretical framework shows that a large fraction of the gradient at the logits is inevitably annihilated by the transpose of the LM head weight matrix during backpropagation: any component lying in the left null space of that matrix never reaches the backbone. In practice, they trained 2-billion-parameter models with varying effective output ranks and found that training efficiency for the same Transformer backbone degraded by up to a factor of 16 as the bottleneck worsened. A synthetic benchmark called SpamLang, designed to isolate optimization effects from expressivity effects, showed that even trivially simple patterns become unlearnable when the gradient bottleneck is present, cleanly establishing causality rather than mere correlation.
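The null-space argument can be made concrete with a small numerical sketch. This is my own illustration, not code from the paper, and the dimensions are arbitrary: for a random rank-D head, a generic V-dimensional logit gradient keeps only about sqrt(D/V) of its norm after projection onto the head's column space.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 512, 50_000  # illustrative sizes, not the paper's configurations

# LM head: a rank-D linear map from hidden states (R^D) to logits (R^V)
W = rng.standard_normal((V, D))

# Orthonormal basis for the column space of W -- the rank-D subspace
# through which any backpropagated logit gradient must pass
Q, _ = np.linalg.qr(W)

# A generic V-dimensional gradient at the logits
g = rng.standard_normal(V)

# Component of g that survives projection into the rank-D subspace;
# the rest lies in the left null space of W and is discarded
g_kept = Q @ (Q.T @ g)

retained = np.linalg.norm(g_kept) / np.linalg.norm(g)
print(f"gradient norm retained: {retained:.1%}")  # roughly sqrt(D/V), about 10% here
```

The random-matrix setting gives a clean baseline; the exact fraction in real models depends on the learned weights and the structure of the gradients, which is what the paper's 95-to-99-percent figure measures.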

Every major autoregressive language model trained today uses the same single linear projection followed by a softmax, regardless of whether the underlying architecture is a Transformer, a state-space model, or something else. That design monoculture means the bottleneck is universal. For frontier labs spending hundreds of millions of dollars on pretraining runs, the paper raises a pointed question: how much of that compute is generating useful gradient signal, and how much is being discarded at the output layer before it can shape the weights that matter?
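The universality claim follows from how the standard output layer is wired. In this minimal sketch (toy sizes of my own choosing, not from the paper), whatever the backbone computes, the only training signal it ever receives is the V-dimensional logit gradient pulled back through the head's transpose into a D-dimensional vector:

```python
import numpy as np

D, V = 8, 100  # toy hidden size and vocabulary
rng = np.random.default_rng(1)
W = rng.standard_normal((V, D)) * 0.1  # the LM head
h = rng.standard_normal(D)             # a backbone hidden state

# Forward: the single linear projection followed by a softmax
logits = W @ h
p = np.exp(logits - logits.max())
p /= p.sum()

# Backward: cross-entropy gradient at the logits is (p - one_hot(target))
target = 3
g_logits = p.copy()
g_logits[target] -= 1.0

# The backbone never sees g_logits itself -- only its D-dimensional
# image under W's transpose
g_h = W.T @ g_logits
print(g_logits.shape, "->", g_h.shape)  # (100,) -> (8,)
```

The same forward/backward structure applies whether `h` comes from a Transformer, a state-space model, or anything else, which is why the paper treats the bottleneck as architecture-independent.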

Godey has been building toward this conclusion for years. His 2023 work proposed "Headless Language Models," which replaced the LM head with a contrastive objective and achieved up to a 20-fold reduction in pretraining compute. A 2024 EACL paper showed that representation anisotropy is intrinsic to Transformer architectures, and a second 2024 paper linked the softmax bottleneck to saturation in small language models. The current work, his first major output from a postdoctoral position in Artzi's lab at Cornell, adds a third distinct failure mode, degraded optimization dynamics, to the case that the LM head as currently designed is a systematic weak point in modern language model training.