KV cache compression just got more interesting. Jayanth Chandra's new research introduces HAE (Hierarchical Attention Entropy), a method that summarizes tokens instead of deleting them. It tackles a real problem: as LLMs push toward million-token context windows, the KV cache grows linearly with sequence length, and traditional Top-K eviction strategies fail unpredictably. Tokens that look unimportant now can become critical anchors later, and pruning them causes large reconstruction errors exactly where you need accuracy most.
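To see why linear growth bites at million-token scale, here's a back-of-envelope sketch. The model dimensions below (layers, KV heads, head size, fp16) are illustrative assumptions for a mid-size model, not figures from the paper:

```python
# Back-of-envelope KV cache size: grows linearly with sequence length.
# All model dimensions are illustrative assumptions, not from the paper.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for keys and values, per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

for n in (8_192, 131_072, 1_000_000):
    print(f"{n:>9} tokens -> {kv_cache_bytes(n) / 2**30:.1f} GiB")
# → 1.0 GiB at 8K tokens, 16.0 GiB at 128K, 122.1 GiB at 1M
```

At these (assumed) dimensions, a million-token cache alone outgrows any single accelerator's memory, which is what makes compression schemes like HAE worth the trouble.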

The SRC pipeline (Selection-Reconstruction-Compression) works in three stages. First, it computes the Shannon entropy of each token's attention weights. High-entropy tokens with diffuse attention are routed to a "Recycle Bin," while low-entropy "anchor" tokens stay in the Active Cache. Then comes the clever part: instead of simply discarding the binned tokens, HAE treats them as a cluster represented by a centroid token, solved as an ordinary least squares problem via the Moore-Penrose pseudoinverse. Finally, singular value decomposition compresses these reconstructed tokens into compact representations that preserve their functional contribution.
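The three stages can be sketched in numpy. This is my own simplified reading, not the paper's exact formulation: the toy attention matrix, the keep ratio, the specific OLS objective (fitting a single centroid to the binned cluster), and the SVD rank are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, keep_ratio = 64, 16, 0.3

K = rng.normal(size=(n, d))                # toy key vectors standing in for KV cache entries
attn = rng.dirichlet(np.ones(n), size=n)   # toy attention rows: one distribution per token

# Stage 1 (Selection): Shannon entropy of each token's attention distribution.
entropy = -(attn * np.log(attn + 1e-12)).sum(axis=1)
n_keep = int(keep_ratio * n)
order = np.argsort(entropy)
anchors = order[:n_keep]   # low entropy, focused attention -> stay in Active Cache
binned = order[n_keep:]    # high entropy, diffuse attention -> Recycle Bin

# Stage 2 (Reconstruction): represent the binned cluster by a centroid token.
# Solving min_c ||ones @ c - B||_F via the Moore-Penrose pseudoinverse is an
# OLS problem whose solution is the cluster mean (an illustrative objective).
B = K[binned]
ones = np.ones((len(binned), 1))
centroid = np.linalg.pinv(ones) @ B        # shape (1, d): the OLS centroid

# Stage 3 (Compression): SVD truncation keeps a compact low-rank summary
# of the binned tokens instead of the full matrix.
U, s, Vt = np.linalg.svd(B, full_matrices=False)
r = 4                                      # assumed target rank
factors = (U[:, :r] * s[:r], Vt[:r])       # stored factors: much smaller than B
B_approx = factors[0] @ factors[1]         # rank-r reconstruction on demand
```

The pseudoinverse step here literally recovers the cluster mean; the paper's actual OLS objective presumably fits the centroid so that attention outputs, not raw keys, are preserved, but the mechanics (pseudoinverse for the least-squares solve, SVD for the final compaction) follow the description above.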

The benchmarks look strong. HAE achieves 3x lower reconstruction error than Top-K at a 30% keep ratio while actually using less memory in real-world testing. But there's a real trade-off that Hacker News commenters correctly flagged: OLS and SVD are computationally expensive operations. The research doesn't include end-to-end inference latency benchmarks, which matters a lot for practical deployment. Lower error is good, but not if it makes inference much slower. Until someone benchmarks actual inference latency, this stays in the research pile.
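The flagged trade-off is easy to feel in a microbenchmark: Top-K selection is one partial sort over a score vector, while an SVD over the same cache is a full dense factorization. The cache shape and the magnitude-based score are assumptions for illustration, not HAE's actual scoring rule:

```python
import time
import numpy as np

rng = np.random.default_rng(1)
cache = rng.normal(size=(4096, 128))   # toy KV cache; shape is an assumption

def topk_select(x, k):
    # Top-K eviction: score each token, then a single partial sort -- cheap.
    scores = np.abs(x).sum(axis=1)
    return np.argpartition(scores, -k)[-k:]

t0 = time.perf_counter()
kept = topk_select(cache, 1024)
t_topk = time.perf_counter() - t0

t0 = time.perf_counter()
np.linalg.svd(cache, full_matrices=False)   # the compression step HAE adds
t_svd = time.perf_counter() - t0

print(f"Top-K: {t_topk * 1e3:.2f} ms   SVD: {t_svd * 1e3:.2f} ms")
```

On typical hardware the SVD costs orders of magnitude more than the partial sort, which is exactly why end-to-end latency numbers, not just reconstruction error, would settle whether the approach is deployable.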