Callum McDougall, an AI safety researcher at Anthropic, just published a thorough technical explainer on KL divergence. Worth your time if you've ever struggled with the concept. The post, hosted on his blog Perfectly Normal, walks through six and a half different ways to build intuition around this core measure from information theory.
And it works. KL divergence is genuinely confusing at first. It's asymmetric. It can blow up to infinity when the model assigns near-zero probability to outcomes that actually occur. McDougall tackles these properties by approaching the same math from multiple angles.
KL divergence shows up everywhere in machine learning. Training language models involves measuring how much a learned distribution differs from reality. McDougall frames it through expected surprise, hypothesis testing, maximum likelihood estimation, suboptimal coding, gambling games, and Bregman divergence. The through-line is always the same: D_KL(P || Q) measures the cost of using model Q when P is actually true. That's why it's not symmetric. You care about how badly Q performs in the world where P is true, not some average over both directions.
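To make that asymmetry concrete, here's a minimal sketch with toy numbers of my own (not McDougall's examples) that evaluates the standard formula D_KL(P || Q) = Σ P(x) log(P(x) / Q(x)) in both directions:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.4, 0.1]   # "true" distribution
q = [0.8, 0.1, 0.1]   # model

print(kl_divergence(p, q))   # ≈ 0.32 nats
print(kl_divergence(q, p))   # ≈ 0.24 nats — different number: KL is asymmetric

# If the model puts near-zero probability on an outcome the true
# distribution considers possible, the divergence explodes:
print(kl_divergence([0.5, 0.5], [1.0 - 1e-12, 1e-12]))  # ≈ 13 nats, → ∞ as q → 0
```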
The Hacker News discussion zeroed in on the coding and gambling analogies. Here's one way to think about the first: if your data follows one distribution but your ISP compresses it with a code optimized for a different one, the KL divergence tells you exactly how many extra bits per symbol they're wasting on average. Or take the gambling angle. If you bet as though the odds are Q when they're actually P, KL divergence measures how much slower your bankroll grows, in the long run, than it would if you bet with the true odds. Simple once you see it.
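A back-of-the-envelope version of the coding argument, again with hypothetical numbers rather than anything from the post: the gap between the cross-entropy (bits per symbol paid with the wrong code) and the entropy (bits per symbol with the optimal code) is exactly the KL divergence.

```python
import numpy as np

p = np.array([0.5, 0.4, 0.1])   # true symbol frequencies
q = np.array([0.8, 0.1, 0.1])   # frequencies the compressor assumes

entropy_p     = -np.sum(p * np.log2(p))        # best possible bits per symbol
cross_entropy = -np.sum(p * np.log2(q))        # bits per symbol with the q-optimized code
kl_bits       =  np.sum(p * np.log2(p / q))    # D_KL(p || q) in bits

print(cross_entropy - entropy_p)  # ≈ 0.46 wasted bits per symbol
print(kl_bits)                    # same number: the waste is the KL divergence
```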
Commenters also pointed out that the maximum likelihood estimation perspective deserves more attention. Minimizing the KL divergence from the empirical distribution of your data to the model is the same thing as finding the maximum likelihood estimator. That connects information theory straight to basic statistical inference.
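Here's a hedged numeric sketch of why, with made-up numbers (not taken from the post or the thread): the average negative log-likelihood of a sample under a model q decomposes into the entropy of the empirical distribution plus D_KL(empirical || q), and only the KL term depends on q, so maximizing likelihood and minimizing KL pick out the same model.

```python
import numpy as np

rng = rng = np.random.default_rng(0)
data = rng.choice(3, size=10_000, p=[0.5, 0.4, 0.1])   # samples from the "truth"

counts = np.bincount(data, minlength=3)
p_hat = counts / counts.sum()                           # empirical distribution

q = np.array([0.8, 0.1, 0.1])                           # some candidate model

avg_nll = -np.mean(np.log(q[data]))                     # average negative log-likelihood
entropy = -np.sum(p_hat * np.log(p_hat))                # H(p_hat), independent of q
kl      =  np.sum(p_hat * np.log(p_hat / q))            # D_KL(p_hat || q)

print(avg_nll, entropy + kl)   # agree up to floating-point error
# Choosing q = p_hat drives the KL term to zero — which is exactly the MLE.
```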
McDougall isn't just a casual blogger. He created the ARENA program for training alignment researchers, previously worked at DeepMind, and has ties to Oxford's Future of Humanity Institute. If you work with language models or probabilistic systems, having solid mental models for KL divergence matters. This post collects more useful perspectives in one place than most textbooks manage.