Bruno Gavranović at the Cybercat Institute has laid out a problem that anyone working with AI code generation has probably felt but struggled to articulate: LLMs don't actually understand types. They're trained to predict the next token, producing output of type List Token, and everything after that is patching. The typechecking happens after the fact, divorced from training entirely. It's like teaching someone to write English prose and then hoping they'll accidentally produce valid Python.
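The split the post describes can be made concrete with a toy sketch. Everything here is invented for illustration (toy_generate, toy_typecheck are hypothetical stand-ins, not anything from Gavranović's post): the "model" emits a flat list of tokens, and the typechecker is a separate function applied afterward, with no path back into any training signal.

```python
from typing import List

Token = str

def toy_generate(prompt: str) -> List[Token]:
    # A stand-in for an LLM: it emits tokens with no notion of types.
    return prompt.split() + ["=", "41", "+", "1"]

def toy_typecheck(tokens: List[Token]) -> bool:
    # Post-hoc check, entirely outside any training loop: here we just
    # require an "=" with something on both sides of it.
    return "=" in tokens and tokens.index("=") not in (0, len(tokens) - 1)

tokens = toy_generate("x")
ok = toy_typecheck(tokens)  # checking happens after generation, never during training
```

The point of the sketch is the interface: toy_generate has type prompt -> List[Token], and toy_typecheck only ever sees the finished list. Nothing about type validity influences how the list was produced.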

The industry has two main workarounds. Retry loops generate code, try to compile it, and feed the errors back if it fails. Constrained decoding masks out invalid tokens before each sampling step. Both work, after a fashion, and raw capability keeps climbing: FrontierMath scores jumped from under 2% to almost 50% in two years, and GPT-4 plays chess at an intermediate level purely from general training. But retry loops are wasteful, catching errors only after a full generation completes, and constrained decoding forces the model into low-probability branches, producing output that typechecks but drifts toward nonsense. Neither approach updates the model's weights; the next session starts from the same blind state.
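Both workarounds can be sketched in a few lines. This is a minimal illustration, not any production system: the "compiler" just checks balanced parentheses over a three-token vocabulary, and flaky_generate and flat_scores are hypothetical stand-ins for a model's sampler and logits.

```python
from typing import Callable, Dict, List, Optional

Token = str

def compiles(code: List[Token]) -> bool:
    # Toy "compiler": balanced parentheses stand in for typechecking.
    depth = 0
    for t in code:
        depth += {"(": 1, ")": -1}.get(t, 0)
        if depth < 0:
            return False
    return depth == 0

def retry_loop(generate: Callable[[str], List[Token]],
               prompt: str, max_tries: int = 10) -> Optional[List[Token]]:
    # Workaround 1: generate fully, check, retry with error feedback.
    # Errors are only caught after a complete generation.
    for _ in range(max_tries):
        code = generate(prompt)
        if compiles(code):
            return code
        prompt += " [error: does not compile]"  # crude feedback
    return None

def constrained_decode(score_fn: Callable[[List[Token]], Dict[Token, float]],
                       length: int) -> List[Token]:
    # Workaround 2: mask invalid tokens before each sampling step,
    # so the output is valid by construction.
    out: List[Token] = []
    depth = 0
    for i in range(length):
        remaining = length - i
        legal: Dict[Token, float] = {}
        for t, s in score_fn(out).items():
            new_depth = depth + {"(": 1, ")": -1}.get(t, 0)
            # Mask tokens that go negative or can no longer be closed in time.
            if 0 <= new_depth <= remaining - 1:
                legal[t] = s
        tok = max(legal, key=legal.get)  # greedy over the masked scores
        depth += {"(": 1, ")": -1}.get(tok, 0)
        out.append(tok)
    return out

# Toy usage: a "model" that emits an unbalanced program twice, then a valid one.
attempts = {"n": 0}
def flaky_generate(prompt: str) -> List[Token]:
    attempts["n"] += 1
    return ["(", "("] if attempts["n"] < 3 else ["(", ")"]

fixed = retry_loop(flaky_generate, "write code")

def flat_scores(prefix: List[Token]) -> Dict[Token, float]:
    # A scorer that always prefers "(": the mask does the real work.
    return {"(": 0.5, ")": 0.3, "x": 0.2}

valid = constrained_decode(flat_scores, 4)
```

Notice the asymmetry the article complains about: retry_loop throws away two full generations before succeeding, while constrained_decode produces valid output but overrides the scorer's preference whenever the mask demands it. Neither touches the weights behind generate or score_fn.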

AlphaZero is the proof that training with structure matters. It uses the game's structure during training and reaches superhuman play, an Elo above 3400, with roughly 30x fewer parameters than GPT-4. It never makes an illegal move; GPT-4 makes one in roughly 30% of its games. Encode the rules of dependently-typed languages like Idris, Lean, and Agda into training the same way, and code generation could see the same kind of gains; Vera, a language built for LLMs, is one proposal in that direction. Whether this leads to practical training methods remains an open question.
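The mechanism behind "never makes an illegal move" is simple to sketch: the policy network's logits are masked to the legal moves before the softmax, so illegal moves carry zero probability during both self-play training and evaluation. A minimal sketch, with illustrative move names and logits that are not from any real engine:

```python
import math
from typing import Dict, List

def masked_policy(logits: Dict[str, float],
                  legal_moves: List[str]) -> Dict[str, float]:
    # AlphaZero-style masking: restrict to legal moves, then softmax,
    # so illegal moves get exactly zero probability.
    masked = {m: logits[m] for m in legal_moves}
    z = sum(math.exp(v) for v in masked.values())
    return {m: math.exp(v) / z for m, v in masked.items()}

# Toy position: the raw network likes an illegal move most,
# but the mask removes it before anything is sampled.
logits = {"e2e4": 2.0, "e1e5": 3.0, "g1f3": 1.0}  # e1e5 is the illegal one here
legal = ["e2e4", "g1f3"]
policy = masked_policy(logits, legal)
```

Because the mask is applied inside the training loop, the gradient only ever flows through moves the rules permit; that is exactly the structural coupling the article says next-token training lacks.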