A paper published in the journal Machine Learning by researchers Bei Zhou and Soren Riis has identified a structural failure in the self-play reinforcement learning methods powering DeepMind's AlphaGo and AlphaZero. The research shows that while these techniques achieve superhuman performance at chess and Go, they collapse on "impartial games," a mathematically defined class in which both players choose from exactly the same set of moves in any position. The test case is Nim, a game in which players take turns removing matchsticks from rows. After 500 training iterations on a seven-row Nim board, the AlphaZero-style AI performed no better, statistically, than a system picking moves at random.
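For readers unfamiliar with the game, the rules are minimal: a position is a set of rows of matchsticks, each turn a player removes one or more sticks from a single row, and under the common "normal play" convention the player who takes the last stick wins. The sketch below is illustrative only — the seven row sizes and the normal-play convention are assumptions, not details taken from the paper — but it shows a playable environment and the kind of uniformly random baseline the trained model was compared against:

```python
import random

# Illustrative seven-row board; the paper's exact row sizes are an assumption here.
START = [1, 2, 3, 4, 5, 6, 7]

def legal_moves(rows):
    """Every (row_index, sticks_to_take) pair allowed from this position."""
    return [(i, take) for i, n in enumerate(rows) for take in range(1, n + 1)]

def apply_move(rows, move):
    """Return the position after removing `take` sticks from row `i`."""
    i, take = move
    new = list(rows)
    new[i] -= take
    return new

def play_random_game(seed=0):
    """Two uniformly random players; returns the winner (0 or 1), assuming
    normal play, i.e. whoever takes the last stick wins."""
    rng = random.Random(seed)
    rows, player = list(START), 0
    while any(rows):
        rows = apply_move(rows, rng.choice(legal_moves(rows)))
        player = 1 - player
    return 1 - player  # the side that just moved took the last stick
```

An agent that is statistically indistinguishable from `play_random_game`'s players is, in effect, not playing the game at all.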

The failure is structural, not incidental. Playing Nim optimally requires computing a parity function over the board state: a global discrete property that cannot be approximated by matching local features or by correlating board positions with win probabilities. Parity is maximally non-local: removing a single matchstick anywhere on the board can flip a winning position into a losing one, so no collection of local patterns reliably predicts the outcome. Yet that correlation-through-self-play loop is exactly what AlphaZero runs, and Zhou and Riis argue the two are mathematically incompatible: gradient-descent-trained networks can learn pattern association, but they cannot learn symbolic parity over an arbitrary number of variables. The finding generalizes through Sprague-Grundy theory, which shows that any position in any impartial game is equivalent to some Nim configuration, so the collapse applies not to matchsticks specifically but to the entire category.
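Concretely, the parity in question is the nim-sum: write each row size in binary and XOR them all together. Bouton's classical theorem says the player to move wins exactly when the nim-sum is nonzero, and that a winning move restores it to zero; Sprague-Grundy theory then assigns every impartial-game position a Grundy number, computed as the "minimum excludant" of its successors' numbers, making it equivalent to a single Nim row of that size. A minimal sketch of both — these are standard textbook results, not code from the paper:

```python
from functools import lru_cache, reduce

def nim_sum(rows):
    """The parity function: bitwise XOR of all row sizes."""
    return reduce(lambda a, b: a ^ b, rows, 0)

def winning_move(rows):
    """Bouton's theorem: if the nim-sum s is nonzero, the mover wins by
    shrinking some row n to n ^ s, which zeroes the nim-sum; if s == 0,
    every move loses against optimal play and we return None."""
    s = nim_sum(rows)
    if s == 0:
        return None
    for i, n in enumerate(rows):
        target = n ^ s
        if target < n:
            return (i, n - target)  # take this many sticks from row i

def mex(values):
    """Minimum excludant: smallest non-negative integer not in `values`."""
    m = 0
    while m in values:
        m += 1
    return m

@lru_cache(maxsize=None)
def grundy(n):
    """Grundy number of a single Nim row of size n, via the mex of all
    smaller rows reachable in one move. For Nim itself, grundy(n) == n;
    Sprague-Grundy theory reduces any impartial-game position to a row
    whose size is its Grundy number."""
    return mex({grundy(k) for k in range(n)})
```

For example, from rows `[3, 4, 5]` the nim-sum is 2, and `winning_move` takes two sticks from the first row, leaving `[1, 4, 5]` with nim-sum zero. An evaluation function that cannot represent this XOR cannot separate winning positions from losing ones, which is the incompatibility the paper formalizes.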

The chess section of the paper is pointed. Zhou and Riis note that analogous vulnerabilities exist in chess-playing AIs — long mating combinations requiring extended chains of forced moves are routinely missed — but Nim-like configurations are rare enough in over-the-board chess that the flaw stays hidden. That obscurity has likely slowed recognition of how deep the limitation runs.

The gap becomes harder to ignore as Alpha-style architectures get applied to mathematics and theorem proving. Those domains are built on exactly the kind of structured logical reasoning — chains of deductive steps, parity checks, global symbolic constraints — that self-play gradient descent cannot handle by design. Zhou and Riis make the theoretical case for hybrid neuro-symbolic systems that pair statistical learning with explicit symbolic computation, and give researchers a precise, falsifiable criterion for predicting where gradient-trained models will fail before deployment.