A team at MIT's Improbable AI Lab has found that 164 million tokens of synthetic grid sequences generated by Neural Cellular Automata (NCA) outperform 1.6 billion tokens of natural language as pre-training data for language models — a 10x token efficiency advantage that holds across perplexity and downstream reasoning benchmarks.

The paper, by Seungwook Han, Akarsh Kumar, Pulkit Agrawal, and Dan Lee, challenges the assumption that pre-training value comes from linguistic richness. Their argument: what matters is structural complexity — the kind that forces a model to infer hidden rules rather than exploit surface-level patterns. NCA sequences are abstract spatiotemporal grids, conceptually similar to Conway's Game of Life but with neural networks replacing fixed rules. Each sequence is governed by a unique latent rule the model has never seen, so to predict the next token it has to figure out what rule is running. The authors argue this rule-inference process directly builds the in-context learning circuits that general reasoning depends on.
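The generation idea can be sketched in a few lines. This is a toy illustration, not the authors' code: a small random MLP plays the role of the latent update rule, each cell's next state depends on itself and its neighbors, and the rolled-out grid states are discretized into a token stream. All function names and sizes here are our own assumptions.

```python
import numpy as np

def rollout_nca(grid_size=8, steps=6, hidden=16, seed=0):
    """Toy NCA-style sequence generator (illustrative sketch only).

    The random MLP weights are the hidden rule: every seed yields a
    different dynamical system, so a model trained on many such
    sequences must infer the rule in context to predict the next token.
    """
    rng = np.random.default_rng(seed)
    # Random weights = one unique latent rule per sequence.
    W1 = rng.normal(0, 1, (3, hidden))   # input: left neighbor, cell, right neighbor
    W2 = rng.normal(0, 1, (hidden, 1))
    state = rng.uniform(-1, 1, grid_size)
    tokens = []
    for _ in range(steps):
        left = np.roll(state, 1)
        right = np.roll(state, -1)
        x = np.stack([left, state, right], axis=1)       # (grid_size, 3)
        state = np.tanh(np.tanh(x @ W1) @ W2).squeeze(-1)  # next grid state in [-1, 1]
        # Discretize each cell into a small vocabulary of tokens.
        tokens.extend(np.digitize(state, np.linspace(-1, 1, 15)).tolist())
    return tokens

seq = rollout_nca()
print(len(seq))  # grid_size * steps tokens
```

Changing the seed swaps in a new hidden rule while keeping the surface format identical, which is exactly the property that makes surface-pattern shortcuts useless.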

The perplexity improvements are consistent across domains: 5.7% on OpenWebText, 5.2% on OpenWebMath, and 4.2% on CodeParrot, all under matched 164M token budgets. On reasoning benchmarks: <a href="/news/2026-03-14-multimind-ai-local-first-multi-llm-debate-synthesis">GSM8K</a> goes from 3.82% to 4.36%, HumanEval from 6.75% to 7.49%, BigBench-Lite normalized accuracy from 20.91% to 26.51%. When C4 natural language pre-training is scaled to 1.6 billion tokens — ten times the NCA budget — NCA still achieves 5% better final perplexity and converges 1.4 times faster.

The mechanistic picture is specific. Ablations show that attention layers, not MLP layers, carry the transferable benefits from NCA pretraining — consistent with the induction head literature on how in-context learning emerges. The team also found that NCA sequence complexity, tunable via gzip compression ratio, needs to match the target domain: simpler dynamics transfer better to code, while more complex dynamics benefit math and web text. That is a new axis for data curation with no dependence on linguistic content at all.
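The compression-ratio knob itself is cheap to compute. The sketch below shows the general idea with Python's standard `gzip` module — the exact normalization and tokenization the paper uses are not specified here, so treat the ratio definition as our assumption: regular dynamics compress well (low ratio), while complex dynamics resist compression (high ratio).

```python
import gzip
import random

def gzip_complexity(tokens):
    """Compression ratio as a cheap structural-complexity proxy.

    tokens: sequence of small ints (0-255), serialized as raw bytes.
    Returns compressed size / raw size; the normalization is our choice.
    """
    raw = bytes(tokens)
    return len(gzip.compress(raw)) / len(raw)

# A highly regular sequence compresses well (low ratio)...
regular = gzip_complexity([1, 2, 3, 4] * 256)

# ...while a pseudo-random one does not (ratio near or above 1).
random.seed(0)
noisy = gzip_complexity([random.randrange(256) for _ in range(1024)])

print(regular, noisy)
```

In this framing, curating synthetic data for a target domain means sweeping the NCA dynamics toward the compression-ratio band that transfers best, with no reference to linguistic content.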

Two pressures sit behind this research. High-quality internet text is projected to become scarce around 2028. Agrawal's lab has also published a cluster of papers arguing that next-token prediction on natural language couples reasoning with knowledge in ways that limit progress toward general intelligence — synthetic data that separates the two is the proposed fix. The paper puts its core claim plainly: "the structural complexity of pre-training data, rather than its semantic content, is the key driver of <a href="/news/2026-03-14-tree-search-distillation-via-mcts-ppo-outperforms-grpo-on-reasoning-tasks">transferable reasoning</a>." Whether that advantage survives at larger compute scales is what the authors leave for future work — and what foundation model developers will want to see answered before rearchitecting their pipelines.