Two MIT researchers have found a way to make language models smarter before they ever see a word of human text: training them first on sequences generated by abstract grid simulations called Neural Cellular Automata.

The technique inserts a "pre-pre-training" stage before standard language training on web text, math, or code. The model processes synthetic token sequences produced by randomly sampled neural networks running on a grid. The sequences contain no language. What they contain is pattern — each sequence encodes a hidden rule that the model can only recover by attending carefully to context.
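The paper's actual generator is not reproduced here, but the idea can be sketched in a few lines: sample a small neural network at random, run it as a local update rule over a grid, and discretize the cell states into tokens. Every name, size, and parameter below is an illustrative assumption, not the authors' implementation.

```python
import math
import random

def random_mlp(in_dim, hidden=8, seed=0):
    """Sample a tiny fixed-weight MLP (in_dim -> hidden -> 1, tanh) to act
    as the hidden update rule. The weights ARE the hidden rule."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [rng.gauss(0, 1) for _ in range(hidden)]
    def rule(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return math.tanh(sum(w * hi for w, hi in zip(w2, h)))
    return rule

def nca_tokens(size=16, steps=8, vocab=64, seed=0):
    """Run the sampled rule on a 1-D grid (wrap-around neighbourhoods) and
    emit token ids. Each token depends on the hidden rule plus spatial
    context, so recovering the rule requires attending across the sequence."""
    rule = random_mlp(3, seed=seed)
    rng = random.Random(seed + 1)  # separate stream for the initial grid
    grid = [rng.uniform(-1, 1) for _ in range(size)]
    tokens = []
    for _ in range(steps):
        grid = [rule([grid[i - 1], grid[i], grid[(i + 1) % size]])
                for i in range(size)]
        # discretize cell states in (-1, 1) to a small token vocabulary
        tokens.extend(int((v + 1) / 2 * (vocab - 1)) for v in grid)
    return tokens
```

Each fresh seed yields a new hidden rule and thus a new sequence family, which is what makes the corpus cheap to scale.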

The results, from Seungwook Han and Pulkit Agrawal, make a strong case on data efficiency grounds. Just 164 million NCA tokens before language training produced a 6% average perplexity reduction and 1.6× faster convergence compared to training from scratch. More telling: those 164 million synthetic tokens outperformed 1.6 billion tokens of real web text — the C4 dataset — on final perplexity. That is a 10× advantage by token count at matched quality. Reasoning benchmarks followed the same direction. GSM8K math accuracy rose from 3.82% to 4.36%, HumanEval code accuracy from 6.75% to 7.49%, and BigBench-Lite general reasoning from around 21% to 27%.

The mechanistic argument is that internet text is structurally redundant. Co-occurrence patterns, human idioms, and cultural repetition let models approximate reasoning through shortcut learning without actually building the underlying circuits. Abstract dynamical systems remove all of that, leaving only structure. Ablations showed that attention layers carry the most transferable computation from NCA pre-pre-training, while MLP layers stay more domain-specific — consistent with the idea that the technique works by accelerating the formation of induction heads, the attention circuits that power in-context learning across modern LLMs.
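The induction-head computation the ablations point to can be made concrete with a toy task (illustrative, not from the paper): in a sequence built so that every repeat of a token is followed by the same successor as its first occurrence, the only winning strategy is to look back for the last occurrence of the current token and copy what followed it.

```python
import random

def induction_sequence(length=32, vocab=10, seed=0):
    """Sequence with a hidden per-sequence rule: after a token's first
    occurrence, every later occurrence is followed by the same successor
    (the classic 'A B ... A -> B' induction pattern)."""
    rng = random.Random(seed)
    successor = {}  # the hidden rule; never shown to the model
    seq = [rng.randrange(vocab)]
    while len(seq) < length:
        prev = seq[-1]
        if prev not in successor:
            successor[prev] = rng.randrange(vocab)
        seq.append(successor[prev])
    return seq

def induction_predict(seq, i):
    """Predict token i by finding the last earlier occurrence of token i-1
    and copying its successor -- the lookup an induction head implements."""
    prev = seq[i - 1]
    for j in range(i - 2, -1, -1):
        if seq[j] == prev:
            return seq[j + 1]
    return None  # rule not yet revealed by the context
```

By construction the context-lookup strategy is exact whenever the rule has been revealed, while any fixed co-occurrence table learned across sequences fails, since each sequence draws its own successor map.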

The timing is not incidental. High-quality natural language training data is broadly expected to run short around 2028, a constraint increasingly discussed as a ceiling on frontier model scaling. If synthetic data can handle the structural scaffolding before language training begins — and do it at a tenth the token cost — it changes the economics of building the reasoning models that sit at the core of autonomous agent systems. Han and Agrawal also show that NCA complexity, measurable by gzip compression ratio, can be tuned per domain: simpler dynamics appear to benefit code models, richer dynamics help math and web reasoning. That gives labs a concrete lever when developing specialized agent models for enterprise, coding, or scientific applications.
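The gzip lever is straightforward to operationalize: compress the token stream and take the ratio of compressed to raw size, where a higher ratio means less redundancy and richer dynamics. A minimal sketch (the function name and byte encoding are assumptions, not the authors' code):

```python
import gzip

def gzip_complexity(tokens):
    """Compression ratio of a token sequence: compressed bytes / raw bytes.
    Near 0 for highly redundant dynamics, near 1 for incompressible ones."""
    raw = bytes(t % 256 for t in tokens)  # assumes a byte-sized vocabulary
    return len(gzip.compress(raw)) / len(raw)

# a constant sequence (trivial dynamics) compresses far better
# than a structured but non-repeating one (richer dynamics)
simple = [0] * 1000
varied = [(i * 37) % 251 for i in range(1000)]
```

Under the paper's recipe, a lab tuning an NCA corpus for a code model would then steer sampling toward lower-ratio dynamics, and toward higher-ratio dynamics for math or web reasoning.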