Nick Levine, David Duvenaud, and Alec Radford have released Talkie-1930-13b, a 13-billion-parameter language model trained exclusively on 260 billion tokens of English text published before 1931. The model has never seen a webpage. Never read about World War II. Has no concept of digital computers.

It's the largest "vintage" language model built to date. And the team plans to keep scaling. A GPT-3-level version is targeted for release this summer. They estimate over a trillion tokens of historical text are available, which puts GPT-3.5-level training within reach.

The research, published on talkie-lm.com, treats the model as a testbed for prediction and generalization. The team measured how "surprised" Talkie was by nearly 5,000 historical events from the New York Times' "On This Day" feature. Events after the 1930 cutoff were consistently more surprising than those before. The gap peaked for events in the 1950s and 1960s, then leveled off. They also tested whether the model could independently arrive at post-1930 inventions like the helicopter, Turing machines, and xerography. As Demis Hassabis has asked in similar contexts: could a model trained only on text up to 1911 discover general relativity on its own?
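The write-up doesn't spell out how "surprise" was scored, but a common choice, and a reasonable guess at what is meant here, is the mean negative log-likelihood per token of an event description under the model. The sketch below shows that calculation with the Hugging Face transformers API; the checkpoint name is a placeholder, not the team's published identifier, and the two example events are illustrative.

```python
# Minimal sketch: score a text's "surprisal" as mean per-token negative
# log-likelihood under a causal language model. Assumes a local or hosted
# checkpoint; "talkie-1930-13b" is a hypothetical name for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "talkie-1930-13b"  # placeholder identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def surprisal(text: str) -> float:
    """Mean negative log-likelihood (nats per token) of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean
        # cross-entropy over the predicted (shifted) tokens.
        loss = model(ids, labels=ids).loss
    return loss.item()

# An event before the training cutoff vs. one after it:
print(surprisal("The stock market crashed in October 1929."))
print(surprisal("The first artificial satellite was launched into orbit in 1957."))
```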

Because Talkie has zero exposure to modern data, it's contamination-free by construction. The researchers gave it a Python programming test (HumanEval) using only in-context examples, despite the model having never encountered digital computers. It performed poorly compared to modern models trained on web data that includes code. But it improved steadily with scale. Every correct solution was either a simple one-line program or a small modification to an example. In one case, when given a rotation cipher encoding function, the model produced the decoding function by swapping an addition for a subtraction. Some grasp of inverse functions, maybe.
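For concreteness, here is a small Python illustration of the kind of encode/decode pair described: a rotation cipher whose inverse is obtained by turning the addition into a subtraction. The function names and the shift of 5 are illustrative, not quoted from the benchmark or from the model's output.

```python
def encode_shift(s: str) -> str:
    """Encode a lowercase string by rotating each letter forward 5 places."""
    return "".join(chr(((ord(ch) - ord("a") + 5) % 26) + ord("a")) for ch in s)

def decode_shift(s: str) -> str:
    """Invert encode_shift: the same rotation, with the addition swapped for a subtraction."""
    return "".join(chr(((ord(ch) - ord("a") - 5) % 26) + ord("a")) for ch in s)

# Round trip: decoding an encoded string recovers the original.
assert decode_shift(encode_shift("helloworld")) == "helloworld"
```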

For evaluation and post-training, the team used Claude Sonnet 4.6 and Claude Opus 4.6. The Talkie models are available on GitHub and Hugging Face.

The project joins a growing body of work on vintage language models, alongside efforts like Ranke-4B, Mr. Chatterbox, and Machina Mirabilis. The broader question driving this work: how much of what we think we know about LLMs is actually about language and cognition, and how much is an artifact of the web as a single dataset? Training on different source material produces different kinds of models. Studying those differences could tell us something real about how these systems work.