Meet Talkie-1930: An Open-Weight 13B LLM Trained Exclusively on Pre-1931 English Text for Historical Consulting and General Research

What if a language model had never heard of the Internet, smartphones, or World War II? That is not a thought experiment: it is exactly what a research team led by Nick Levine, David Duvenaud, and Alec Radford has built. They call it Talkie, and it may be the largest deliberately restricted language model ever released to the public.
Talkie is a 13-billion-parameter open-weight language model trained exclusively on pre-1931 English text. Developed by a non-profit team, the project demonstrates what the researchers call a "vintage language model": an LM whose knowledge is frozen not at a recent training cutoff, but at a point deep in history.
What Exactly Is a Vintage Language Model?
To understand Talkie, you first need to understand the concept behind it. Modern LLMs like GPT-4, LLaMA, and Mistral are trained on massive crawls of the contemporary web. Their knowledge reflects the world as it exists today, or as of the day their training ended. A vintage language model flips this on its head: it is deliberately trained only on historical data, so that its "worldview" is pinned somewhere in the past.
For Talkie, that cutoff is December 31, 1930, a well-chosen date: works published before 1931 have entered the public domain in the United States, making pre-1931 text legally usable for training.
The model, officially named talkie-1930-13b-base, was trained on 260 billion tokens of English text published before 1931, including books, newspapers, periodicals, scientific journals, patents, and case law. An instruction-tuned variant, talkie-1930-13b-it, produced in a separate post-training phase, is also available. The team runs a 24/7 live demo at talkie-lm.com/chat, in which Claude Sonnet 4.6 continuously interviews the instruction-tuned model, letting visitors observe Talkie's voice and knowledge in real time.
Why Model From 1930?
This is not a nostalgia project. The research team identified several practical, technically meaningful use cases that make Talkie interesting to the AI research community.
1. Contamination-free benchmarking: Benchmark contamination, where test data leaks into training data, is one of the most persistent and underappreciated problems in modern LLM evaluation. Because Talkie was trained only on pre-1931 text, it is uncontaminated by construction with respect to any modern benchmark. This opens up a clean experimental setting for testing how well an LM can generalize beyond its training distribution. For example, the team tested whether Talkie could learn Python, a language that did not exist in 1930, from a handful of demonstrations supplied in context. Using the HumanEval benchmark, they found that while the vintage models dramatically underperform web-trained models, they "slowly improve on this task at scale."
2. Forecasting and temporal surprisal: Inspired by Calcifer Computing's work on Temporal Language Models, the team used Talkie to measure the surprisal (in bits per byte) of descriptions of historical events from the New York Times "On This Day" feature. Events after 1930, Talkie's knowledge cutoff, are the most surprising to the model, with the effect most pronounced for events of the 1950s and 1960s, followed by a plateau. This provides a systematic setup for studying how forecasting ability scales with model size and how performance degrades over longer time horizons.
3. LLM identity and personality: Because Talkie was trained on a very different distribution from any modern model, it opens up questions about what constitutes an LLM's "identity." Modern LLMs, regardless of provider, all share a common ancestor in web data, whether through direct training or through distillation and synthetic-data pipelines. Talkie breaks that lineage completely, giving researchers a tool for separating behaviors and capabilities shared across language models from artifacts of modern web training.
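The in-context Python experiment from use case 1 can be sketched as a few-shot prompt: solved demonstrations followed by an unsolved HumanEval-style task. The demonstration pairs and prompt format below are illustrative assumptions, not the team's actual evaluation harness.

```python
# Sketch: build a few-shot prompt that shows a 1930-era model what Python
# looks like, then append an unsolved task for it to complete.
# The demonstration pairs are illustrative; the real harness is not public.

FEW_SHOT_DEMOS = [
    ('def add(a, b):\n    """Return the sum of a and b."""',
     "    return a + b"),
    ('def is_even(n):\n    """Return True if n is even."""',
     "    return n % 2 == 0"),
]

def build_prompt(task_signature: str) -> str:
    """Concatenate solved demonstrations, then the unsolved task last."""
    parts = [sig + "\n" + body for sig, body in FEW_SHOT_DEMOS]
    parts.append(task_signature)  # the model must complete this one
    return "\n\n".join(parts)

prompt = build_prompt('def square(x):\n    """Return x squared."""')
print(prompt.count("def"))  # three function signatures in the prompt
```

Generated completions would then be executed against the benchmark's unit tests, so the model is scored on functional correctness rather than string match.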
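The bits-per-byte metric from use case 2 is a byte-normalized surprisal, which makes scores comparable across tokenizers. A minimal helper, assuming the framework reports the summed negative log-likelihood in nats (as most LM toolkits do):

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into
    bits per byte of the original UTF-8 text."""
    return total_nll_nats / (num_bytes * math.log(2))

# Toy numbers: a 120-byte event description with a summed NLL of 180 nats.
print(round(bits_per_byte(180.0, 120), 3))  # 2.164
```

A post-1930 event description should yield a higher bits-per-byte score than a pre-cutoff one, which is exactly the gap the team measured.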
The Training Pipeline: What Makes This Difficult
Training a vintage language model is not as simple as filtering a modern dataset by date. The research team ran into several non-trivial engineering challenges.
Temporal leakage is the most serious. If any post-1930 text enters the training corpus, whether through misdated documents or older texts with anachronistic editorial introductions, the historical fidelity of the model is compromised. An earlier 7B version of Talkie was clearly aware of the Roosevelt presidency and New Deal legislation, evidence of incomplete filtering. The team built an n-gram-based anachronism classifier to sift documents out of the corpus, but they admit the filtering is still imperfect: the 13B version retains some awareness of World War II and the postwar period.
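An n-gram anachronism filter of this kind might look like the following minimal sketch. The blocklist here is a tiny illustrative sample; the team's actual classifier has not been published.

```python
# Sketch of an n-gram anachronism filter: flag documents containing word
# n-grams strongly associated with post-1930 topics. A real classifier
# would learn its n-gram weights from dated reference corpora.

POST_1930_NGRAMS = {
    ("new", "deal"),
    ("world", "war", "ii"),
    ("united", "nations"),
    ("atomic", "bomb"),
}

def is_anachronistic(text: str) -> bool:
    """Return True if the text contains any blocklisted word n-gram."""
    words = text.lower().split()
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) in POST_1930_NGRAMS:
                return True
    return False

print(is_anachronistic("Roosevelt signed the New Deal legislation"))   # True
print(is_anachronistic("The aeroplane crossed the Atlantic in 1927"))  # False
```

A hard blocklist like this catches obvious leaks but misses subtler anachronisms (post-1930 facts phrased in period vocabulary), which is consistent with the residual World War II awareness the team reports.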
Data quality is another major obstacle. Because nothing was published digitally in 1930, every token in Talkie's training corpus had to be extracted from physical sources via optical character recognition (OCR). In a controlled experiment, the team found that training on text transcribed by off-the-shelf OCR systems yielded only 30 percent of the learning efficiency of a model trained on human-transcribed versions of the same text. A simple regex cleanup improved that to 70 percent, but a significant gap remained. To close it, they built a dedicated vintage OCR pipeline tuned to the structure of historical documents.
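To give a sense of what a "simple regex cleanup" can involve, here are three rules commonly applied to OCR'd historical text. The specific rules are assumptions on my part, since the team's pipeline has not been published.

```python
import re

# Sketch of a regex-based cleanup pass for OCR'd pre-1931 text.
# Three common fixes; a production pipeline would have many more rules.

def clean_ocr(text: str) -> str:
    # Re-join words hyphenated across line breaks: "ac-\ncount" -> "account"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # The long s (ſ), common in older typefaces, is often misread; map it to s.
    text = text.replace("ſ", "s")
    # Collapse runs of spaces/tabs left by column layouts into single spaces.
    text = re.sub(r"[ \t]+", " ", text)
    return text

print(clean_ocr("The intereſt on this ac-\ncount was  conſiderable"))
# -> "The interest on this account was considerable"
```

Rules like these repair systematic scanner errors, but they cannot recover words the OCR mangled beyond recognition, which is why a learned correction model was still needed to close the remaining gap.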
Vintage post-training: the instruction-tuning phase required an entirely new pipeline built from scratch. Using contemporary instruction-response pairs would bake modern expectations into the model's behavior. Instead, the team mined instruction-response pairs from systematically structured historical texts: etiquette manuals, writing guides, cookbooks, dictionaries, encyclopedias, and collections of poetry and fiction. They then ran online direct preference optimization (DPO) with Claude Sonnet 4.6 as judge, raising Talkie's instruction-following rating from 2.0 to 3.4 on a five-point scale. A final round of supervised fine-tuning used rejection-sampled synthetic dialogues generated between Claude Opus 4.6 and Talkie.
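For readers unfamiliar with DPO, the standard sigmoid objective for a single judge-labeled preference pair can be written in a few lines. This is a scalar sketch of the published DPO loss, not the team's training code; `beta` is the usual KL-penalty temperature.

```python
import math

# Sigmoid DPO loss for one preference pair, using summed token log-probs.
# logp_* come from the policy being trained; ref_logp_* from the frozen
# reference model. "Chosen" is the reply the judge (here, Claude) preferred.

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy already favors the chosen reply more than the reference
# does, the margin is positive and the loss falls below log(2).
print(dpo_loss(-10.0, -14.0, -12.0, -13.0) < math.log(2))  # True
```

Minimizing this pushes the policy to increase its relative preference for judge-approved replies while staying anchored to the reference model, which is what lets a frontier judge steer instruction-following without injecting modern text into the corpus.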
Benchmarks: How Does a 1930s Model Stack Up?
To provide meaningful context, the research team trained a "modern twin," the same 13B architecture trained on contemporary web data (FineWeb), and compared it to Talkie. Unsurprisingly, Talkie underperforms its modern counterpart on standard LM benchmarks. However, when controlling for question anachronism, filtering out questions that reference concepts that would not have existed in 1930, the performance gap is nearly halved. The team notes encouraging parity on language-comprehension and computation tasks, and attributes the remaining gap primarily to OCR noise and differences in subject distribution.
Key Takeaways
- Talkie is an open-weight 13B "vintage language model" trained on 260 billion tokens of pre-1931 English text alone, making it the largest known vintage LM, with a hard cutoff date of December 31, 1930.
- Benchmark contamination is eliminated by design. Because Talkie has never seen modern data, it serves as an exceptionally clean testbed for generalization, including whether a model with no exposure to digital computing can learn to write Python code from in-context examples alone.
- Building a vintage LM is harder than filtering by date. The research team had to solve temporal leakage (post-1930 data entering the corpus), OCR noise that cut training efficiency to just 30 percent of that on human-transcribed text, and the need to build a post-training pipeline entirely from pre-1931 sources such as manuals and encyclopedias.
- Two checkpoints are publicly available under Apache 2.0: talkie-1930-13b-base for raw completions and talkie-1930-13b-it for chat. Running them locally requires a CUDA GPU with at least 28 GB of VRAM.
- Bigger models are coming. The research team is targeting a GPT-3-scale vintage model in summer 2026, with a corpus they estimate could exceed a trillion tokens, potentially enough to match the capability of the original ChatGPT, frozen in 1930.
Check out the model weights, the repo, and the technical details.



