The Big Learning Set for Big Word Helpers
On November 12, 2012, Randall Munroe’s famous webcomic xkcd published Up Goer Five, a blueprint and explanation of the Saturn V rocket written using only the 1000 most common words of the English language (as he estimated them). Later, on November 24, 2015, came Thing Explainer, an entire illustrated book of similar explanations for other objects and concepts. The “only the most common 1000 words” style of writing sounds sometimes stilted, sometimes a bit funny, but these texts certainly prove that it’s enough to talk about virtually anything.
In the age of LLMs, would it be possible to have a training set built only on the most common 1000 words of the English language?
Let’s try.
Why?#
In short: because a simple training set would be an interesting Petri dish for training simple (and small!) language models.
Something like this has already been tried. In 2023 two researchers affiliated with Microsoft published TinyStories (HuggingFace repository here), a project in which they first developed a large synthetic training set of simple child-like stories with GPT-3.5 and GPT-4, then trained small transformers on it. And I do mean small; the smallest TinyStories model has merely 1M parameters, the largest has 33M. By comparison, most modern state-of-the-art LLMs have hundreds of billions if not trillions of parameters. Yet these models speak quite coherent English - the 1M model struggles a bit occasionally, but the others are fine. However, they’re only able to produce text that reads like this:
One day, Lily and her family went to the park. Lily saw a big slide and ran towards it. She wanted to go down the slide with her friends. But her mom said, "No, Lily. You have to stay with us."
Lily was sad, but she listened to her mom. Later that day, Lily and her family went to the beach. Lily saw a boy who had no toys. She wanted to play with him, but she remembered what her mom said. She decided to share her toys with the boy.
The boy was happy and said, "Thank you, Lily!" Lily smiled and felt happy too. She learned that it's important to share and be kind to others. From that day on, Lily always shared her toys with her friends.
(this is a story I just generated on the fly with TinyStories-33M and the HuggingFace Transformers library).
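For reference, here’s a minimal sketch of what that on-the-fly generation can look like with the Transformers pipeline API (assuming the roneneldan/TinyStories-33M checkpoint on the HuggingFace Hub; the prompt and sampling settings are just examples):

```python
# Minimal sketch: sample a short story from the TinyStories-33M checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="roneneldan/TinyStories-33M")
prompt = "One day, Lily and her family went to the park."
out = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.8)
print(out[0]["generated_text"])
```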
The TinyStories dataset doesn’t stick to a specific subset of English, as far as I know, though it clearly tends toward a limited vocabulary. The xkcd “1000 words” approach makes me curious because it allows a very explicit dictionary, which directly reduces the number of weights the model needs (fewer embeddings, for starters), and opens up the possibility of using words instead of tokens as the units of language (which, yes, means sacrificing flexibility, but for an experiment that can be acceptable, and should make things easier). And its ability to talk about anything, to be something like a “Turing complete” subset of the English language, so to speak, is essentially already proven. There are other similar subsets with like properties, such as Charles Kay Ogden’s Basic English from 1930, which uses only 850 words. But xkcd’s list benefits from modernity (for example it includes the word “computer”) and feels somewhat more expressive to me.
So my goal became: generate a synthetic dataset written entirely in the “Up-Goer Five” vocabulary and style, and eventually try training some models on it. I don’t expect to easily reach “proper” dataset sizes (even for small models, hundreds of millions or billions of tokens/words would be usual) given my compute budget, but at least I can have a proof of concept and a small starting point.
The rules#
First things first: Randall Munroe is a cheat.
(hey, I’m just playing, Randall, you know I love ya)
Ok, now to be a bit more specific and less inflammatory. The original xkcd comic mentions the 1000 most common English words. I used this repository as my source for the list (check it out, it’s got a lot of other useful word lists). I assume it was scraped from the xkcd Simple Writer, and it appears identical, for example, in this fanmade editor (credit to Theo Sanderson). However, the list is by no means complete. It goes without saying, for example, that if a noun is acceptable, so is its plural, and that doesn’t have to be included in the list… except for “child” and “children”, which both appear, possibly because the latter is an exception to the rule of simply appending an -s suffix. But if you try it in Simple Writer, “babies” is also accepted, despite being a non-straightforward plural with only “baby” appearing in the list. And yet “asses” (snicker) is not accepted, despite “ass” (snicker) being. So the rule is rather inconsistent. Similar issues appear with verb and adjective inflections.
But it doesn’t stop there; the title of the comic itself already contains an even more confusing violation. “Up-Goer Five” includes the word “goer”, and that’s not an accepted inflection of the verb “go”; it’s an action noun, something some verbs allow (“play” => “player”, “hunt” => “hunter”) but not a proper conjugation. And while more verbose, the word set absolutely allows circumlocutions like “thing that goes”, which makes action nouns pure syntactic sugar.
I didn’t care about sticking precisely to the original Simple Writer’s standards as much as I cared about having a consistent set of rules and allowed words. So I ended up settling on my own “house rules”. I used the lemminflect library for quick and dirty handling of inflections. Essentially, for any base word I allowed the following tags (I’m using the Penn Treebank tagset convention here, which is also what lemminflect uses):
- NNS: noun plural (e.g. “house” => “houses”)
- VBZ: verb, third person singular present (e.g. “go” => “goes”)
- VBD: verb, past tense (e.g. “go” => “went”)
- VBG: verb, gerund or present participle (e.g. “go” => “going”)
- VBN: verb, past participle (e.g. “go” => “gone”)
- JJR: adjective, comparative (e.g. “strong” => “stronger”)
- JJS: adjective, superlative (e.g. “strong” => “strongest”)
- RBR: adverb, comparative (e.g. “soon” => “sooner”)
- RBS: adverb, superlative (e.g. “soon” => “soonest”)
In addition to this, I defined and allowed a new non-standard category, ANN, for action nouns. This isn’t supported by lemminflect, so for now I simply allowed it on a handful of select verbs, which I realise is kind of arbitrary, but it felt like the best compromise between allowing some of that “syntactic sugar” I mentioned and keeping the total number of words from ballooning too much. Wherever inflected forms were already present in the original 1000 words they weren’t double counted (so “children” became simply the NNS form of “child”). This makes for a total of 2917 allowed words, including various contracted forms with apostrophes (like “you’ll” or “he’s”) which appeared as words in the original list and which I allowed to stay as such (though technically I guess they could be removed and replaced by their non-contracted forms - more syntactic sugar).

My hope is that in some future project I can bring the number back down to almost 1000; something I’d like to experiment with is “modifier tokens” that inflect the word that follows them. So instead of “went”, for example, you could have <VBD>go, which drastically reduces the vocabulary needed to describe a text to only base forms, modifiers, and punctuation. This obviously also limits what the model can do (no languages other than English, no code, no made-up words or names, not even a bit of flexibility for typos), but again, for a very limited experiment that’s not particularly important.
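To make this concrete, here’s a rough sketch of how the expansion could look with lemminflect. This is not the repository’s actual script: the file names are placeholders, the ANN forms are assumed to be added by hand elsewhere, and the loop naively applies every tag to every base word, so it over-generates compared to a proper per-part-of-speech expansion.

```python
# Rough sketch: expand a base word list into all allowed inflected forms.
# "words.txt" and "allowed_words.json" are placeholder file names.
import json
from lemminflect import getInflection

ALLOWED_TAGS = ["NNS", "VBZ", "VBD", "VBG", "VBN", "JJR", "JJS", "RBR", "RBS"]

def expand(base_words):
    allowed = set(base_words)
    for word in base_words:
        for tag in ALLOWED_TAGS:
            # getInflection returns a tuple of candidate forms (possibly empty).
            for form in getInflection(word, tag=tag):
                allowed.add(form.lower())
    return allowed

with open("words.txt") as f:
    base = [line.strip().lower() for line in f if line.strip()]

with open("allowed_words.json", "w") as f:
    json.dump(sorted(expand(base)), f, indent=2)
```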
On to generation.
Generation#
Here is the repository with the code for this part.
I created a few basic scripts to help me with the work. After generating the allowed word forms with lemminflect, I stored them in a JSON file and made some utility classes to read and use them. The main commands provided are:
- check: checks a given generated file and highlights any disallowed words (a rough sketch of its core follows below);
- editor: launches a terminal text editor that checks what you type live, and so allows human-written content (I made a couple of entries this way, mostly as examples for the agents);
- agent: launches an AI agent that lets you request generations on a certain topic;
- stats: prints statistics on all currently generated files.
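The core of the check command boils down to something like this (a simplified sketch, not the actual implementation; it assumes the allowed_words.json file from the earlier sketch and a crude apostrophe-aware regex tokenizer):

```python
# Simplified sketch of the checker's core logic.
import json
import re

with open("allowed_words.json") as f:
    ALLOWED = set(json.load(f))

WORD_RE = re.compile(r"[A-Za-z']+")

def find_disallowed(text: str) -> set[str]:
    """Return every word in `text` that is not in the allowed list."""
    return {w.lower() for w in WORD_RE.findall(text) if w.lower() not in ALLOWED}

def check_file(path: str) -> None:
    with open(path) as f:
        bad = find_disallowed(f.read())
    print("OK" if not bad else "Disallowed words: " + ", ".join(sorted(bad)))
```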
The editor#
This was created with Textual, which is a pretty neat library for quick Python terminal GUIs. I had AI customize the text editor widget a bit to get the highlighting I wanted, and it works perfectly well for quickly editing or creating new entries. It’s not a fast way of producing content, but it is a useful tool, especially since human-written content provides a good starting example for the agents.
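Here’s a toy version just to give a feel for how little Textual needs for something like this. It is not my actual editor: the allowed set below is a tiny stand-in, and there’s no in-text highlighting, just a status line that updates as you type.

```python
# Toy sketch of a live-checking editor in Textual (not the real one).
import re

from textual.app import App, ComposeResult
from textual.widgets import Static, TextArea

ALLOWED = {"the", "a", "and", "cat", "sat", "on", "mat"}  # tiny stand-in list

def find_disallowed(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in ALLOWED}

class ToyEditor(App):
    def compose(self) -> ComposeResult:
        yield TextArea()
        yield Static("All good!", id="status")

    def on_text_area_changed(self, event: TextArea.Changed) -> None:
        # Re-check the whole buffer on every change and report offenders.
        bad = find_disallowed(event.text_area.text)
        message = "Disallowed: " + ", ".join(sorted(bad)) if bad else "All good!"
        self.query_one("#status", Static).update(message)

if __name__ == "__main__":
    ToyEditor().run()
```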
The agent#
For this I used Pydantic AI, which I personally find amazing. The official website states:
We built Pydantic AI with one simple aim: to bring that FastAPI feeling to GenAI app and agent development.
and I agree, it definitely conveys that feel - it’s very simple and straightforward to build an agent with it. In this case it wasn’t a particularly hard task either. All I did was give the agent a single tool: a checker that lets it identify the disallowed words in its proposed text, along with a bit of context around them. In my experience even smart models have a very hard time producing compliant text one-shot, but with a checker it’s a different story, and they can simply iterate until they converge on a good answer.
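In spirit, the agent reduces to something like the sketch below. This is not the repository’s code: the model string, prompts and file name are placeholders, and the checker is the same idea as before, inlined.

```python
# Hedged sketch of the agent setup (placeholder model string and prompts;
# assumes the allowed_words.json file from the earlier sketch).
import json
import re

from pydantic_ai import Agent

with open("allowed_words.json") as f:
    ALLOWED = set(json.load(f))

agent = Agent(
    "openai:gpt-5.1",  # placeholder: any tool-capable model string should do
    system_prompt=(
        "Write about the requested topic using only the allowed words. "
        "Call check_words on each draft and revise until it reports no problems."
    ),
)

@agent.tool_plain
def check_words(draft: str) -> str:
    """Report any disallowed words found in the draft."""
    bad = {w for w in re.findall(r"[a-z']+", draft.lower()) if w not in ALLOWED}
    return "OK" if not bad else "Disallowed words: " + ", ".join(sorted(bad))

result = agent.run_sync("Explain what a washing machine is.")
print(result.output)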
Pydantic AI also allows connecting to a local Ollama server, and I did make use of that, albeit mostly for testing purposes. The best tool-enabled models I could find that run fully on my laptop are llama3.1:8b and qwen3:8b, but they still struggle - especially because increasing the context length to what this task needs (the prompt alone is over 7000 tokens) makes them slow and expensive to run.
Results#
So what does this end up looking like?
For smart, powerful frontier models, quite good. I had some files generated by Claude Sonnet 4.5 via the agent interface in VS Code, simply giving it access to the command-line check tool so it could spot mistakes and iterate. It did well, and enthusiastically kept creating new files until I stopped it. Here’s its rendition of the plot of The Lord of the Rings, aka “the ring story”:
There is a small round thing that can make the person who wears it disappear. But this round thing is very bad. It makes people want to keep it and never let anyone else have it. The longer someone has it, the more it changes them into something bad.
A small person who lives in a space under the ground finds the round thing. He keeps it for many years and it makes him live much longer than he should. But it also makes him sad and alone. One day he loses the round thing and another small person finds it.
The new small person does not know what the round thing can do at first. An old man who knows many things tells him that the round thing must be broken or bad things will happen to everyone. The only way to break it is to drop it into the fire where it was first made, far away.
The small person and his friends go on a long walk to get to the fire. Many bad things try to stop them. There are big animals that want to kill them. There are bad people who want to take the round thing for themselves. The small person has to be very strong to keep going.
The round thing keeps trying to make the small person want to keep it. It talks to him in his head. It makes him not trust his friends. It makes him tired and sad. But the small person keeps going because he knows he has to break the round thing.
When the small person finally gets to the fire, the round thing has made him change his mind. He does not want to drop it in. But his friend from before comes back and they fight. The friend falls into the fire with the round thing and they both burn.
After the round thing is gone, all the bad things that it made happen start to go away. The bad people lose their power. The small person and his friends can go home. They are all very tired but happy that they saved everyone.
(not sure if the first “small person” here is meant to be Bilbo, or if it’s meant to be Smeagol/Gollum, and Bilbo and Frodo were conflated into one; I suspect the latter given the description of the ending)
I then used the agent with GPT-5.1. The results are, again, pretty good! The title of this blog post comes from GPT-5.1’s own way to refer to large language models:
A big word helper is a kind of computer brain that works with words.
It reads a very, very big number of word lines from books, talks, stories, and many other written things. From these, it learns how people usually put words together. It does not remember one person words as that person. Instead, it looks at all the words as one huge crowd and learns usual ways in how words follow each other.
[...]
and training sets:
When people build thinking computers, they need to show them many, many pieces of something they care about. A “big learning set” is the big group of those pieces.
Think of a big learning set like a huge table. Each line in the table is one thing you want the computer to learn about. This one line is one piece. All the lines together are the set.
Each line has several parts. A part is one kind of fact about the thing. For a picture of a cat, one part could say “this is a cat.” Another part could hold numbers that tell how the picture looks. For a house, parts could say how big it is, how old it is, and how much money it takes to buy it.
[...]
These explanations are honestly excellent. They use interesting and appropriate turns of phrase to refer to things, they say a lot, and they flow naturally and smoothly despite the constraints. They also very rarely use disallowed words. It usually takes GPT-5.1 something like 20k to 30k tokens in total to produce one answer, with 3-4 calls to the word checker tool. Only once did I have to manually fix an answer by changing a single word.
By comparison, here is GPT-5-mini’s take on a washing machine:
People have a box that cleans shirts, pants, clothes and things for feet. You put not clean things into the box through the box doors. You add water and cleaner. A round part inside the box moves the things and the water. The moving rubs bad bits from the things. The box sends the not clean water out and brings clean water in. You set how many minutes the box will work and how the box will move. At the end the box moves very fast to send water out and make the things less wet. Some boxes also have a room to dry more. The box helps save time and hard work. Many homes, apartments and stores that clean clothes have this box.
It does stick to the word set and it’s nominally all correct, but it’s a bit more stilted and much shorter.
Local models do surprisingly well for what they are. But what they are just isn’t enough, sadly. Here’s Qwen 3 in its quantized 8-billion-parameter version - just 6 GB of VRAM with the default 4096 context length:
A man is a person. He works, cares for family, helps others, and makes choices. He builds homes, raises children, leads, follows, joins, shares, and makes. He thinks, feels, struggles, picks, stands, fights, grows, changes, and learns. A man is a person with many jobs and things. He can be kind, strong, or smart. He may stay with others or go far. He may love, fight, or help. A man is a person who lives, works, and changes.
This has two violations (“smart” and “choices”). And it’s still way better than its take on an apple, made with an 8192 context length in the hope that the extra room would allow for better work:
An apple is a round fruit with red or green skin. Inside, there is a core with seeds. Apples are sweet and juicy. You can eat them raw or cook them. People use apples to make pie, juice, or candy. Trees grow apples in autumn. You pick them from branches. Apples help you feel full and give energy. Some apples are small, others are big. You can find them in stores or orchards. Eating apples is healthy. They have vitamins and fiber. Apples are round and smooth. You can taste them fresh or dried. They are popular in many countries. People plant apple trees to grow more. Juice is a drink. Apples are good for snacks. They can be baked or fried. The tree has leaves and flowers in spring. Flowers turn into fruit. Apples are part of a healthy diet.
This has so many disallowed words it’s barely fixable. One of them is, of course, “apple” itself - a core challenge is learning to talk about a thing without ever naming it when its name isn’t part of the allowed set! I could try with, say, a 32k context, but at that point 50% of the model’s layers have to be loaded in RAM and processed by the CPU, and then it takes a long time for any answer to come back; besides, I doubt it would do much good against the intrinsic limitations of the model. I’d still recommend you try it - it’s amazing what it can do at such a limited size. But it’s not up to snuff for this.
Future plans?#
I will certainly keep generating; one of the first things I’d like to do is produce meaningful explanations for every key object and action that has a word in the 1000-word set, for additional self-consistency. I’ll also look into supporting other agents and providers, both to see how well they do and to leverage more free usage allowances if any are offered. It’s still hardly possible to generate hundreds of millions of words without paying for compute. As of now, as I said, GPT-5.1 (which does very well at the task) tends to take about 20k to 50k tokens for a single generation, depending on how hard it is, and the resulting text hardly ever reaches 1k tokens in length. It would also need to be supervised, because I once saw it enter some kind of loop and use over 100k tokens on a single attempt (I asked it to explain World War 2). So for now I will hold back on the obvious idea of going through each word in the set and asking for a definition of it (excluding articles, adverbs and pronouns, I guess).
At some point I also want to try training models on this. I’d like to start small, with classic old-style models like n-grams, and scale up; possibly I’ll impose some artificial constraints, like: what’s the best model I can make with only 100 parameters? 1,000? And so on. At some point I’ll still need to figure out compute, though, because my laptop will stop being enough. I’ll probably start moving some experiments to Google Colab and see where that goes.
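For a taste of what “start small” means, a word-level bigram model over the generated files is only a few lines (a sketch; the generated/*.txt path is a placeholder for wherever the dataset files live):

```python
# Tiny word-level bigram sketch: count word pairs, then sample a continuation.
import glob
import random
from collections import Counter, defaultdict

counts = defaultdict(Counter)
for path in glob.glob("generated/*.txt"):
    with open(path) as f:
        words = f.read().lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def sample(start: str = "the", length: int = 30) -> str:
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break
        choices, weights = zip(*followers.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)

print(sample())
```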