Tokenization and Byte Pair Encoding
Large language models do not read words the way humans do. Before any text reaches an LLM, it must be converted into numerical tokens. Byte Pair Encoding (BPE) is the algorithm that makes this conversion both efficient and flexible.
Why LLMs Need Tokenization
Neural networks operate on numbers, not text. Your words need to be translated into a sequence of integers before a model can process them. Assigning a number to every word creates an enormous vocabulary where rare words appear too infrequently to learn well. Tokenizing at the character level keeps the vocabulary tiny but makes sequences extremely long.
What you need is something in between: a vocabulary that captures common words as single units but can represent any unseen word by breaking it into smaller pieces. That is what Byte Pair Encoding does.
What Byte Pair Encoding Is
BPE is a subword tokenization algorithm originally developed for data compression in the 1990s. The core idea: start with individual characters, then iteratively merge the most frequent pairs of adjacent tokens into new, larger tokens until you reach a target vocabulary size.
The result is a vocabulary where common words like “the” become single tokens, while rarer words get split into subword pieces. “Tokenization” might become ["token", "ization"], and an extremely rare word might break down to individual characters. Every possible input can be represented.
The Algorithm
BPE requires a training corpus and a target vocabulary size. The vocabulary starts with every unique character plus special tokens like [START] and [END].
From this base, the algorithm builds larger tokens through repeated merging:
- Count how often each pair of adjacent tokens appears in the corpus
- Find the most frequent pair
- Merge that pair into a new token and add it to the vocabulary
- Replace all occurrences of that pair in the corpus
- Repeat until the vocabulary reaches the target size
The BPE merge loop: find the most frequent pair, merge it, and repeat until the vocabulary target is reached.
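The merge loop above can be sketched in a few dozen lines of Python. This is a minimal illustration, not a production tokenizer: real implementations update pair counts incrementally rather than recounting the whole corpus, and they handle word boundaries and special tokens more carefully. Here `_` marks word boundaries, matching the worked example below, and special tokens are omitted for brevity:

```python
from collections import Counter

def train_bpe(corpus, target_vocab_size):
    # Start from individual characters, with "_" as the word boundary marker.
    tokens = list(corpus.replace(" ", "_"))
    vocab = set(tokens)  # special tokens like [START]/[END] omitted for brevity
    merges = []
    while len(vocab) < target_vocab_size:
        # 1. Count how often each pair of adjacent tokens appears.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # 2. Find the most frequent pair.
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        # 3. Merge the pair into a new token and add it to the vocabulary.
        vocab.add(a + b)
        merges.append((a, b))
        # 4. Replace all occurrences of the pair in the corpus.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        # 5. Repeat until the vocabulary reaches the target size.
    return vocab, merges, tokens
```

Running this on the toy corpus from the next section, `train_bpe("the cat sat on the mat", 13)`, reproduces the three merges walked through below.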
A Worked Example
Suppose our training corpus is “the cat sat on the mat.” We split everything into characters (with _ as a word boundary marker) and assign each unique character a token ID:
a=1, c=2, e=3, h=4, m=5, n=6, o=7, s=8, t=9, _=10
Each character becomes its own token. Here is the full sequence:

t h e _ c a t _ s a t _ o n _ t h e _ m a t

or, as token IDs: 9 4 3 10 2 1 9 10 8 1 9 10 7 6 10 9 4 3 10 5 1 9. That is 22 tokens.
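The pair counts that drive the first merge are easy to verify with Python's `collections.Counter` (a quick sketch, using `_` as the word boundary marker from above):

```python
from collections import Counter

corpus = "the cat sat on the mat"
tokens = list(corpus.replace(" ", "_"))  # 22 character tokens

# Count every pair of adjacent tokens in the sequence.
pair_counts = Counter(zip(tokens, tokens[1:]))
print(pair_counts.most_common(3))
```

The pair `("a", "t")` comes out on top with a count of 3, which is exactly the first merge performed below.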
Iteration 1. The pair a t appears 3 times (in “cat,” “sat,” “mat”). Merge it into a new token at (ID 11): t h e _ c at _ s at _ o n _ t h e _ m at (19 tokens).
Iteration 2. The pair t h appears 2 times. Merge into th (ID 12): th e _ c at _ s at _ o n _ th e _ m at (17 tokens).
Iteration 3. The pair th e appears 2 times. Merge into the (ID 13): the _ c at _ s at _ o n _ the _ m at (15 tokens).
Three iterations reduced the sequence from 22 tokens to 15. The merging continues until the target vocabulary size is reached. After enough iterations, common subsequences and full words become single tokens, and the encoded sequence gets shorter with each merge.
Interactive: BPE on a Poem
Step through the full algorithm below. Watch characters get counted, then see pairs merge one by one:
Consider every letter, one by one,
The parser seeks the pairs that can be done.
For every single char under the sun,
The counter clicks until the counting's done.
The highest count will mount and then will merge,
To purge the surge of symbols on the verge.
A newer token starts to now emerge,
To urge the data forward with a surge.
The space is placed within a smaller state,
To create a rate that we appreciate.
We update vocab to accommodate,
The data plate we need to translate.
Why BPE Works Well
Common words become single tokens. Words like “the” and “and” merge early, keeping sequences short.
Rare words decompose gracefully. An unseen word does not become [UNK]. It splits into familiar subword pieces: “unhappiness” might become ["un", "happiness"]. In the worst case, a word falls back to individual characters, so every input can still be represented.
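That graceful decomposition falls out of how encoding works: new text is tokenized by replaying the learned merge rules in training order, so any input reduces to known subwords or, at worst, single characters. A toy sketch, using the three merges from the worked example above:

```python
def apply_merges(text, merges):
    # Encode new text by replaying the learned merges in training order.
    tokens = list(text.replace(" ", "_"))
    for a, b in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# The three merges learned from "the cat sat on the mat":
merges = [("a", "t"), ("t", "h"), ("th", "e")]
print(apply_merges("that hat", merges))  # unseen words, familiar pieces
```

Even though “that” and “hat” never appeared in the training corpus, they encode as `["th", "at", "_", "h", "at"]` using only tokens the vocabulary already contains.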
Vocabulary stays manageable. BPE vocabularies typically range from 30,000 to 100,000 tokens, keeping the model’s embedding layer reasonable.
Morphological patterns emerge naturally. Frequency-driven merging discovers meaningful units like “ing,” “tion,” and “ed” on its own. This is also why BPE works across different languages: frequent character combinations in any script tend to get merged, even without explicit linguistic rules.
Most modern LLMs use some variant of BPE. GPT-2 through GPT-4 use BPE-based tokenizers, as does LLaMA (via SentencePiece). The training corpus determines how effective a tokenizer is for a given language: a tokenizer trained mostly on English will have more English character pairs merged into single tokens, compressing English text efficiently, while the same tokenizer may need significantly more tokens for non-English text, which directly impacts both cost and available context window. If you want to see it in action, the Hugging Face Tokenizer Playground lets you type any text and see how different models split it into tokens.
- Jan Willem