Tokenization and Byte Pair Encoding
Large language models do not read words the way humans do. Before any text reaches an LLM, it must be converted into numerical tokens. Byte Pair Encoding (BPE) is the algorithm that makes this conversion both efficient and flexible.
Why LLMs Need Tokenization
Neural networks operate on numbers, not text. Your words need to be translated into a sequence of integers before a model can process them. Assigning a number to every word creates an enormous vocabulary where rare words appear too infrequently to learn well. Tokenizing at the character level keeps the vocabulary tiny but makes sequences extremely long.
What you need is something in between: a vocabulary that captures common words as single units but can represent any unseen word by breaking it into smaller pieces. That is what Byte Pair Encoding does.
What Byte Pair Encoding Is
BPE is a subword tokenization algorithm originally developed for data compression in the 1990s. The core idea: start with individual characters, then iteratively merge the most frequent pairs of adjacent tokens into new, larger tokens until you reach a target vocabulary size.
The result is a vocabulary where common words like “the” become single tokens, while rarer words get split into subword pieces. “Tokenization” might become ["token", "ization"], and an extremely rare word might break down to individual characters. Every possible input can be represented.
The Algorithm
BPE requires a training corpus and a target vocabulary size. The vocabulary starts with every unique character plus special tokens like [START] and [END].
From this base, the algorithm builds larger tokens through repeated merging:
- Count how often each pair of adjacent tokens appears in the corpus
- Find the most frequent pair
- Merge that pair into a new token and add it to the vocabulary
- Replace all occurrences of that pair in the corpus
- Repeat until the vocabulary reaches the target size
The BPE merge loop: find the most frequent pair, merge it, and repeat until the vocabulary target is reached.
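The merge loop above can be sketched in a few dozen lines of Python. This is a minimal illustration, not a production tokenizer: real implementations update pair counts incrementally rather than recounting the whole corpus, and they handle word boundaries and special tokens more carefully. Here `_` marks word boundaries, matching the worked example below, and special tokens are omitted for brevity:

```python
from collections import Counter

def train_bpe(corpus, target_vocab_size):
    # Start from individual characters, with "_" as the word boundary marker.
    tokens = list(corpus.replace(" ", "_"))
    vocab = set(tokens)  # special tokens like [START]/[END] omitted for brevity
    merges = []
    while len(vocab) < target_vocab_size:
        # 1. Count how often each pair of adjacent tokens appears.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # 2. Find the most frequent pair.
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        # 3. Merge the pair into a new token and add it to the vocabulary.
        vocab.add(a + b)
        merges.append((a, b))
        # 4. Replace all occurrences of the pair in the corpus.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        # 5. Repeat until the vocabulary reaches the target size.
    return vocab, merges, tokens
```

Running this on the toy corpus from the next section, `train_bpe("the cat sat on the mat", 13)`, reproduces the three merges walked through below.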
A Worked Example
Suppose our training corpus is “the cat sat on the mat.” We split everything into characters (with _ as a word boundary marker) and assign each unique character a token ID:
a=1, c=2, e=3, h=4, m=5, n=6, o=7, s=8, t=9, _=10
Each character becomes its own token. Here is the full sequence:

t h e _ c a t _ s a t _ o n _ t h e _ m a t

or, as token IDs: 9 4 3 10 2 1 9 10 8 1 9 10 7 6 10 9 4 3 10 5 1 9. That is 22 tokens.
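The pair counts that drive the first merge are easy to verify with Python's `collections.Counter` (a quick sketch, using `_` as the word boundary marker from above):

```python
from collections import Counter

corpus = "the cat sat on the mat"
tokens = list(corpus.replace(" ", "_"))  # 22 character tokens

# Count every pair of adjacent tokens in the sequence.
pair_counts = Counter(zip(tokens, tokens[1:]))
print(pair_counts.most_common(3))
```

The pair `("a", "t")` comes out on top with a count of 3, which is exactly the first merge performed below.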
Iteration 1. The pair a t appears 3 times (in “cat,” “sat,” “mat”). Merge it into a new token at (ID 11): t h e _ c at _ s at _ o n _ t h e _ m at (19 tokens).
Iteration 2. The pair t h appears 2 times. Merge into th (ID 12): th e _ c at _ s at _ o n _ th e _ m at (17 tokens).
Iteration 3. The pair th e appears 2 times. Merge into the (ID 13): the _ c at _ s at _ o n _ the _ m at (15 tokens).
Three iterations reduced the sequence from 22 tokens to 15. The merging continues until the target vocabulary size is reached. After enough iterations, common subsequences and full words become single tokens, and the encoded sequence gets shorter with each merge.
Interactive: BPE on a Poem
Step through the full algorithm below. Watch characters get counted, then see pairs merge one by one:
Consider every letter, one by one,
The parser seeks the pairs that can be done.
For every single char under the sun,
The counter clicks until the counting's done.
The highest count will mount and then will merge,
To purge the surge of symbols on the verge.
A newer token starts to now emerge,
To urge the data forward with a surge.
The space is placed within a smaller state,
To create a rate that we appreciate.
We update vocab to accommodate,
The data plate we need to translate.
Why BPE Works Well
Common words become single tokens. Words like “the” and “and” merge early, keeping sequences short.
Rare words decompose gracefully. An unseen word does not become [UNK]. It splits into familiar subword pieces: “unhappiness” might become ["un", "happiness"]. In the worst case, a word falls back to individual characters, so every input can still be represented.
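That graceful decomposition falls out of how encoding works: new text is tokenized by replaying the learned merge rules in training order, so any input reduces to known subwords or, at worst, single characters. A toy sketch, using the three merges from the worked example above:

```python
def apply_merges(text, merges):
    # Encode new text by replaying the learned merges in training order.
    tokens = list(text.replace(" ", "_"))
    for a, b in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# The three merges learned from "the cat sat on the mat":
merges = [("a", "t"), ("t", "h"), ("th", "e")]
print(apply_merges("that hat", merges))  # unseen words, familiar pieces
```

Even though “that” and “hat” never appeared in the training corpus, they encode as `["th", "at", "_", "h", "at"]` using only tokens the vocabulary already contains.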
Vocabulary stays manageable. BPE vocabularies typically range from 30,000 to 100,000 tokens, keeping the model’s embedding layer reasonable.
Morphological patterns emerge naturally. Frequency-driven merging discovers meaningful units like “ing,” “tion,” and “ed” on its own. This is also why BPE works across different languages: frequent character combinations in any script tend to get merged, even without explicit linguistic rules.
Most modern LLMs use some variant of BPE. GPT-2 through GPT-4 use BPE-based tokenizers, as does LLaMA (via SentencePiece). The training corpus determines how effective a tokenizer is for a given language: a tokenizer trained mostly on English will have more English character pairs merged into single tokens, compressing English text efficiently, while the same tokenizer may need significantly more tokens for non-English text, which directly impacts both cost and available context window. If you want to see it in action, the Hugging Face Tokenizer Playground lets you type any text and see how different models split it into tokens.
- Jan Willem