Tokenization: BPE, WordPiece, SentencePiece & Tiktoken
Why Tokenization Matters for Ops
Every large language model sees tokens, not characters or words. Tokenization is the first transformation applied to raw text before it ever reaches an embedding layer. For MLOps engineers, tokenization directly controls three operational levers: cost (API pricing is per token), latency (sequence length drives quadratic attention cost), and correctness (a mismatch between training and serving tokenizers silently degrades quality).
At a high level, a tokenizer maps a string to a sequence of integer IDs from a fixed vocabulary. The vocabulary is learned during a training phase on a large corpus and then frozen. At inference time the tokenizer is deterministic — the same string always produces the same token sequence.
The four dominant tokenization families in production LLMs today are Byte-Pair Encoding (BPE), WordPiece, SentencePiece (with its Unigram model), and Tiktoken (OpenAI's optimised BPE implementation). We will examine each in detail.
BPE Algorithm Walkthrough
Byte-Pair Encoding was originally a data-compression algorithm (Gage, 1994) and was adapted for subword tokenization by Sennrich et al. (2016). It is the foundation of GPT-2, GPT-3, GPT-4, LLaMA, and Claude tokenizers. The idea is elegantly simple: start with individual bytes (or characters) and iteratively merge the most frequent adjacent pair until the vocabulary reaches a target size.
Training Phase
1. Initialise the vocabulary with all individual bytes (256 entries for byte-level BPE).
2. Count all adjacent pairs across the corpus.
3. Merge the most frequent pair into a new token and add it to the vocabulary.
4. Repeat steps 2–3 until the vocabulary reaches the desired size (e.g., 50,257 for GPT-2, 100,277 for cl100k_base).
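    # BPE training: simplified sketch (count_pairs and merge_pair are helpers, not shown)
    def train_bpe(corpus, vocab_size):
        # Start with the 256 single-byte tokens; bytes([i]) is the one-byte string for i
        vocab = {bytes([i]): i for i in range(256)}
        # Represent each word as a list of single-byte tokens
        splits = [[bytes([b]) for b in word.encode("utf-8")] for word in corpus]
        while len(vocab) < vocab_size:
            pairs = count_pairs(splits)            # count adjacent pairs
            if not pairs:                          # nothing left to merge
                break
            best = max(pairs, key=pairs.get)       # most frequent pair
            splits = merge_pair(splits, best)      # merge everywhere
            vocab[best[0] + best[1]] = len(vocab)  # add merged token
        return vocab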
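The two helpers that the training loop calls are left undefined above. A minimal sketch of what they might look like follows, assuming each word is stored as a list of `bytes` tokens; the three-word corpus is a made-up illustration, not data from any real tokenizer.

```python
from collections import Counter

def count_pairs(splits):
    """Count adjacent token pairs across all words."""
    pairs = Counter()
    for word in splits:
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += 1
    return pairs

def merge_pair(splits, pair):
    """Replace every occurrence of `pair` with the concatenated token."""
    merged = pair[0] + pair[1]
    out = []
    for word in splits:
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(merged)   # consume both halves of the pair
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        out.append(new_word)
    return out

# Toy corpus (hypothetical): one merge step
splits = [[bytes([b]) for b in w.encode("utf-8")] for w in ["low", "lower", "lowest"]]
pairs = count_pairs(splits)
best = max(pairs, key=pairs.get)   # ties resolve to the first pair seen
splits = merge_pair(splits, best)
print(best, splits[0])
```

Ties between equally frequent pairs are broken here by first occurrence; real implementations pick their own tie-breaking rule, and that rule affects the final vocabulary.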
Encoding (Inference) Phase
Given a trained merge table, encoding splits the input into bytes and then applies merges in priority order (the order in which they were learned during training): at each step the highest-priority pair present in the sequence is merged, until no learned pair remains. The process is deterministic, and a naive implementation costs roughly O(n × m), where n is the input length and m is the number of merges that apply.
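The encoding procedure can be sketched in a few lines. This is a naive illustration rather than how any production library implements it; the merge table and the IDs 256 to 258 are hand-written assumptions for the example.

```python
def bpe_encode(text, merges, vocab):
    """Encode text by applying learned merges in rank order.

    merges: list of (left, right) byte pairs, in the order they were learned.
    vocab:  dict mapping byte tokens to integer IDs.
    """
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while len(tokens) > 1:
        # Find the adjacent pair with the lowest rank (earliest learned);
        # among equal ranks the leftmost occurrence wins.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in rank
        ]
        if not candidates:
            break
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return [vocab[t] for t in tokens]

# Hypothetical merge table: base bytes plus three learned merges
merges = [(b"l", b"o"), (b"lo", b"w"), (b"e", b"r")]
vocab = {bytes([i]): i for i in range(256)}
for left, right in merges:
    vocab[left + right] = len(vocab)

print(bpe_encode("lower", merges, vocab))  # → [257, 258]
print(bpe_encode("slow", merges, vocab))   # → [115, 257]
```

Because every single byte is in the base vocabulary, the final lookup never fails, which is why byte-level BPE has no unknown token.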
WordPiece vs BPE
WordPiece (Schuster & Nakajima, 2012) is the tokenizer behind BERT, DistilBERT, and ELECTRA. It is structurally similar to BPE but differs in the merge criterion. Instead of picking the most frequent pair, WordPiece picks the pair that maximises the likelihood of the training corpus under a unigram language model:

    # Merge scoring for a candidate pair (a, b)
    # BPE criterion:       score(a, b) = count(ab)
    # WordPiece criterion: score(a, b) = count(ab) / (count(a) * count(b))
This means WordPiece favours merges that create subwords which are surprisingly common relative to their parts, not just globally frequent. In practice the resulting vocabularies are similar, but WordPiece tends to keep rarer morphological units intact.
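To make the difference concrete, the sketch below scores two candidate pairs under both criteria. The counts are invented for illustration: "th" is frequent but built from very common letters, while "qu" is rare but "q" almost never appears without "u".

```python
def bpe_score(pair, pair_counts, unit_counts):
    """BPE: raw frequency of the pair."""
    return pair_counts[pair]

def wordpiece_score(pair, pair_counts, unit_counts):
    """WordPiece: pair frequency normalised by the frequency of its parts."""
    a, b = pair
    return pair_counts[pair] / (unit_counts[a] * unit_counts[b])

# Hypothetical corpus statistics
unit_counts = {"t": 1000, "h": 800, "q": 20, "u": 300}
pair_counts = {("t", "h"): 500, ("q", "u"): 19}

bpe_best = max(pair_counts, key=lambda p: bpe_score(p, pair_counts, unit_counts))
wp_best = max(pair_counts, key=lambda p: wordpiece_score(p, pair_counts, unit_counts))

print("BPE merges first:      ", bpe_best)  # → ('t', 'h')
print("WordPiece merges first:", wp_best)   # → ('q', 'u')
```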
Operational Differences
- Prefix marker: WordPiece uses a ## prefix for continuation tokens (e.g., playing → play, ##ing), while BPE uses space-prefixed tokens (e.g., Ġplaying in GPT-2).
- Unknown handling: WordPiece falls back to [UNK] if a character is not in the vocabulary; byte-level BPE never produces unknowns because every byte is in the base vocabulary.
- Vocabulary size: BERT uses 30,522 tokens; GPT-2 BPE uses 50,257; GPT-4 cl100k_base uses 100,277.
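The continuation prefix and the [UNK] fallback both show up in a greedy longest-match-first encoder, which is the strategy BERT-style WordPiece tokenizers are generally described as using at inference time. The following is a sketch under that assumption, with a tiny made-up vocabulary.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1                   # try a shorter candidate
        if match is None:
            return [unk]               # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary
vocab = {"play", "##ing", "##ed", "un", "##play"}

print(wordpiece_encode("playing", vocab))  # → ['play', '##ing']
print(wordpiece_encode("unplayed", vocab)) # → ['un', '##play', '##ed']
print(wordpiece_encode("playz", vocab))    # → ['[UNK]']
```

Note that one unmatched character turns the whole word into [UNK], which is the behaviour byte-level BPE avoids.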
SentencePiece & the Unigram Model
SentencePiece (Kudo & Richardson, 2018) is a language-agnostic tokenization library used by T5, ALBERT, XLNet, and LLaMA 1 and 2. Its key innovation is treating the input as a raw stream of Unicode characters (or bytes) without pre-tokenization, so it needs no language-specific whitespace or punctuation rules. The sentinel character ▁ (U+2581) marks word boundaries.
SentencePiece supports two sub-algorithms: BPE and the Unigram Language Model. The Unigram approach works in the opposite direction to BPE:
- Start large: initialise with a very large candidate vocabulary (often all substrings up to a length limit, seeded from the corpus).
- Compute loss: for each candidate token, compute the unigram log-likelihood of the training corpus if that token were removed.
- Prune: remove the tokens whose removal increases loss the least (i.e., they are the least useful), keeping a fixed percentage per iteration.
- Repeat until the vocabulary reaches the target size.
    # Unigram LM tokenization (Viterbi decoding)
    def tokenize_unigram(text, vocab, scores):
        # Find the segmentation that maximises the sum of log-probs
        n = len(text)
        best_score = [-float("inf")] * (n + 1)
        best_score[0] = 0.0
        best_edge = [None] * (n + 1)
        for end in range(1, n + 1):
            for start in range(end):
                sub = text[start:end]
                if sub in vocab:
                    s = best_score[start] + scores[sub]
                    if s > best_score[end]:
                        best_score[end] = s
                        best_edge[end] = start
        # No path reaches the end if some character is missing from the vocabulary
        if n > 0 and best_edge[n] is None:
            raise ValueError("text cannot be segmented with this vocabulary")
        # Back-track to recover tokens
        tokens, i = [], n
        while i > 0:
            tokens.append(text[best_edge[i]:i])
            i = best_edge[i]
        return tokens[::-1]
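The pruning step from the training procedure can be illustrated with the same Viterbi idea. The sketch below is a deliberate simplification: it scores the corpus with the single best segmentation and does not re-estimate probabilities after a removal, whereas the actual Unigram training procedure uses EM over all segmentations. The log-probabilities are invented for the example.

```python
import math

def best_log_prob(text, scores):
    """Highest total log-prob over all segmentations of text (Viterbi)."""
    n = len(text)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            sub = text[start:end]
            if sub in scores and best[start] + scores[sub] > best[end]:
                best[end] = best[start] + scores[sub]
    return best[n]

def removal_losses(corpus, scores, protected):
    """Drop in corpus log-likelihood caused by removing each candidate token."""
    base = sum(best_log_prob(w, scores) for w in corpus)
    losses = {}
    for tok in scores:
        if tok in protected:   # keep single characters so every word stays segmentable
            continue
        reduced = {t: s for t, s in scores.items() if t != tok}
        losses[tok] = base - sum(best_log_prob(w, reduced) for w in corpus)
    return losses

# Hypothetical candidate vocabulary with log-probabilities
scores = {"l": -3.0, "o": -3.0, "w": -3.0, "e": -3.0, "r": -3.0,
          "low": -2.0, "er": -2.5, "lo": -4.0}
corpus = ["low", "lower"]
protected = {t for t in scores if len(t) == 1}

losses = removal_losses(corpus, scores, protected)
print(losses)                                  # → {'low': 10.0, 'er': 3.5, 'lo': 0.0}
print("prune first:", min(losses, key=losses.get))  # → prune first: lo
```

The token "lo" is never part of a best segmentation, so removing it costs nothing and it is the first to go.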
Tiktoken — OpenAI's Fast Tokenizer
Tiktoken is OpenAI's open-source tokenizer library, first released in late 2022. It implements byte-level BPE with a Rust core and Python bindings; its README reports 3–6× higher throughput than a comparable open-source tokenizer (benchmarked against a HuggingFace fast GPT-2 tokenizer). It is the canonical tokenizer for GPT-3.5, GPT-4, and the embeddings API.
Encoding Names
- gpt2: 50,257 tokens (GPT-2)
- r50k_base: 50,257 tokens (original GPT-3 models such as davinci and text-davinci-001)
- p50k_base: 50,281 tokens (text-davinci-002, text-davinci-003, Codex models such as code-davinci-002)
- cl100k_base: 100,277 tokens (GPT-3.5-turbo, GPT-4, text-embedding-ada-002)
- o200k_base: 200,019 tokens (GPT-4o)
The jump from 50k to 100k tokens in cl100k_base brought better coverage of non-English languages and code. Larger vocabularies compress text more (fewer tokens per sentence), reducing latency and cost at the expense of a larger embedding matrix.
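The embedding-matrix side of that trade-off is simple arithmetic: one row of width d_model per token. A quick sketch follows, where the hidden size of 4096 is a hypothetical value chosen for illustration and not the configuration of any OpenAI model.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the input embedding (plus the output projection if untied)."""
    matrices = 1 if tied else 2
    return vocab_size * d_model * matrices

D_MODEL = 4096  # hypothetical hidden size

for name, vocab_size in [("r50k_base", 50_257),
                         ("cl100k_base", 100_277),
                         ("o200k_base", 200_019)]:
    params = embedding_params(vocab_size, D_MODEL)
    print(f"{name:12s} {params / 1e6:7.1f}M embedding parameters")
```

Doubling the vocabulary doubles this term, so the compression gain has to be weighed against the extra memory, especially for small models where embeddings are a large share of the total.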
Token Count Estimation & Context Length
Context length is the maximum number of tokens a model can process in a single forward pass. Exceeding it causes either truncation (silent data loss) or an API error. Accurate token estimation is critical for:
- Prompt engineering: fitting system prompt + few-shot examples + user query + completion budget.
- RAG pipelines: choosing how many retrieved chunks fit in context.
- Cost forecasting: pre-computing spend before hitting the API.
- Batching: packing multiple requests to maximise GPU utilisation.
Rules of Thumb
| Language | ~Tokens per Word (cl100k) | ~Chars per Token |
|---|---|---|
| English | ~1.15 | ~4.3 |
| Python code | ~2.0 | ~2.8 |
| Chinese | ~2.5 | ~1.4 |
| Japanese | ~2.8 | ~1.2 |
| JSON / structured data | ~2.5 | ~2.0 |
Special tokens (<|im_start|>, <|im_sep|>) and ChatML framing add overhead that len(text) / 4 heuristics miss. For GPT-4 chat completions, each message adds roughly 3 to 4 tokens of structural overhead, and a few more tokens prime the assistant's reply.
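A chat-level estimate therefore needs to add the framing on top of the raw text counts. The sketch below takes the token counter as a parameter so it works with any tokenizer; the defaults of 4 tokens per message and 3 tokens of reply priming are assumptions that vary by model, and the whitespace counter in the demo is only a stand-in so the example runs without extra dependencies.

```python
def estimate_chat_tokens(messages, count_tokens, per_message=4, reply_priming=3):
    """Estimate total prompt tokens for a list of chat messages.

    count_tokens:  callable str -> int (e.g. lambda s: len(enc.encode(s))
                   for a tiktoken encoding `enc`).
    per_message:   assumed structural overhead per message.
    reply_priming: assumed tokens that open the assistant's reply.
    """
    total = reply_priming
    for msg in messages:
        total += per_message
        total += count_tokens(msg["role"]) + count_tokens(msg["content"])
    return total

# Stand-in counter for the demo only; use a real tokenizer in practice
def fake_counter(text):
    return len(text.split())

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
]
print(estimate_chat_tokens(messages, fake_counter))  # → 17
```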
Context Windows by Model
| Model | Context Length | Encoding |
|---|---|---|
| GPT-3.5-turbo | 16,385 | cl100k_base |
| GPT-4 | 8,192 / 32,768 | cl100k_base |
| GPT-4-turbo | 128,000 | cl100k_base |
| GPT-4o | 128,000 | o200k_base |
| Claude 3.5 Sonnet | 200,000 | Proprietary BPE |
| LLaMA 3 (70B) | 8,192 | tiktoken-style BPE (128k vocab) |
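A pre-flight check against these limits is cheap and avoids both silent truncation and API errors. The sketch below hard-codes a few limits from the table; the dictionary keys and the function name are illustrative choices, not part of any SDK.

```python
# Context lengths taken from the table above (tokens)
CONTEXT_LENGTH = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4": 8_192,
    "gpt-4-turbo": 128_000,
    "gpt-4o": 128_000,
}

def check_fits(model, prompt_tokens, max_completion_tokens):
    """Return (fits, headroom) for a planned request.

    headroom is negative when the request would exceed the context window.
    """
    limit = CONTEXT_LENGTH[model]
    needed = prompt_tokens + max_completion_tokens
    return needed <= limit, limit - needed

print(check_fits("gpt-4", 7_000, 1_000))   # → (True, 192)
print(check_fits("gpt-4", 7_500, 1_000))   # → (False, -308)
```

The completion budget counts against the same window as the prompt, so it has to be reserved up front.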
Python Code Examples
Tiktoken — Counting Tokens
    import tiktoken

    # Load the encoding used by GPT-4
    enc = tiktoken.get_encoding("cl100k_base")

    text = "Tokenization is the first step in every LLM pipeline."
    tokens = enc.encode(text)

    print(f"Text: {text}")
    print(f"Tokens: {tokens}")
    print(f"Count: {len(tokens)}")

    # Output:
    # Text: Tokenization is the first step in every LLM pipeline.
    # Tokens: [3947, 2065, 374, 279, 1176, 3094, 304, 1475, 445, 11237, 15006, 13]
    # Count: 12
Tiktoken — Decoding and Inspecting
    # Decode individual tokens to see subwords
    for tid in tokens:
        print(f" {tid:6d} → {enc.decode([tid])!r}")

    # Output:
    #   3947 → 'Token'
    #   2065 → 'ization'
    #    374 → ' is'
    #    279 → ' the'
    #   1176 → ' first'
    #   3094 → ' step'
    #    304 → ' in'
    #   1475 → ' every'
    #    445 → ' L'
    #  11237 → 'LM'
    #  15006 → ' pipeline'
    #     13 → '.'
Tiktoken — Model-Based Shortcut
    # Get the correct encoding for a specific model
    enc_4o = tiktoken.encoding_for_model("gpt-4o")  # → o200k_base
    enc_4 = tiktoken.encoding_for_model("gpt-4")    # → cl100k_base

    prompt = "Explain quantum computing in simple terms."
    print(f"GPT-4o tokens: {len(enc_4o.encode(prompt))}")
    print(f"GPT-4 tokens: {len(enc_4.encode(prompt))}")

    # GPT-4o tokens: 7
    # GPT-4 tokens: 8
SentencePiece — Training a Custom Tokenizer
    import sentencepiece as spm

    # Train a BPE tokenizer on a corpus file
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix="my_tok",
        vocab_size=32000,
        model_type="bpe",            # or "unigram"
        byte_fallback=True,          # handle unseen chars via bytes
        character_coverage=0.9995,   # cover 99.95% of characters
    )

    # Load and use the trained model
    sp = spm.SentencePieceProcessor(model_file="my_tok.model")

    text = "Tokenization handles multilingual text well."
    pieces = sp.encode(text, out_type=str)
    ids = sp.encode(text, out_type=int)

    print(f"Pieces: {pieces}")
    print(f"IDs: {ids}")

    # Illustrative output; the actual pieces depend on the training corpus:
    # Pieces: ['▁Token', 'ization', '▁handles', '▁multi', 'lingual', '▁text', '▁well', '.']
Budget-Aware Prompt Assembly
    import tiktoken

    def assemble_prompt(system, user_msg, chunks, model="gpt-4",
                        max_ctx=8192, reserve=512):
        """Pack as many RAG chunks as fit within the context budget."""
        enc = tiktoken.encoding_for_model(model)
        overhead = 4  # per-message structural tokens

        sys_tokens = len(enc.encode(system)) + overhead
        user_tokens = len(enc.encode(user_msg)) + overhead
        budget = max_ctx - sys_tokens - user_tokens - reserve

        selected = []
        used = 0
        for chunk in chunks:
            ct = len(enc.encode(chunk))
            if used + ct > budget:
                break
            selected.append(chunk)
            used += ct

        return {
            "system": system,
            "context": "\n\n".join(selected),
            "user": user_msg,
            "tokens_used": sys_tokens + user_tokens + used,
            "tokens_remaining": budget - used,
        }
Avoid estimating token counts with len(text.split()) or len(text) / 4: these heuristics break for code, non-English text, and structured data.
Algorithm Comparison
BPE
- Merge criterion: most frequent pair
- Direction: bottom-up (merge)
- Unknown tokens: none (byte-level)
- Used by: GPT-2/3/4, Claude, LLaMA
- Deterministic: yes
- Speed: fast (O(n·m))
WordPiece
- Merge criterion: max likelihood ratio
- Direction: bottom-up (merge)
- Unknown tokens: [UNK] fallback
- Used by: BERT, DistilBERT, ELECTRA
- Deterministic: yes
- Prefix: ## for continuations
SentencePiece (Unigram)
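- Pruning criterion: remove tokens whose loss increases corpus likelihood the least
- Direction: top-down (prune)
- Unknown tokens: byte fallback when enabled, otherwise an unknown token
- Used by: T5, ALBERT, XLNet, mBART
- Deterministic: yes by default; can optionally sample segmentations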
- Special: subword regularisation
Tiktoken
- Algorithm: byte-level BPE
- Implementation: Rust + Python bindings
- Unknown tokens: none (byte-level)
- Used by: GPT-3.5/4/4o, OpenAI APIs
- Speed: reported 3–6× faster than a comparable HuggingFace tokenizer
- Vocab: up to 200k (o200k_base)
When to Choose What
| Scenario | Recommended Tokenizer | Reason |
|---|---|---|
| Calling OpenAI APIs | Tiktoken | Exact match with API tokenisation; fast pre-counting |
| Fine-tuning BERT | WordPiece (via HF) | Must match BERT's pre-trained vocabulary |
| Training from scratch (multilingual) | SentencePiece (Unigram) | Language-agnostic; subword regularisation boosts robustness |
| Serving LLaMA 2 / Mistral 7B | SentencePiece (BPE) | Must use the tokenizer the model was trained with (LLaMA 3 moved to a tiktoken-style BPE) |
| Custom domain (legal, medical) | SentencePiece (BPE or Unigram) | Train on domain corpus for better compression of jargon |
Interview Quick-Fire
Q: Why use subword tokenization instead of splitting on whitespace?
A: Whitespace splitting creates an open vocabulary (every new word is OOV). Subword tokenizers bound the vocabulary and handle morphology, and byte-level variants never produce unknowns. They also yield far shorter sequences than character- or byte-level tokenization, which is critical for fitting more context into limited windows.
Q: What happens if the serving tokenizer differs from the one used in training?
A: Every embedding and output projection weight is keyed by token ID. A different tokenizer maps strings to different IDs, so the model receives nonsensical embeddings. Quality can collapse to near-random output, and as long as the IDs stay in range no error is raised, making this a dangerous silent failure.
Q: What is the trade-off when choosing a vocabulary size?
A: Larger vocabularies mean fewer tokens per input (better compression, lower latency, lower cost) but a larger embedding matrix (more parameters, more memory). The sweet spot is empirical: GPT-4 uses 100k, GPT-4o uses 200k, LLaMA 2 uses 32k. Multilingual models need larger vocabularies.
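Because a training/serving tokenizer mismatch raises no error on its own, it helps to make the check explicit. One possible guard, sketched below, is to store a fingerprint of the vocabulary alongside the model artefacts and compare it at start-up. The function names and the token-to-ID dictionary format are assumptions for illustration; real tokenizers may need merges and special tokens included in the hash as well.

```python
import hashlib
import json

def vocab_fingerprint(vocab):
    """Stable SHA-256 over a token -> ID mapping (string keys, int values)."""
    canonical = json.dumps(sorted(vocab.items()), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def assert_same_tokenizer(expected_fingerprint, serving_vocab):
    """Fail loudly if the serving vocabulary differs from the training one."""
    actual = vocab_fingerprint(serving_vocab)
    if actual != expected_fingerprint:
        raise RuntimeError(
            f"Tokenizer mismatch: expected {expected_fingerprint[:12]}, "
            f"got {actual[:12]}"
        )

# Hypothetical vocabularies
train_vocab = {"▁Token": 0, "ization": 1, "▁is": 2}
serve_vocab = {"▁Token": 0, "ization": 2, "▁is": 1}   # same tokens, shuffled IDs

expected = vocab_fingerprint(train_vocab)
assert_same_tokenizer(expected, dict(train_vocab))     # passes silently

try:
    assert_same_tokenizer(expected, serve_vocab)
except RuntimeError as err:
    print(err)
```

Sorting the items before hashing keeps the fingerprint independent of dictionary insertion order, so only a real change in the token-to-ID mapping trips the check.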