Tokenization: BPE, WordPiece, SentencePiece & Tiktoken

MLOps Series LLM Fundamentals

Why Tokenization Matters for Ops

Every large language model sees tokens, not characters or words. Tokenization is the first transformation applied to raw text before it ever reaches an embedding layer. For MLOps engineers, tokenization directly controls three operational levers: cost (API pricing is per token), latency (sequence length drives quadratic attention cost), and correctness (a mismatch between training and serving tokenizers silently degrades quality).

Ops insight: A production prompt that looks like 200 words can expand to 350+ tokens depending on the tokenizer and the content (code, JSON, and non-English text tokenize less efficiently than English prose). Token-budget miscalculations are a common cause of silent truncation in RAG pipelines.
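To make the cost and latency levers concrete, here is a rough back-of-the-envelope sketch. The price per 1,000 tokens is a made-up placeholder rather than a real rate, the token counts are illustrative, and the quadratic term models only self-attention, ignoring the rest of the forward pass.

```python
# Back-of-the-envelope token economics.
# PRICE_PER_1K_TOKENS is a hypothetical placeholder, not a real price.
PRICE_PER_1K_TOKENS = 0.01


def prompt_cost(n_tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Cost of sending n_tokens at a flat per-1k-token price."""
    return n_tokens / 1000 * price_per_1k


def attention_cost_ratio(n_tokens_a: int, n_tokens_b: int) -> float:
    """How much more self-attention work sequence B needs than sequence A,
    assuming attention cost grows with the square of the sequence length."""
    return (n_tokens_b / n_tokens_a) ** 2


# The same 200-word prompt under two illustrative token counts
lean, bloated = 230, 350
print(f"cost ratio:      {prompt_cost(bloated) / prompt_cost(lean):.2f}x")
print(f"attention ratio: {attention_cost_ratio(lean, bloated):.2f}x")
```

Cost grows linearly with the token count while attention work grows with its square, which is why a modest tokenizer inefficiency can show up more strongly in latency than on the bill.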

At a high level, a tokenizer maps a string to a sequence of integer IDs from a fixed vocabulary. The vocabulary is learned during a training phase on a large corpus and then frozen. At inference time the tokenizer is deterministic — the same string always produces the same token sequence.

The four dominant tokenization families in production LLMs today are Byte-Pair Encoding (BPE), WordPiece, SentencePiece (with its Unigram model), and Tiktoken (OpenAI's optimised BPE implementation). We will examine each in detail.

BPE Algorithm Walkthrough

Byte-Pair Encoding was originally a data-compression algorithm (Gage, 1994) and was adapted for subword tokenization by Sennrich et al. (2016). It is the foundation of GPT-2, GPT-3, GPT-4, LLaMA, and Claude tokenizers. The idea is elegantly simple: start with individual bytes (or characters) and iteratively merge the most frequent adjacent pair until the vocabulary reaches a target size.

Training Phase

1. Initialise vocabulary with all individual bytes (256 entries for byte-level BPE).
2. Count all adjacent pairs across the corpus.
3. Merge the most frequent pair into a new token and add it to the vocabulary.
4. Repeat steps 2–3 until vocabulary reaches the desired size (e.g., 50,257 for GPT-2, 100,277 for cl100k_base).

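# BPE training: simplified pseudocode
# (count_pairs and merge_pair are helper routines, not shown)
def train_bpe(corpus, vocab_size):
    # Start with byte-level tokens: one single-byte token per value 0..255.
    # Note bytes([i]), not bytes(i), which would create i zero bytes.
    vocab = {bytes([i]): i for i in range(256)}
    splits = [[bytes([b]) for b in word.encode("utf-8")] for word in corpus]

    while len(vocab) < vocab_size:
        pairs = count_pairs(splits)            # count adjacent pairs
        if not pairs:                          # nothing left to merge
            break
        best = max(pairs, key=pairs.get)       # most frequent pair
        splits = merge_pair(splits, best)      # merge everywhere
        vocab[best[0] + best[1]] = len(vocab)  # add merged token

    return vocab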

Encoding (Inference) Phase

Given a trained merge table, encoding replays the learned merges on new text. The input is split into bytes, then merges are applied in priority order (the order they were learned during training): at each step the highest-priority merge present anywhere in the sequence is applied, until none remains. This is deterministic and fast, roughly O(n × m) for a naive implementation, where n is the input length and m is the number of merges that apply.

Key property: every merged token BPE creates was observed as a contiguous byte sequence in the training corpus; only the 256 base bytes are included unconditionally. The merge ordering encodes frequency information: earlier merges correspond to more common subword units.
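The training and encoding phases can be sketched end to end in a few dozen lines. This is a minimal illustration on a toy word list, not how production tokenizers are implemented: the function names (train, encode, decode, merge) and the corpus are assumptions made for the example, and there is no regex pre-tokenization or special-token handling.

```python
from collections import Counter


def merge(tokens, pair, new_id):
    """Replace every adjacent occurrence of `pair` in `tokens` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train(words, num_merges):
    """Learn up to `num_merges` merges; returns {pair: new_id} in learned order."""
    seqs = [list(w.encode("utf-8")) for w in words]
    merges = {}
    for step in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        new_id = 256 + step                # ids 0..255 are the raw bytes
        seqs = [merge(s, best, new_id) for s in seqs]
        merges[best] = new_id
    return merges


def encode(text, merges):
    """Apply learned merges in priority order (earliest learned first)."""
    tokens = list(text.encode("utf-8"))
    while len(tokens) > 1:
        candidates = [p for p in zip(tokens, tokens[1:]) if p in merges]
        if not candidates:
            break
        best = min(candidates, key=merges.get)
        tokens = merge(tokens, best, merges[best])
    return tokens


def decode(tokens, merges):
    """Expand merged ids back into bytes, then into text."""
    parts = {new_id: pair for pair, new_id in merges.items()}

    def expand(t):
        return bytes([t]) if t < 256 else expand(parts[t][0]) + expand(parts[t][1])

    return b"".join(expand(t) for t in tokens).decode("utf-8")


merges = train(["low", "low", "lower", "lowest", "slow"], num_merges=2)
ids = encode("lowly", merges)
print(ids, decode(ids, merges))  # [257, 108, 121] lowly
```

Here the two learned merges are l+o and then lo+w, so the unseen word "lowly" is encoded as the merged token for "low" followed by the raw bytes for "l" and "y". No input can fall outside the vocabulary because the byte-level base always applies.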

WordPiece vs BPE

WordPiece (Schuster & Nakajima, 2012) is the tokenizer behind BERT, DistilBERT, and ELECTRA. It is structurally similar to BPE but differs in the merge criterion. Instead of picking the most frequent pair, WordPiece picks the pair that maximises the likelihood of the training corpus under a unigram language model:

# WordPiece merge scoring
# BPE criterion:  score = count(a, b)
# WordPiece criterion:
score(a, b) = count(ab) / (count(a) * count(b))

This means WordPiece favours merges that create subwords which are surprisingly common relative to their parts, not just globally frequent. In practice the resulting vocabularies are similar, but WordPiece tends to keep rarer morphological units intact.
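A toy comparison makes the difference between the two criteria visible. The corpus below is invented for illustration, and the scoring functions simply transcribe the two formulas; real trainers work on far larger counts and recompute them after every merge.

```python
from collections import Counter

# Toy corpus, already split into symbols (hypothetical data)
corpus = [
    ["t", "h", "e"], ["t", "h", "e"], ["t", "h", "e"],
    ["t", "o"], ["a", "t"],
    ["q", "u"], ["q", "u"],
]

unit_counts = Counter(sym for word in corpus for sym in word)
pair_counts = Counter(p for word in corpus for p in zip(word, word[1:]))


def bpe_score(pair):
    """BPE: raw frequency of the adjacent pair."""
    return pair_counts[pair]


def wordpiece_score(pair):
    """WordPiece: pair frequency normalised by the frequency of its parts."""
    a, b = pair
    return pair_counts[pair] / (unit_counts[a] * unit_counts[b])


bpe_choice = max(pair_counts, key=bpe_score)
wp_choice = max(pair_counts, key=wordpiece_score)
print("BPE merges first:      ", bpe_choice)  # ('t', 'h')
print("WordPiece merges first:", wp_choice)   # ('q', 'u')
```

BPE picks t+h because it is the most frequent pair. WordPiece picks q+u because q and u never occur apart in this corpus, so the pair is far more common than its parts would predict.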

Operational Differences

Prefix marker: WordPiece uses a ## prefix for continuation tokens (e.g., playing → play, ##ing), while GPT-2-style BPE attaches the preceding space to the token, rendered as Ġ (e.g., Ġplaying).
Unknown handling: WordPiece maps the whole word to [UNK] when it cannot be segmented into vocabulary pieces (for example, when it contains a character missing from the vocabulary); byte-level BPE never produces unknowns because every byte is in the base vocabulary.
Vocabulary size: BERT-base (uncased) uses 30,522 tokens; GPT-2 BPE uses 50,257; GPT-4 cl100k_base uses 100,277.
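The prefix and unknown-handling behaviour can be seen in a small sketch of greedy longest-match-first segmentation, the encoding strategy BERT-style WordPiece tokenizers use. The vocabulary and the function name are made up for the example, and the sketch handles one pre-split word at a time.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry ##
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical mini-vocabulary
vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece_encode("playing", vocab))     # ['play', '##ing']
print(wordpiece_encode("unplayable", vocab))  # ['un', '##play', '##able']
print(wordpiece_encode("playful", vocab))     # ['[UNK]']
```

Note that "playful" is lost entirely even though "play" is in the vocabulary, which is the practical cost of the [UNK] fallback compared with a byte-level base vocabulary.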

Watch out: Mixing tokenizers across model versions is a silent correctness bug. If you fine-tune BERT (WordPiece) but serve behind a pipeline that pre-tokenizes with BPE, every prediction will be garbage — and no error is raised.
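One way to turn this silent failure into a loud one is to fingerprint the vocabulary at training time and verify it at serving time. The sketch below is an assumption about how such a guard could look, with hypothetical function names. It hashes only the token-to-ID mapping, so it will not catch differences in merge rules or text normalisation. With HuggingFace tokenizers, the mapping could come from get_vocab().

```python
import hashlib
import json


def tokenizer_fingerprint(vocab: dict) -> str:
    """Stable SHA-256 over a token -> id mapping, independent of dict order."""
    canonical = json.dumps(sorted(vocab.items()), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_same_tokenizer(expected: str, serving_vocab: dict) -> None:
    """Raise at startup if the serving vocabulary differs from training."""
    actual = tokenizer_fingerprint(serving_vocab)
    if actual != expected:
        raise RuntimeError(
            f"tokenizer mismatch: expected {expected[:12]}, got {actual[:12]}"
        )


# At training time: store the fingerprint next to the model weights
train_vocab = {"[UNK]": 0, "play": 1, "##ing": 2}
fingerprint = tokenizer_fingerprint(train_vocab)

# At serving time: verify before accepting traffic
assert_same_tokenizer(fingerprint, {"play": 1, "##ing": 2, "[UNK]": 0})
```

Running the check once at service startup costs almost nothing and converts a quality regression into a deployment failure.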

SentencePiece & the Unigram Model

SentencePiece (Kudo & Richardson, 2018) is a language-agnostic tokenization library used by T5, ALBERT, XLNet, and LLaMA 1 and 2 (LLaMA 3 moved to a tiktoken-style BPE). Its key innovation is treating the input as a raw stream of Unicode characters (or bytes) without pre-tokenization, so no language-specific whitespace or punctuation rules are needed. The sentinel character ▁ (U+2581) stands in for whitespace and so marks word boundaries.
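Because whitespace is carried inside the tokens themselves, detokenization is just concatenation. The following sketch imitates that convention in plain Python. The helper names are hypothetical, the leading ▁ mirrors SentencePiece's default dummy-prefix behaviour, and the pieces are written by hand rather than produced by a trained model.

```python
WS = "\u2581"  # the ▁ meta symbol


def escape_whitespace(text: str) -> str:
    """Replace spaces with ▁ and add a leading ▁ for the first word."""
    return WS + text.replace(" ", WS)


def detokenize(pieces) -> str:
    """Concatenate pieces and turn ▁ back into spaces; no other rules needed."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text


print(escape_whitespace("Hello world."))                # ▁Hello▁world.
print(detokenize(["\u2581Hello", "\u2581wor", "ld", "."]))  # Hello world.
```

The round trip is lossless for single spaces without knowing anything about the language, which is what makes the approach work for scripts that do not separate words with spaces.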

SentencePiece supports two sub-algorithms: BPE and the Unigram Language Model. The Unigram approach works in the opposite direction to BPE:

Start large: initialise with a very large candidate vocabulary (often all substrings up to a length limit, seeded from the corpus).
Compute loss: for each candidate token, compute the unigram log-likelihood of the training corpus if that token were removed.
Prune: remove the tokens whose removal increases loss the least (i.e., they are the least useful), keeping a fixed percentage per iteration.
Repeat until the vocabulary reaches the target size.
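One pruning iteration of the steps above can be sketched as follows. This is a deliberately simplified illustration: it scores each word by its single best segmentation instead of summing over all segmentations, and it does not re-estimate token probabilities between rounds as a full Unigram trainer would. The vocabulary, probabilities, and function names are invented for the example.

```python
import math


def viterbi_score(text, logp):
    """Best total log-probability over all segmentations of `text`."""
    best = [0.0] + [-math.inf] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
    return best[-1]


def corpus_loss(corpus, logp):
    """Negative log-likelihood of the corpus under best-path segmentation."""
    return -sum(viterbi_score(word, logp) for word in corpus)


def prune_once(corpus, logp, keep_fraction=0.8):
    """Drop the multi-character tokens whose removal hurts the loss least."""
    base = corpus_loss(corpus, logp)
    removable = [t for t in logp if len(t) > 1]  # single characters always stay
    delta = {}
    for tok in removable:
        reduced = {t: lp for t, lp in logp.items() if t != tok}
        delta[tok] = corpus_loss(corpus, reduced) - base
    n_keep = math.ceil(len(removable) * keep_fraction)
    kept = set(sorted(removable, key=delta.get, reverse=True)[:n_keep])
    return {t: lp for t, lp in logp.items() if len(t) == 1 or t in kept}


# Hypothetical candidate vocabulary with made-up probabilities
corpus = ["low", "lower", "lowest"]
logp = {c: math.log(0.05) for c in "lowerst"}
logp.update({"lo": math.log(0.1), "low": math.log(0.2), "er": math.log(0.1),
             "est": math.log(0.1), "ow": math.log(0.02)})

pruned = prune_once(corpus, logp, keep_fraction=0.6)
print(sorted(t for t in pruned if len(t) > 1))  # ['er', 'est', 'low']
```

The tokens "lo" and "ow" never appear in any best segmentation, so removing them costs nothing and they are pruned first, while "low" is the most expensive token to lose.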

# Unigram LM tokenization (Viterbi decoding)
def tokenize_unigram(text, vocab, scores):
    # Find the segmentation that maximises sum of log-probs
    n = len(text)
    best_score = [-float("inf")] * (n + 1)
    best_score[0] = 0.0
    best_edge  = [None] * (n + 1)

    for end in range(1, n + 1):
        for start in range(end):
            sub = text[start:end]
            if sub in vocab:
                s = best_score[start] + scores[sub]
                if s > best_score[end]:
                    best_score[end] = s
                    best_edge[end] = start

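    # Fail loudly if no segmentation covers the whole string
    if n > 0 and best_edge[n] is None:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Back-track to recover tokens
    tokens, i = [], n
    while i > 0:
        tokens.append(text[best_edge[i]:i])
        i = best_edge[i]
    return tokens[::-1]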

Why Unigram matters: Unlike BPE, the Unigram model can sample multiple valid segmentations for the same string (subword regularisation). This acts as a data augmentation technique during training, improving robustness.
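Sampling a segmentation can be sketched with a forward pass that accumulates probability mass followed by a backward pass that samples one piece at a time. This is an illustrative implementation under simplifying assumptions: probabilities are kept in linear space (fine for short strings, prone to underflow on long ones), and the vocabulary is a toy. In the SentencePiece library itself, sampling is exposed through encode options such as enable_sampling, alpha, and nbest_size.

```python
import math
import random


def sample_segmentation(text, logp, rng):
    """Draw one segmentation with probability proportional to its unigram score."""
    n = len(text)
    # alpha[i] = total probability mass of all segmentations of text[:i]
    alpha = [1.0] + [0.0] * n
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp:
                alpha[end] += alpha[start] * math.exp(logp[piece])
    if alpha[n] == 0.0:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk backwards, sampling each final piece in proportion to its mass
    pieces, end = [], n
    while end > 0:
        starts = [s for s in range(end) if text[s:end] in logp and alpha[s] > 0]
        weights = [alpha[s] * math.exp(logp[text[s:end]]) for s in starts]
        start = rng.choices(starts, weights=weights)[0]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Toy vocabulary: "ab" can be one token or two
logp = {"a": math.log(0.5), "b": math.log(0.25), "ab": math.log(0.25)}
rng = random.Random(0)
seen = {tuple(sample_segmentation("ab", logp, rng)) for _ in range(200)}
print(seen)
```

With these probabilities the single-token segmentation is drawn about two thirds of the time and the two-token one about one third, so a model trained on sampled segmentations sees both forms of the same string.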

Tiktoken — OpenAI's Fast Tokenizer

Tiktoken is OpenAI's open-source tokenizer library, first released in late 2022. It implements byte-level BPE with a Rust core and Python bindings. OpenAI's README reports it as 3–6× faster than a comparable open-source tokenizer (the benchmark used HuggingFace's GPT2TokenizerFast, which is also Rust-backed, so the gain comes from the implementation rather than the language). It is the canonical tokenizer for GPT-3.5, GPT-4, and the embeddings API.

Encoding Names

gpt2 — 50,257 tokens (GPT-2)
r50k_base — 50,257 tokens (GPT-3 base models such as davinci; same vocabulary as gpt2)
p50k_base — 50,281 tokens (text-davinci-002, text-davinci-003, Codex models such as code-davinci-002)
cl100k_base — 100,277 tokens (GPT-3.5-turbo, GPT-4, text-embedding-ada-002)
o200k_base — 200,019 tokens (GPT-4o)

The jump from 50k to 100k tokens in cl100k_base improves coverage of non-English languages and code. Larger vocabularies compress text more (fewer tokens per sentence), reducing latency and cost at the expense of a larger embedding matrix.
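The embedding-matrix cost is simple arithmetic: one row of width d_model per vocabulary entry. The sketch below uses a hidden size of 4,096 purely as an assumed example value; it is not the published dimension of any of the models named here.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the input embedding, plus the output projection if untied."""
    tables = 1 if tied else 2
    return vocab_size * d_model * tables


D_MODEL = 4096  # hypothetical hidden size, chosen only for illustration

for name, vocab_size in [("gpt2", 50_257), ("cl100k_base", 100_277), ("o200k_base", 200_019)]:
    millions = embedding_params(vocab_size, D_MODEL) / 1e6
    print(f"{name:12s} {millions:7.1f}M embedding parameters")
```

At this width, doubling the vocabulary from roughly 100k to 200k adds about 400M parameters to a tied embedding table, which is significant for a small model and minor for a very large one.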

Ops implication: When migrating from GPT-3 (r50k_base) to GPT-4 (cl100k_base), the same English prompt typically uses fewer tokens, on the order of 15%, and the saving is often larger for code and non-English text. Always re-benchmark token budgets after model migration.
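A re-benchmark can be as simple as counting tokens for a fixed set of representative prompts under each encoding. The helper below is a sketch with hypothetical names. It takes the encoders as plain callables so it runs without downloading any vocabulary files; the two stand-in encoders are placeholders, and with tiktoken installed you would pass something like tiktoken.get_encoding("r50k_base").encode and tiktoken.get_encoding("cl100k_base").encode instead.

```python
from typing import Callable, Dict, List


def token_count_report(prompts: List[str],
                       encoders: Dict[str, Callable[[str], list]]) -> Dict[str, int]:
    """Total token count per encoder over a set of representative prompts."""
    return {name: sum(len(encode(p)) for p in prompts)
            for name, encode in encoders.items()}


def relative_change(report: Dict[str, int], old: str, new: str) -> float:
    """Fractional change in token count when moving from `old` to `new`."""
    return (report[new] - report[old]) / report[old]


# Stand-in encoders (characters vs. whitespace words) so the sketch is runnable
encoders = {"old": list, "new": str.split}
prompts = ["a b c", "dd ee"]

report = token_count_report(prompts, encoders)
print(report, f"{relative_change(report, 'old', 'new'):+.0%}")
```

Running this over a sample of real production prompts, rather than a single test string, gives a migration estimate that reflects your actual mix of prose, code, and structured data.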

Token Count Estimation & Context Length

Context length is the maximum number of tokens a model can process in a single forward pass. Exceeding it causes either truncation (silent data loss) or an API error. Accurate token estimation is critical for:

Prompt engineering: fitting system prompt + few-shot examples + user query + completion budget.
RAG pipelines: choosing how many retrieved chunks fit in context.
Cost forecasting: pre-computing spend before hitting the API.
Batching: packing multiple requests to maximise GPU utilisation.

Rules of Thumb

Content type	~Tokens per Word (cl100k_base)	~Chars per Token
English	~1.15	~4.3
Python code	~2.0	~2.8
Chinese	~2.5	~1.4
Japanese	~2.8	~1.2
JSON / structured data	~2.5	~2.0
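For coarse capacity planning, the characters-per-token column of the table above can be turned into a quick estimator. This is only a sketch: the ratios are the approximate figures from the table, the safety margin is an arbitrary assumed default, and exact budgeting should always use the real tokenizer.

```python
import math

# Approximate characters-per-token ratios copied from the table above
CHARS_PER_TOKEN = {
    "english": 4.3,
    "python": 2.8,
    "chinese": 1.4,
    "japanese": 1.2,
    "json": 2.0,
}


def estimate_tokens(text: str, kind: str = "english", safety_margin: float = 0.10) -> int:
    """Rough token estimate from character count, padded by a safety margin."""
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind] * (1 + safety_margin))


payload = '{"user": "alice", "roles": ["admin", "dev"]}'
print(estimate_tokens(payload, kind="json"))
print(estimate_tokens(payload, kind="english"))
```

The same string gets a much higher estimate when treated as JSON than as English prose, which is the gap a flat len(text) / 4 heuristic hides.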

Pitfall: Special tokens (<|im_start|>, <|im_end|>) and ChatML framing add overhead that len(text) / 4 heuristics miss. For GPT-4 chat completions, each message adds roughly 3–4 tokens of structural overhead, depending on the model snapshot.
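The per-message overhead can be folded into a chat-level counter. The sketch below is an approximation with assumed constants: tokens_per_message and reply_priming vary by model snapshot, and the encoder is passed in as a callable (for example enc.encode from tiktoken) so the example itself runs with a simple stand-in.

```python
from typing import Callable, Dict, List


def count_chat_tokens(messages: List[Dict[str, str]],
                      encode: Callable[[str], list],
                      tokens_per_message: int = 4,
                      reply_priming: int = 3) -> int:
    """Approximate prompt tokens for a chat request, including framing overhead.

    tokens_per_message and reply_priming are assumptions; check them against
    the usage numbers your API actually reports.
    """
    total = reply_priming  # tokens that prime the assistant's reply
    for msg in messages:
        total += tokens_per_message
        total += len(encode(msg["role"])) + len(encode(msg["content"]))
    return total


messages = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "hi there friend"},
]

# str.split is a stand-in encoder so the sketch runs without tiktoken
print(count_chat_tokens(messages, encode=str.split))  # 18
```

Comparing this estimate with the prompt token count returned in API responses is a quick way to calibrate the two constants for a given model.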

Context Windows by Model

Model	Context Length	Encoding
GPT-3.5-turbo	16,385	cl100k_base
GPT-4	8,192 / 32,768	cl100k_base
GPT-4-turbo	128,000	cl100k_base
GPT-4o	128,000	o200k_base
Claude 3.5 Sonnet	200,000	Proprietary BPE
LLaMA 3 (70B)	8,192	tiktoken-style BPE (128k vocabulary)
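A small guard built from the table above can reject oversized requests before they reach the model, instead of relying on truncation or an API error. The dictionary keys and the function name are illustrative choices, and the limits are the ones listed in the table.

```python
# Context limits from the table above; keys are illustrative labels
CONTEXT_WINDOW = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4": 8_192,
    "gpt-4-32k": 32_768,
    "gpt-4-turbo": 128_000,
    "gpt-4o": 128_000,
}


def check_fits(model: str, prompt_tokens: int, completion_budget: int) -> int:
    """Return the remaining headroom, or raise if the request cannot fit."""
    limit = CONTEXT_WINDOW[model]
    needed = prompt_tokens + completion_budget
    if needed > limit:
        raise ValueError(
            f"{model}: need {needed} tokens but the context window is {limit}"
        )
    return limit - needed


print(check_fits("gpt-4", prompt_tokens=7_000, completion_budget=512))  # 680
```

Raising here makes the failure visible at the call site, where the caller can still decide to drop chunks or switch to a model with a longer window.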

Python Code Examples

Tiktoken — Counting Tokens

import tiktoken

# Load the encoding used by GPT-4
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization is the first step in every LLM pipeline."
tokens = enc.encode(text)

print(f"Text:   {text}")
print(f"Tokens: {tokens}")
print(f"Count:  {len(tokens)}")
# Output:
# Text:   Tokenization is the first step in every LLM pipeline.
# Tokens: [3947, 2065, 374, 279, 1176, 3094, 304, 1475, 445, 11237, 15006, 13]
# Count:  12

Tiktoken — Decoding and Inspecting

# Decode individual tokens to see subwords
for tid in tokens:
    print(f"  {tid:6d} → {enc.decode([tid])!r}")

# Output:
#    3947 → 'Token'
#    2065 → 'ization'
#     374 → ' is'
#     279 → ' the'
#    1176 → ' first'
#    3094 → ' step'
#     304 → ' in'
#    1475 → ' every'
#     445 → ' L'
#   11237 → 'LM'
#   15006 → ' pipeline'
#      13 → '.'

Tiktoken — Model-Based Shortcut

# Get the correct encoding for a specific model
enc_4o = tiktoken.encoding_for_model("gpt-4o")  # → o200k_base
enc_4  = tiktoken.encoding_for_model("gpt-4")   # → cl100k_base

prompt = "Explain quantum computing in simple terms."
print(f"GPT-4o tokens: {len(enc_4o.encode(prompt))}")
print(f"GPT-4  tokens: {len(enc_4.encode(prompt))}")
# GPT-4o tokens: 7
# GPT-4  tokens: 8

SentencePiece — Training a Custom Tokenizer

import sentencepiece as spm

# Train a BPE tokenizer on a corpus file
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="my_tok",
    vocab_size=32000,
    model_type="bpe",            # or "unigram"
    byte_fallback=True,         # handle unseen chars via bytes
    character_coverage=0.9995,   # cover 99.95% of characters
)

# Load and use the trained model
sp = spm.SentencePieceProcessor(model_file="my_tok.model")

text = "Tokenization handles multilingual text well."
pieces = sp.encode(text, out_type=str)
ids    = sp.encode(text, out_type=int)

print(f"Pieces: {pieces}")
print(f"IDs:    {ids}")
# Example output (actual pieces depend on the training corpus):
# Pieces: ['▁Token', 'ization', '▁handles', '▁multi', 'lingual', '▁text', '▁well', '.']

Budget-Aware Prompt Assembly

import tiktoken

def assemble_prompt(system, user_msg, chunks, model="gpt-4", max_ctx=8192, reserve=512):
    """Pack as many RAG chunks as fit within the context budget."""
    enc = tiktoken.encoding_for_model(model)
    overhead = 4  # approximate per-message structural tokens
    sep = "\n\n"
    sep_tokens = len(enc.encode(sep))  # separators between chunks cost tokens too

    sys_tokens  = len(enc.encode(system)) + overhead
    user_tokens = len(enc.encode(user_msg)) + overhead
    budget = max_ctx - sys_tokens - user_tokens - reserve
    if budget < 0:
        raise ValueError("system and user messages alone exceed the context budget")

    selected = []
    used = 0
    for chunk in chunks:  # assumed to be ordered by relevance
        ct = len(enc.encode(chunk)) + (sep_tokens if selected else 0)
        if used + ct > budget:
            break
        selected.append(chunk)
        used += ct

    return {
        "system": system,
        "context": sep.join(selected),
        "user": user_msg,
        "tokens_used": sys_tokens + user_tokens + used,
        "tokens_remaining": budget - used,
    }

Production tip: Always encode with the exact tokenizer your model uses. Do not estimate with len(text.split()) or len(text) / 4 — these heuristics break for code, non-English text, and structured data.

Algorithm Comparison

BPE

Merge criterion: most frequent pair
Direction: bottom-up (merge)
Unknown tokens: none (byte-level)
Used by: GPT-2/3/4, Claude, LLaMA
Deterministic: yes
Speed: fast (O(n·m))

WordPiece

Merge criterion: max likelihood ratio
Direction: bottom-up (merge)
Unknown tokens: [UNK] fallback
Used by: BERT, DistilBERT, ELECTRA
Deterministic: yes
Prefix: ## for continuations

SentencePiece (Unigram)

Pruning criterion: minimise corpus loss
Direction: top-down (prune)
Unknown tokens: byte fallback
Used by: T5, ALBERT, XLNet, mBART
Deterministic: yes by default; can sample segmentations
Special: subword regularisation

Tiktoken

Algorithm: byte-level BPE
Implementation: Rust + Python bindings
Unknown tokens: none (byte-level)
Used by: GPT-3.5/4/4o, OpenAI APIs
Speed: reported 3–6× faster than a comparable open-source tokenizer
Vocab: up to 200k (o200k_base)

When to Choose What

Scenario	Recommended Tokenizer	Reason
Calling OpenAI APIs	Tiktoken	Exact match with API tokenisation; fast pre-counting
Fine-tuning BERT	WordPiece (via HF)	Must match BERT's pre-trained vocabulary
Training from scratch (multilingual)	SentencePiece (Unigram)	Language-agnostic; subword regularisation boosts robustness
Serving LLaMA 2 / Mistral 7B	SentencePiece (BPE)	Must use the tokenizer the model was trained with
Custom domain (legal, medical)	SentencePiece (BPE or Unigram)	Train on domain corpus for better compression of jargon

Interview Quick-Fire

Q: Why can't you just split on whitespace?
A: Whitespace splitting creates an open vocabulary (every new word is OOV). Subword tokenizers bound the vocabulary, handle morphology, and, in the byte-level BPE case, never produce unknowns. They also compress text far better than character- or byte-level splitting, which is critical for fitting more context into limited windows.

Q: What happens if you change the tokenizer but keep the same model weights?
A: Every embedding and output projection weight is keyed by token ID. A different tokenizer maps strings to different IDs, so the model receives nonsensical embeddings. Quality drops to random chance — and no error is raised, making this a dangerous silent failure.

Q: How does vocabulary size affect performance?
A: Larger vocabularies mean fewer tokens per input (better compression, lower latency, lower cost) but a larger embedding matrix (more parameters, more memory). The sweet spot is empirical — GPT-4 uses 100k, GPT-4o uses 200k, LLaMA 2 uses 32k. Multilingual models need larger vocabularies.