BPE (Byte Pair Encoding)
Subword tokenization algorithm that builds a vocabulary by iteratively merging the most frequent symbol pairs.
BPE creates a subword vocabulary by greedily merging frequent character (or byte) pairs; it is the basis for GPT tokenizers (tiktoken) and most modern LLMs.
Explanation
BPE starts from individual characters (in modern byte-level variants, from raw bytes) and repeatedly merges the most frequent adjacent pair into a new vocabulary entry. For example, "low", "lower", and "lowest" all share the subword "low", which quickly becomes a single token. GPT models use a byte-level BPE via tiktoken. A minimal sketch of the merge loop is shown below.
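The following is a toy sketch of BPE training in Python, not a production implementation: it omits the usual end-of-word marker and byte-level handling, and the names learn_bpe and num_merges are illustrative only.

from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a whitespace-split corpus (toy sketch)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # greedy: pick the most frequent pair
        merges.append(best)

        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe("low low low lower lowest", num_merges=4))

On this tiny corpus the learned merges come out roughly as ('l', 'o'), ('lo', 'w'), ('low', 'e'), ..., so "low" emerges as a reusable subword after only a few steps.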
Marketing Relevance
BPE is the tokenization standard for GPT models. Because LLM context limits and API pricing are counted in tokens, the tokenizer directly determines how much text fits into a prompt and what a call costs.
Common Pitfalls
Vocabulary size must be chosen as a hyperparameter. Greedy merging doesn't always find the optimal split. Not all languages benefit equally: morphologically rich or non-Latin-script languages are often split into more tokens than English.
Origin & History
BPE originally comes from data compression (Gage, 1994). Sennrich et al. (2016) adapted it for neural machine translation. OpenAI has used BPE-based tokenizers for all GPT models, and tiktoken (2022) provides a fast open-source implementation of that BPE tokenization.
Comparisons & Differences
BPE (Byte Pair Encoding) vs. WordPiece
BPE merges the most frequent pair; WordPiece picks the merge that most increases the likelihood of the training corpus. BPE is used by GPT, WordPiece by BERT. The example below shows the difference on a concrete word.
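A quick way to see the practical difference is to tokenize the same word with a GPT-style BPE vocabulary and BERT's WordPiece vocabulary. This sketch assumes the tiktoken and transformers packages are installed; the exact splits depend on the vocabulary, so the outputs in the comments are only indicative.

import tiktoken
from transformers import AutoTokenizer

bpe = tiktoken.get_encoding("cl100k_base")            # GPT-3.5/GPT-4 BPE vocabulary
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # BERT WordPiece

text = "tokenization"
print([bpe.decode([t]) for t in bpe.encode(text)])    # BPE pieces, e.g. ['token', 'ization']
print(wordpiece.tokenize(text))                       # WordPiece pieces, e.g. ['token', '##ization']

Note the "##" prefix: WordPiece marks word-internal continuations explicitly, whereas BPE pieces carry no such marker.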
BPE (Byte Pair Encoding) vs. SentencePiece
SentencePiece is a tokenization framework that works directly on raw text and can use either BPE or the Unigram model as its underlying algorithm; BPE itself is one specific algorithm.
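As an illustration, the SentencePiece trainer exposes the algorithm as a parameter, so switching between BPE and Unigram is a one-line change. The file name corpus.txt and the vocabulary size below are placeholders.

import sentencepiece as spm

# Train two tokenizers on the same (placeholder) corpus; only the algorithm differs.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",              # placeholder path to a plain-text training file
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )

# Load the BPE model and segment a word into subword pieces.
sp = spm.SentencePieceProcessor(model_file="tok_bpe.model")
print(sp.encode("lowest", out_type=str))  # e.g. ['▁low', 'est'], depending on the corpus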