WordPiece
Subword tokenization algorithm developed by Google that maximizes training corpus likelihood.
WordPiece is Google's subword tokenizer, best known from BERT; unlike BPE, which merges the most frequent pair, it picks the merge that most increases training-corpus likelihood.
Explanation
WordPiece selects the merge that maximizes the likelihood of the training corpus under a unigram language model: each candidate pair is scored as count(ab) / (count(a) × count(b)), so a pair wins when it occurs often relative to how often its parts appear on their own. At inference time, words are segmented greedily, longest match first. BERT uses WordPiece with a "##" prefix to mark subword continuations, e.g. "playing" → "play", "##ing".
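To make the inference-time behavior concrete, here is a minimal Python sketch of greedy longest-match-first segmentation against a toy vocabulary; the wordpiece_tokenize helper and the vocabulary are hypothetical illustrations, not a real library API.

    def wordpiece_tokenize(word, vocab, unk="[UNK]"):
        """Greedy longest-match-first segmentation, as WordPiece does at inference."""
        tokens, start = [], 0
        while start < len(word):
            end, piece = len(word), None
            while start < end:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # continuation pieces carry the ## prefix
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                return [unk]  # no vocabulary entry matches: fall back to the unknown token
            tokens.append(piece)
            start = end
        return tokens

    vocab = {"play", "##ing", "token", "##iza", "##tion"}
    print(wordpiece_tokenize("playing", vocab))       # ['play', '##ing']
    print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##iza', '##tion']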
Marketing Relevance
WordPiece is the tokenizer behind BERT and many other Google NLP models, including the BERT-based systems Google uses to understand search queries.
Common Pitfalls
The "##" prefix can be confusing in text generation. Not as widely used as BPE in modern LLMs.
Origin & History
Google originally developed WordPiece for Japanese and Korean voice search (Schuster & Nakajima, 2012). It was adapted for BERT (2018) and became the standard tokenizer for the BERT family.
Comparisons & Differences
WordPiece vs. BPE
BPE greedily merges the most frequent symbol pair; WordPiece normalizes that count by the frequencies of the pair's parts, picking the merge that most increases corpus likelihood. BPE dominates in GPT-style models, WordPiece in the BERT family.
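A toy sketch of this difference in merge selection, using simple pair and symbol counts over a character-level corpus; the best_merge helper and the corpus are illustrative, not taken from either algorithm's reference implementation.

    from collections import Counter

    def best_merge(corpus, mode="wordpiece"):
        """Pick the next pair to merge from a list of symbol sequences.

        mode="bpe":       highest raw pair count
        mode="wordpiece": highest count(ab) / (count(a) * count(b)),
                          i.e. the merge that most increases corpus likelihood
        """
        unigrams, pairs = Counter(), Counter()
        for seq in corpus:
            unigrams.update(seq)
            pairs.update(zip(seq, seq[1:]))
        if mode == "bpe":
            return max(pairs, key=pairs.get)
        return max(pairs, key=lambda p: pairs[p] / (unigrams[p[0]] * unigrams[p[1]]))

    corpus = [list("hugging"), list("hug"), list("pun")]
    print(best_merge(corpus, "bpe"))        # ('h', 'u'): highest raw count
    print(best_merge(corpus, "wordpiece"))  # ('i', 'n'): rare parts, so highest normalized score

Note how the two rules diverge on the same data: BPE picks a common pair outright, while WordPiece promotes a pair whose parts are individually rare.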
WordPiece vs. Unigram
Unigram starts with a large candidate vocabulary and iteratively prunes the tokens whose removal hurts corpus likelihood least; WordPiece builds a vocabulary bottom-up through merges. Unigram is the default algorithm in the SentencePiece library, which also implements BPE.