Unigram Model (Tokenization)
A subword tokenization algorithm that starts with a large vocabulary and iteratively removes the least useful tokens.
The Unigram model works top-down: it begins with a large candidate vocabulary and iteratively prunes it. It is the standard algorithm in SentencePiece and underlies the tokenizers of T5, ALBERT, and XLNet.
Explanation
Unlike BPE (bottom-up), Unigram works top-down: it starts with a large set of candidate subwords, fits token probabilities to the corpus with EM, and repeatedly removes the tokens whose deletion causes the smallest loss in corpus likelihood, until the target vocabulary size is reached. At inference time, each word is segmented with the Viterbi algorithm into its most probable token sequence. SentencePiece uses Unigram as its default algorithm.
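To make this concrete, here is a minimal Python sketch of the Viterbi decoding step that Unigram uses at inference time. The toy vocabulary, its log probabilities, and the function name are illustrative placeholders, not SentencePiece's actual internals; training uses the same likelihood in reverse, pruning tokens whose removal least reduces the total corpus log-likelihood.

```python
import math

# Toy unigram vocabulary with log probabilities (illustrative values).
LOGP = {
    "un": math.log(0.10), "i": math.log(0.05), "gram": math.log(0.08),
    "u": math.log(0.02), "n": math.log(0.02), "ig": math.log(0.01),
    "ram": math.log(0.03), "unigram": math.log(0.0001),
}
MAX_LEN = max(len(t) for t in LOGP)

def viterbi_segment(text):
    """Return the most probable segmentation of `text`, i.e. the token
    sequence maximizing the sum of token log probabilities."""
    # best[i] = (best score for text[:i], start index of the last token)
    best = [(-math.inf, -1)] * (len(text) + 1)
    best[0] = (0.0, 0)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - MAX_LEN), i):
            piece = text[j:i]
            if piece in LOGP:
                score = best[j][0] + LOGP[piece]
                if score > best[i][0]:
                    best[i] = (score, j)
    # Backtrack from the end to recover the token sequence.
    tokens, i = [], len(text)
    while i > 0:
        j = best[i][1]
        tokens.append(text[j:i])
        i = j
    return tokens[::-1], best[len(text)][0]

print(viterbi_segment("unigram"))  # (['un', 'i', 'gram'], ~-7.82)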
Marketing Relevance
Unigram is the default algorithm in SentencePiece and is used by T5, ALBERT, and XLNet.
Common Pitfalls
Less common than BPE, so library and tooling support is thinner. The initial candidate vocabulary must be chosen sensibly (e.g., from frequent substrings), because pruning can only remove tokens, never add them. Subword regularization samples segmentations probabilistically, so encoding is non-deterministic unless sampling is disabled at inference.
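The non-determinism is easy to reproduce with SentencePiece. A minimal sketch, assuming a plain-text training file `corpus.txt` exists; the file path and vocabulary size are placeholder choices:

```python
import sentencepiece as spm

# Train a small Unigram model ('corpus.txt' and vocab_size are placeholders).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="unigram_demo",
    vocab_size=2000,
    model_type="unigram",  # SentencePiece's default, stated explicitly
)

sp = spm.SentencePieceProcessor(model_file="unigram_demo.model")

# Deterministic Viterbi segmentation -- identical output on every call.
print(sp.encode("tokenization is fun", out_type=str))

# Subword regularization: sample from the n-best segmentations.
# With sampling enabled, repeated calls can return different splits.
for _ in range(3):
    print(sp.encode("tokenization is fun", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```

Sampling is off by default; keep it off whenever reproducible tokenization matters, e.g., for caching or evaluation.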
Origin & History
Taku Kudo (Google) published the Unigram model in 2018 alongside SentencePiece. It offers more theoretically grounded tokenization than BPE through likelihood optimization and probabilistic sampling (subword regularization).
Comparisons & Differences
Unigram Model (Tokenization) vs. BPE
BPE builds its vocabulary bottom-up by merging frequent symbol pairs; Unigram starts from a large candidate vocabulary and prunes the least useful tokens top-down.
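For a side-by-side feel of the two directions, the Hugging Face `tokenizers` library can train both on the same data. A sketch, again assuming a placeholder `corpus.txt` and an arbitrary vocabulary size:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, UnigramTrainer

def train(model, trainer):
    tok = Tokenizer(model)
    tok.pre_tokenizer = Whitespace()
    tok.train(files=["corpus.txt"], trainer=trainer)  # placeholder path
    return tok

# BPE: starts from characters, grows the vocabulary by merging pairs.
bpe = train(BPE(unk_token="[UNK]"),
            BpeTrainer(vocab_size=2000, special_tokens=["[UNK]"]))

# Unigram: starts from a large candidate set, prunes it down.
uni = train(Unigram(),
            UnigramTrainer(vocab_size=2000, special_tokens=["[UNK]"],
                           unk_token="[UNK]"))

text = "tokenization"
print("BPE:    ", bpe.encode(text).tokens)
print("Unigram:", uni.encode(text).tokens)
```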
Unigram Model (Tokenization) vs. WordPiece
WordPiece also selects merges by likelihood, but builds its vocabulary bottom-up like BPE; Unigram works top-down and additionally supports subword regularization (sampling among alternative segmentations).