
    Unigram Model (Tokenization)

    Updated: 2/11/2026

    A subword tokenization algorithm that starts with a large vocabulary and iteratively removes the least useful tokens.

    Quick Summary

    The Unigram model builds its vocabulary top-down: it starts with a large candidate vocabulary and iteratively prunes it. It is the standard algorithm in SentencePiece and is used by T5, ALBERT, and XLNet.

    Explanation

    Unlike BPE, which builds its vocabulary bottom-up, Unigram works top-down: it starts with a large set of candidate tokens, treats each token as an independent unigram with its own probability, and iteratively removes the tokens whose removal causes the least loss in corpus likelihood. At tokenization time, a text is segmented into the token sequence with the highest probability under the model. SentencePiece uses Unigram as its default algorithm.
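    To make the segmentation step concrete, here is a minimal Python sketch assuming a hypothetical toy vocabulary with made-up, already-estimated token probabilities (SentencePiece estimates the real ones with EM over a corpus). Viterbi search finds the token sequence with the highest probability:

```python
import math

# Hypothetical toy vocabulary with made-up probabilities; in practice
# these are learned with EM over a large corpus.
vocab = {
    "h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
    "hu": 0.10, "ug": 0.15, "gs": 0.05, "hug": 0.30, "hugs": 0.20,
}
log_p = {tok: math.log(p) for tok, p in vocab.items()}

def segment(text):
    """Most probable segmentation of `text` via Viterbi search.

    best[i] is the best log-probability over all segmentations of
    text[:i]; back[i] stores where that segmentation's last token starts.
    Assumes `text` is segmentable with the vocabulary (guaranteed in
    practice because all single characters are kept as tokens).
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            tok = text[start:end]
            if tok in log_p and best[start] + log_p[tok] > best[end]:
                best[end] = best[start] + log_p[tok]
                back[end] = start
    tokens, i = [], n
    while i > 0:                      # walk back-pointers to recover tokens
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

print(segment("hugs"))  # ['hugs']: log p = -1.61, beats 'hug'+'s' at -4.20
```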

    Marketing Relevance

    Unigram is the default algorithm in SentencePiece and underlies the tokenizers of T5, ALBERT, and XLNet; anyone working with these models is working with Unigram-tokenized text, directly or indirectly.

    Common Pitfalls

    Unigram is less widely used than BPE. The initial seed vocabulary must be chosen sensibly, typically from frequent substrings of the training corpus. And probabilistic sampling (subword regularization) yields non-deterministic tokenizations unless it is disabled at inference time, as the example below shows.
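    The sampling pitfall is easy to see with the sentencepiece package itself. The sketch below trains a Unigram model and encodes the same string deterministically and with sampling; file names and hyperparameters are illustrative:

```python
# A minimal sketch using the sentencepiece package (pip install
# sentencepiece). File names and hyperparameters are illustrative.
import sentencepiece as spm

# Train a Unigram tokenizer; model_type="unigram" is SentencePiece's default.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # assumed: a plain-text file, one sentence per line
    model_prefix="unigram",  # writes unigram.model and unigram.vocab
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="unigram.model")

# Deterministic: always returns the single best (Viterbi) segmentation.
print(sp.encode("tokenization", out_type=str))

# Subword regularization: samples a segmentation from the lattice, so
# repeated calls may return different token sequences. Useful as
# training-time augmentation; disable it at inference time for
# reproducible results.
print(sp.encode("tokenization", out_type=str,
                enable_sampling=True, alpha=0.1, nbest_size=-1))
```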

    Origin & History

    Taku Kudo (Google) published the Unigram model in 2018 alongside SentencePiece. It offers more theoretically grounded tokenization than BPE through likelihood optimization and probabilistic sampling (subword regularization).

    Comparisons & Differences

    Unigram Model (Tokenization) vs. BPE

    BPE builds its vocabulary bottom-up by merging frequent pairs; Unigram prunes top-down, removing the least useful tokens, as sketched below.
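    To illustrate the top-down direction, the following sketch runs one pruning step on a toy corpus with made-up probabilities: for each multi-character token it measures how much corpus log-likelihood is lost if that token is removed, then identifies the cheapest removal. The real algorithm re-estimates probabilities with EM between pruning rounds and removes a fraction of the worst tokens at a time; this sketch holds probabilities fixed for brevity.

```python
import math

# Toy corpus and made-up, fixed token probabilities. Single characters
# are never pruned, so every string stays segmentable.
corpus = ["hugs", "hug", "bug", "bugs"]
log_p = {t: math.log(p) for t, p in {
    "h": 0.04, "u": 0.04, "g": 0.04, "s": 0.04, "b": 0.04,
    "ug": 0.15, "hug": 0.25, "bug": 0.25, "hugs": 0.10, "bugs": 0.05,
}.items()}

def best_logprob(text, lp):
    # Viterbi: best[i] = best log-probability of segmenting text[:i].
    best = [0.0] + [-math.inf] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            tok = text[start:end]
            if tok in lp:
                best[end] = max(best[end], best[start] + lp[tok])
    return best[-1]

base = sum(best_logprob(w, log_p) for w in corpus)

# For each multi-character token, measure how much corpus log-likelihood
# is lost when that token is removed from the vocabulary.
losses = {}
for tok in [t for t in log_p if len(t) > 1]:
    reduced = {t: v for t, v in log_p.items() if t != tok}
    losses[tok] = base - sum(best_logprob(w, reduced) for w in corpus)

# The cheapest removal is pruned first; here "ug" costs nothing because
# "hug" and "bug" already cover its uses.
print(min(losses, key=losses.get))  # -> "ug"
```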

    Unigram Model (Tokenization) vs. WordPiece

    Like Unigram, WordPiece uses a likelihood criterion, but it builds its vocabulary bottom-up by merging; Unigram works top-down by pruning and additionally supports subword regularization.
