
    WordPiece

    Updated: 2/10/2026

    Subword tokenization algorithm developed by Google that maximizes training corpus likelihood.

    Quick Summary

    WordPiece is Google's subword tokenizer for BERT; it selects merges that maximize training corpus likelihood rather than raw pair frequency, as BPE does.

    Explanation

    WordPiece builds its vocabulary by repeatedly merging the pair of symbols whose merge most increases the probability of the training corpus under the current vocabulary. BERT uses WordPiece and marks subword continuations with a "##" prefix, so "playing" becomes "play", "##ing".
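    At inference time, WordPiece splits each word greedily from the left, always taking the longest vocabulary entry that matches. Below is a minimal Python sketch of this longest-match-first behavior; the vocabulary and function name are illustrative, not BERT's actual resources.

        def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
            tokens = []
            start = 0
            while start < len(word):
                end = len(word)
                match = None
                # Try the longest remaining substring first, shrinking until a vocab hit.
                while start < end:
                    piece = word[start:end]
                    if start > 0:
                        piece = "##" + piece  # continuation pieces carry the "##" prefix
                    if piece in vocab:
                        match = piece
                        break
                    end -= 1
                if match is None:
                    return [unk_token]  # no piece matched, so the whole word maps to [UNK]
                tokens.append(match)
                start = end
            return tokens

        vocab = {"un", "##aff", "##able", "play", "##ing"}
        print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
        print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']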

    Marketing Relevance

    WordPiece is the tokenizer behind BERT and many Google NLP models.

    Common Pitfalls

    The "##" prefix can be confusing in text generation. Not as widely used as BPE in modern LLMs.

    Origin & History

    Google originally developed WordPiece for Japanese/Korean speech recognition (Schuster & Nakajima, 2012). It was adapted for BERT (2018) and became the standard tokenizer for the BERT family.

    Comparisons & Differences

    WordPiece vs. BPE

    BPE merges the most frequent symbol pair at each step; WordPiece merges the pair that most increases corpus likelihood. BPE dominates the GPT family; WordPiece dominates the BERT family.
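    The difference shows up in the merge-selection rule: BPE picks the pair with the highest raw count, while WordPiece picks the pair with the highest score count(ab) / (count(a) * count(b)), i.e. the merge that most increases corpus likelihood. A sketch with invented counts:

        # Toy counts; the numbers are made up purely for illustration.
        pair_counts = {("h", "u"): 15, ("u", "g"): 20, ("g", "s"): 5}
        symbol_counts = {"h": 15, "u": 36, "g": 25, "s": 5}

        # BPE: merge the most frequent pair.
        bpe_pick = max(pair_counts, key=pair_counts.get)

        # WordPiece: merge the pair with the highest likelihood score.
        def wordpiece_score(pair):
            a, b = pair
            return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

        wordpiece_pick = max(pair_counts, key=wordpiece_score)

        print("BPE merges:      ", bpe_pick)        # ('u', 'g'), the highest raw count
        print("WordPiece merges:", wordpiece_pick)  # ('g', 's'), rare symbols boost the score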

    WordPiece vs. Unigram

    Unigram starts with a large candidate vocabulary and prunes tokens away; WordPiece builds its vocabulary bottom-up through merges. Unigram is the default algorithm in SentencePiece.

