
    Vocabulary (NLP)

    Updated: 2/10/2026

    The complete set of all tokens that a language model knows and can process.

    Quick Summary

An LLM's vocabulary defines all tokens it can represent; its size (typically 32K-128K tokens) affects tokenization efficiency, inference costs, and multilingual coverage.

    Explanation

The vocabulary defines the "language" of a model: the fixed set of tokens it can read and emit. GPT-4's tokenizer has ~100,000 tokens; Llama 3's has 128,000. A larger vocabulary produces shorter token sequences, but also requires a larger embedding matrix.
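To make the embedding-matrix trade-off concrete, here is a minimal sketch computing the parameter count of the token-embedding table. Vocabulary sizes come from the text; the hidden sizes (768 for GPT-2, 4096 for Llama 3 8B) are assumptions based on published model configurations.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in the token-embedding matrix: one d_model-dim vector per token."""
    return vocab_size * d_model

# GPT-2: 50,257 tokens, assumed hidden size 768
print(embedding_params(50_257, 768))    # 38,597,376 (~38.6M parameters)

# Llama 3 8B: 128,000 tokens, assumed hidden size 4096
print(embedding_params(128_000, 4096))  # 524,288,000 (~524M parameters)
```

Quadrupling the hidden size and roughly 2.5x-ing the vocabulary multiplies the embedding table by more than 13x, which is why vocabulary growth is not free.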

    Marketing Relevance

Vocabulary size directly affects tokenization efficiency (how many tokens, and therefore how much cost, a given text incurs), model size, and multilingual capabilities.

    Common Pitfalls

A vocabulary that is too small fragments words into excessively many subword tokens; one that is too large wastes parameters on rarely used embeddings. Words absent from the vocabulary fall back to out-of-vocabulary (OOV) handling, such as an unknown token or byte-level fallback.
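Both pitfalls can be illustrated with a toy greedy longest-match subword tokenizer. The vocabularies below are invented for illustration; real tokenizers (e.g. BPE) learn their merges from data, but the fragmentation effect is the same.

```python
def tokenize(text: str, vocab: set, unk: str = "<unk>") -> list:
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(unk)  # not even a single character matched: OOV
            i += 1
    return tokens

small = {"to", "ken", "iz", "a", "tion"}
large = small | {"token", "ization"}

print(tokenize("tokenization", small))  # ['to', 'ken', 'iz', 'a', 'tion']
print(tokenize("tokenization", large))  # ['token', 'ization']
print(tokenize("x", small))             # ['<unk>']
```

The same word costs five tokens under the small vocabulary but only two under the larger one, and characters outside the vocabulary degrade to the unknown token.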

    Origin & History

    Early NLP systems used word-based vocabularies with 50,000-100,000 entries. Subword tokenization (BPE, 2016) reduced OOV problems. GPT-2 used 50,257 tokens, GPT-4 expanded to ~100,000, Llama 3 to 128,000 for better multilingual support.

    Comparisons & Differences

    Vocabulary (NLP) vs. Embedding

    Vocabulary defines which tokens exist; embeddings assign each token a vector encoding its meaning.
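The distinction can be sketched in a few lines, assuming a toy four-token vocabulary and 3-dimensional embeddings (all names and sizes here are illustrative, not from any real model):

```python
# Vocabulary: token -> ID (which tokens exist)
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

# Embedding table: ID -> vector (what each token "means"); values are arbitrary.
d_model = 3
embeddings = [[0.01 * (token_id * d_model + k) for k in range(d_model)]
              for token_id in range(len(vocab))]

def lookup(token: str) -> list:
    """Vocabulary answers 'which ID?'; the embedding table answers 'which vector?'."""
    token_id = vocab.get(token, vocab["<unk>"])
    return embeddings[token_id]

print(lookup("cat"))  # the 3-dim vector stored at ID 1
print(lookup("dog"))  # unknown word: falls back to the <unk> vector at ID 3
```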

    Vocabulary (NLP) vs. Dictionary

    A dictionary contains word definitions; an NLP vocabulary is a token-ID mapping without linguistic meaning.

