Vocabulary (NLP)
The complete set of all tokens that a language model knows and can process.
An LLM's vocabulary defines every token it can represent; its size (typically 32K to 128K entries) affects tokenization efficiency, inference costs, and multilingual capability.
Explanation
The vocabulary defines the "language" of a model: GPT-4 uses roughly 100,000 tokens, while Llama 3 uses 128,000. A larger vocabulary yields shorter token sequences but requires a larger embedding matrix.
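The trade-off can be seen in a toy sketch (illustrative only, not a real tokenizer): the same sentence encoded at character level, where the vocabulary stays small and fixed, versus word level, where the vocabulary grows with the corpus but sequences shrink.

```python
# Toy sketch (illustrative, not a real tokenizer): the same sentence encoded at
# character level (small, fixed vocabulary) versus word level (large, open-ended
# vocabulary at corpus scale). Smaller vocabulary -> longer sequence, and vice versa.
text = "larger vocabularies shorten sequences"

char_tokens = list(text)    # character-level: vocabulary is at most ~100 symbols
word_tokens = text.split()  # word-level: vocabulary can reach tens of thousands

print(len(char_tokens))  # 37 tokens
print(len(word_tokens))  # 4 tokens
```

Real subword vocabularies (BPE, WordPiece) sit between these two extremes, which is why production models settle in the 32K-128K range.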
Marketing Relevance
Vocabulary size directly affects tokenization efficiency, model size, and multilingual capabilities.
Common Pitfalls
A vocabulary that is too small fragments words into excessively many subword pieces; one that is too large wastes parameters in the embedding matrix. Word-level vocabularies additionally produce out-of-vocabulary (OOV) tokens for unseen words.
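The OOV pitfall can be sketched with a hypothetical word-level vocabulary (the entries and the `<unk>` token name are illustrative): any word missing from the vocabulary collapses to a single "unknown" ID, losing its content entirely.

```python
# Hypothetical word-level vocabulary; tokens and IDs are illustrative.
UNK = "<unk>"
vocab = {UNK: 0, "the": 1, "model": 2, "reads": 3, "text": 4}

def encode(words):
    # Any word missing from the vocabulary collapses to the OOV token's ID,
    # so its content is lost.
    return [vocab.get(w, vocab[UNK]) for w in words]

print(encode(["the", "model", "reads", "tokenizers"]))  # [1, 2, 3, 0]
```

Subword tokenizers avoid this by splitting unknown words into known pieces instead of mapping them to a single unknown ID.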
Origin & History
Early NLP systems used word-based vocabularies with 50,000-100,000 entries. Subword tokenization (BPE, 2016) reduced OOV problems. GPT-2 used 50,257 tokens; GPT-4 expanded to roughly 100,000, and Llama 3 to 128,000 for better multilingual support.
Comparisons & Differences
Vocabulary (NLP) vs. Embedding
Vocabulary defines which tokens exist; embeddings assign each token a vector encoding its meaning.
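A minimal sketch makes the division of labor concrete (all tokens, IDs, and vector values below are made up for illustration): the vocabulary answers "which tokens exist?", the embedding matrix answers "what vector does each token get?".

```python
# Illustrative sketch with made-up values: the vocabulary maps tokens to IDs;
# the embedding matrix maps each ID to a vector.
vocab = {"hello": 0, "world": 1}   # token -> ID (the vocabulary)
embeddings = [                     # ID -> vector (embedding matrix, |V| rows x d columns)
    [0.1, -0.3, 0.7, 0.0],         # vector for "hello"
    [0.5, 0.2, -0.1, 0.9],         # vector for "world"
]

token_id = vocab["world"]          # vocabulary lookup
vector = embeddings[token_id]      # embedding lookup
print(token_id, vector)            # 1 [0.5, 0.2, -0.1, 0.9]
```

In a real model the embedding matrix is learned during training, which is why a larger vocabulary directly increases parameter count.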
Vocabulary (NLP) vs. Dictionary
A dictionary contains word definitions; an NLP vocabulary is a token-ID mapping without linguistic meaning.
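The contrast can be shown in a few lines (tokens here are illustrative): an NLP vocabulary is nothing more than a reversible token-to-ID mapping, with no definitions attached.

```python
# Minimal sketch (illustrative subword tokens): an NLP vocabulary is a plain
# token <-> ID mapping with no definitions attached, unlike a dictionary.
token_to_id = {"sun": 0, "rise": 1, "##s": 2}
id_to_token = {i: t for t, i in token_to_id.items()}

ids = [token_to_id[t] for t in ("sun", "rise", "##s")]
words = [id_to_token[i] for i in ids]
print(ids, words)  # [0, 1, 2] ['sun', 'rise', '##s']
```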