Lemmatization
Linguistically informed reduction of words to their base form (lemma) considering part of speech and context.
Lemmatization reduces words to their dictionary base form (lemma). It is more precise than stemming and is built into spaCy and other modern NLP libraries.
Explanation
Lemmatization uses morphological analysis and dictionaries: "better" → "good", "ran" → "run", "mice" → "mouse". It is slower than stemming, but it returns valid dictionary forms rather than truncated stems.
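The dictionary-plus-morphology idea can be sketched in a few lines of Python. This is a toy illustration, not how spaCy or any other library implements it: the `IRREGULAR` table, the `SUFFIX_RULES` list and the `toy_lemmatize` function are all hypothetical names invented for this example, and the lexicon holds only the three words mentioned above.

```python
# Toy lexicon of irregular forms (assumption: real lemmatizers ship far larger ones).
IRREGULAR = {"better": "good", "ran": "run", "mice": "mouse"}

# Hypothetical suffix rules, tried in order when no irregular entry matches.
SUFFIX_RULES = [("ies", "y"), ("s", "")]


def toy_lemmatize(word: str) -> str:
    """Return a lemma via dictionary lookup first, then simple suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix, replacement in SUFFIX_RULES:
        # Require a few remaining characters so short words like "is" are left alone.
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)] + replacement
    return w


for form in ["better", "ran", "mice", "studies", "cats"]:
    print(form, "->", toy_lemmatize(form))
```

The lookup runs before the rules because irregular forms cannot be derived by suffix stripping. The rules here are deliberately crude and will mishandle many words (for example, nouns ending in "ss").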
Marketing Relevance
Lemmatization provides more precise results than stemming for linguistically demanding NLP applications.
Common Pitfalls
Correct results often require POS tagging: "saw" lemmatizes to "see" as a verb but stays "saw" as a noun. Lemmatization is also slower than stemming and needs a dictionary or model for each language.
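A minimal sketch of why the part of speech matters, assuming a hand-written table keyed by (word, POS tag). The `POS_LEMMAS` table and the `lemma_for` function are hypothetical, and the tag names are only illustrative labels, not the tag set of any specific library.

```python
# Hypothetical lexicon keyed by (surface form, part of speech).
POS_LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lemma_for(word: str, pos: str) -> str:
    """Look up the lemma for a word given its POS tag; fall back to the lowercased word."""
    key = (word.lower(), pos)
    return POS_LEMMAS.get(key, word.lower())


print(lemma_for("saw", "VERB"))      # verb reading of "saw"
print(lemma_for("saw", "NOUN"))      # noun reading of "saw"
print(lemma_for("meeting", "VERB"))  # verb reading of "meeting"
```

Without the tag, the lookup has no way to choose between the two entries for the same surface form, which is why a wrong or missing POS tag leads to a wrong lemma.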
Origin & History
Lemmatization has roots in computational linguistics research of the 1960s. WordNet (Princeton, started in 1985) became a standard lemma lexicon for English. spaCy (2015) and Stanza (Stanford, 2020) made lemmatization practical in Python.
Comparisons & Differences
Lemmatization vs. Stemming
Stemming is fast and rule-based but imprecise: it strips suffixes and can return non-words such as "studi" for "studies". Lemmatization uses linguistic knowledge to return the correct base form, "study".
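The contrast can be shown with a toy example. The `crude_stem` function below is a simplified suffix stripper written for this illustration (it is not the Porter algorithm), and `LEMMA_TABLE` with `lookup_lemma` is a hypothetical stand-in for a real lemma lexicon.

```python
def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in ("ing", "es", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


# Hypothetical lemma lexicon; a real one would cover the whole vocabulary.
LEMMA_TABLE = {"studies": "study", "running": "run", "mice": "mouse", "better": "good"}


def lookup_lemma(word: str) -> str:
    """Return the dictionary lemma if known, otherwise the lowercased word."""
    return LEMMA_TABLE.get(word.lower(), word.lower())


for form in ["studies", "running", "mice", "better"]:
    print(f"{form}: stem={crude_stem(form)} lemma={lookup_lemma(form)}")
```

The stemmer needs no lexicon and only touches the end of the string, so it cannot relate "mice" to "mouse" or "better" to "good". The lemmatizer can, but only for words its dictionary covers.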
Lemmatization vs. Tokenization
Tokenization splits text into units; lemmatization normalizes these units to their base form.
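How the two steps chain together can be sketched as follows. The regex tokenizer and the `TOKEN_LEMMAS` table are simplifying assumptions for illustration; production pipelines use language-specific tokenizers and full lexicons or trained models.

```python
import re

# Hypothetical lemma table covering only the words in the example sentence.
TOKEN_LEMMAS = {"mice": "mouse", "ran": "run"}


def tokenize(text: str) -> list[str]:
    """Split text into runs of ASCII letters and single non-space symbols."""
    return re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)


def lemmatize_tokens(tokens: list[str]) -> list[str]:
    """Map each token to its lemma, falling back to the lowercased token."""
    return [TOKEN_LEMMAS.get(t.lower(), t.lower()) for t in tokens]


tokens = tokenize("The mice ran.")
print(tokens)
print(lemmatize_tokens(tokens))
```

Tokenization decides where the units begin and end; lemmatization then operates on each unit separately, so a tokenization error carries over into the lemmas.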