Stemming
Rule-based reduction of words to their stem by removing suffixes.
Stemming reduces words to their stem using simple rules and is widely used in search engines and text retrieval. It is fast but less accurate than lemmatization.
Explanation
Stemming cuts word endings: "running" → "run", "computers" → "comput". It is fast but imprecise – the stem doesn't have to be a real word.
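To make the idea of rule-based suffix removal concrete, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm: the suffix list, the minimum stem length of three characters, and the name toy_stem are all assumptions chosen for illustration.

```python
# Toy suffix-stripping stemmer. NOT the Porter algorithm, only an illustration.
# The suffix list and the minimum stem length are hypothetical choices.
SUFFIXES = ("ers", "ing", "ed", "es", "s")  # checked in this order


def toy_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, but only if at least 3 characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling ("runn" -> "run"); keep "ll" and "ss" as they are.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


print(toy_stem("running"))    # run
print(toy_stem("computers"))  # comput
```

The second result shows why a stem need not be a real word: the rules only cut characters and never check the result against a dictionary.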
Marketing Relevance
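Search engines and information retrieval systems use stemming to normalize text: query terms and document terms are reduced to the same stem, so a search for "sneakers" can also match pages that only say "sneaker".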
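The normalization step can be sketched with a tiny inverted index in which both documents and queries pass through the same stemmer. Everything here is assumed for illustration: the toy_stem rules, the sample documents, and the function names build_index and search do not come from any particular search engine.

```python
from collections import defaultdict


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common endings.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    # Map each stem to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    # Return the ids of documents that contain every stemmed query term.
    terms = [toy_stem(t) for t in query.lower().split()]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "discount sneakers for trail runners",
    2: "buy a sneaker online",
    3: "leather boots",
}
index = build_index(docs)
print(sorted(search(index, "sneakers")))  # [1, 2]
print(sorted(search(index, "boot")))      # [3]
```

Because the index stores stems rather than surface forms, singular and plural queries retrieve the same documents without any query expansion.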
Common Pitfalls
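Over-stemming: words with different meanings are reduced to the same stem, e.g. the Porter stemmer maps both "universe" and "university" to "univers". Under-stemming: related forms are not reduced to a common stem, e.g. a pure suffix stripper leaves irregular pairs such as "mouse" and "mice" with different stems.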
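Both pitfalls can be reproduced with a deliberately crude rule set. The sketch below is an assumption-laden toy: the RULES tuple and the helper names toy_stem and conflated are hypothetical and much simpler than any production stemmer.

```python
RULES = ("ity", "e", "s")  # hypothetical, deliberately crude suffix rules


def toy_stem(word):
    # Strip the first matching suffix if at least 3 characters remain.
    for suffix in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflated(a, b):
    # True if the stemmer treats both words as the same term.
    return toy_stem(a) == toy_stem(b)


# Over-stemming: unrelated words collapse to the same stem "univers".
print(conflated("universe", "university"))  # True

# Under-stemming: related forms keep different stems ("mous" vs. "mic").
print(conflated("mouse", "mice"))  # False
```

In a search system the first case hurts precision (irrelevant matches), while the second hurts recall (missed matches).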
Origin & History
Martin Porter published the Porter stemmer in 1980; it remains the best-known stemming algorithm. Snowball, Porter's framework for writing stemmers, followed in 2001 with a revised English stemmer (Porter2) and stemmers for further languages. With the rise of LLMs, stemming is losing importance but remains relevant in classical search systems.
Comparisons & Differences
Stemming vs. Lemmatization
Stemming strips suffixes using rules; lemmatization uses vocabulary and morphological knowledge and returns dictionary forms (lemmas).
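The contrast can be illustrated with two toy functions: one applies suffix rules, the other looks words up in a small table. Both are sketches under stated assumptions: STEM_RULES and LEMMA_TABLE are hypothetical, and a real lemmatizer relies on a full lexicon and part-of-speech information rather than a three-entry dictionary.

```python
# Hypothetical rule list: (suffix to remove, replacement).
STEM_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

# Hypothetical miniature lexicon mapping inflected forms to lemmas.
LEMMA_TABLE = {"studies": "study", "mice": "mouse", "ran": "run"}


def toy_stem(word):
    # Rule-based: rewrite the first matching suffix, no dictionary involved.
    for suffix, replacement in STEM_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    # Knowledge-based: look the word up and fall back to the word itself.
    return LEMMA_TABLE.get(word, word)


for word in ("studies", "mice", "ran"):
    print(word, toy_stem(word), toy_lemmatize(word))
# studies studi study
# mice mice mouse
# ran ran run
```

The stemmer is cheap but produces the non-word "studi" and cannot handle irregular forms; the lookup handles them but only for words its table knows.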
Stemming vs. Subword Tokenization
Stemming normalizes for retrieval; subword tokenization splits for neural models – different goals and methods.
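For comparison, here is a minimal sketch of greedy longest-match subword splitting in the style of WordPiece. The vocabulary VOCAB, the "##" continuation marker as used here, and the function name toy_subwords are assumptions for illustration; real tokenizers learn their vocabulary from data.

```python
# Hypothetical vocabulary; "##" marks a piece that continues a word.
VOCAB = {"comput", "##ers", "##ing", "run", "##n", "##s"}


def toy_subwords(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no vocabulary piece covers this position
        pieces.append(match)
        start = end
    return pieces


print(toy_subwords("computers"))  # ['comput', '##ers']
print(toy_subwords("running"))    # ['run', '##n', '##ing']
```

Unlike a stemmer, the splitter discards nothing: the pieces can be joined back into the original word, which is what a neural model needs, whereas retrieval only needs matching terms to share a key.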