TF-IDF
Statistical measure for evaluating the relevance of a word in a document relative to a document collection.
TF-IDF scores a word's relevance by combining its frequency within a document (TF) with its rarity across the corpus (IDF) – the foundation of classical search systems and of BM25.
Explanation
TF (Term Frequency) measures how often a word occurs in a document; IDF (Inverse Document Frequency) down-weights words that appear in many documents, commonly computed as IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The score is TF-IDF = TF × IDF. "Marketing" in a marketing blog has high TF, but in a corpus of marketing articles it appears in nearly every document, so its IDF – and thus its TF-IDF weight – is low.
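A minimal sketch of the computation in Python, using raw term frequency and the log-based IDF above (the toy corpus and whitespace tokenization are illustrative only; libraries such as scikit-learn's TfidfVectorizer use smoothed variants, so exact values differ):

```python
import math
from collections import Counter

# Toy corpus; the documents are illustrative only.
docs = [
    "marketing strategy for content marketing",
    "email marketing tips",
    "quarterly financial report",
]

def tf_idf(term: str, doc: str, corpus: list[str]) -> float:
    """TF-IDF with raw term frequency and log-based IDF: tf * log(N / df)."""
    tokens = doc.split()
    tf = Counter(tokens)[term] / len(tokens)          # term frequency in this document
    df = sum(1 for d in corpus if term in d.split())  # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0   # rarer terms get higher weight
    return tf * idf

print(tf_idf("marketing", docs[0], docs))  # frequent term, but appears in 2 of 3 docs
print(tf_idf("financial", docs[2], docs))  # rarer term, appears in only 1 doc
```

"marketing" scores lower per occurrence than "financial" despite appearing more often, because it occurs in most documents of the corpus.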
Marketing Relevance
TF-IDF is a building block for search engines, information retrieval, and classical NLP.
Common Pitfalls
Ignores word meaning and word order (bag-of-words). Cannot match synonyms. Increasingly replaced by dense retrieval.
Origin & History
Karen Spärck Jones coined the IDF concept in 1972 at Cambridge. TF-IDF became the standard weighting scheme in information retrieval. BM25 (Robertson et al., 1994) improved on TF-IDF with term-frequency saturation and document length normalization. Despite the rise of dense retrieval, TF-IDF remains relevant in hybrid search systems.
Comparisons & Differences
TF-IDF vs. BM25
BM25 is an evolution of TF-IDF with a term-frequency saturation function and document length normalization – the default ranking function in Elasticsearch and Lucene.
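A rough sketch of the BM25 idea for a single term, assuming the typical default parameters k1 = 1.5 and b = 0.75 (real implementations such as Lucene's differ in details):

```python
import math

def bm25_term_score(tf: float, doc_len: int, avg_doc_len: float,
                    n_docs: int, df: int, k1: float = 1.5, b: float = 0.75) -> float:
    """Simplified BM25 contribution of a single term to a document's score."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)    # BM25-style IDF
    norm = 1 - b + b * (doc_len / avg_doc_len)              # document length normalization
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)         # saturation: gains shrink as tf grows

# Plain TF-IDF grows linearly with term frequency; BM25 saturates:
# going from tf=1 to tf=10 adds far less than 10x the score.
for tf in (1, 2, 10):
    print(tf, round(bm25_term_score(tf, doc_len=100, avg_doc_len=120, n_docs=1000, df=50), 3))
```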
TF-IDF vs. Dense Retrieval
TF-IDF relies on exact term matching over sparse vocabulary vectors; dense retrieval encodes queries and documents as dense semantic vectors and ranks by meaning similarity, so it can also match synonyms and paraphrases.
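A schematic contrast with toy vectors (the dense numbers are hypothetical, not from a real embedding model): sparse TF-IDF similarity is zero when query and document share no terms, while dense vectors can still place synonyms close together.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Sparse TF-IDF vectors over the vocabulary ["car", "automobile", "cheap"]:
# query "cheap car" vs. document "affordable automobile" share no terms.
query_sparse = [1.2, 0.0, 0.8]
doc_sparse   = [0.0, 1.1, 0.0]
print(cosine(query_sparse, doc_sparse))          # 0.0 – exact matching misses the synonym

# Hypothetical dense embeddings (illustrative numbers): a trained encoder
# maps "car" and "automobile" to nearby points in vector space.
query_dense = [0.71, 0.65, 0.12]
doc_dense   = [0.69, 0.70, 0.05]
print(round(cosine(query_dense, doc_dense), 3))  # high similarity despite no shared words
```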