Text Normalization
Standardizing text data by converting it to a uniform form: lowercasing, Unicode normalization, character replacement, and more.
Text normalization converts raw text into a consistent, canonical form (lowercasing, Unicode normalization, whitespace cleanup) and is typically the first step of any NLP pipeline.
Explanation
Text normalization typically includes: lowercasing ("AI" → "ai"), Unicode normalization (bringing accented characters and umlauts into a canonical encoding), whitespace cleanup (collapsing runs of spaces, tabs, and line breaks), special-character handling, and number standardization.
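The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline; the function name `normalize_text` is chosen here for the example.

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Minimal normalization sketch: Unicode NFC, lowercasing, whitespace cleanup."""
    # Unicode normalization: compose characters into canonical form (NFC),
    # so "e" + combining accent becomes the single code point "é".
    text = unicodedata.normalize("NFC", text)
    # Lowercasing: "AI" -> "ai"
    text = text.lower()
    # Whitespace cleanup: collapse runs of spaces/tabs/newlines, trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_text("  AI  and\tNLP\n"))  # -> "ai and nlp"
```

Real pipelines usually add language-aware steps on top of this (see Common Pitfalls below), but the order shown here — Unicode first, then casing, then whitespace — is a common convention.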
Marketing Relevance
Text normalization is the first step of any NLP pipeline and affects the quality of all subsequent processing steps.
Common Pitfalls
Over-normalization destroys information: lowercasing, for example, removes the casing cues that named entity recognition (NER) relies on. Normalization rules are language-specific (German umlauts and ß, Turkish dotless ı). Unicode edge cases, such as visually identical strings with different code-point sequences, cause subtle mismatches.
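These pitfalls are easy to demonstrate. The snippet below is an illustrative sketch of three of them: a Unicode edge case, casing loss, and the way the more aggressive NFKC form can alter meaning.

```python
import unicodedata

# Unicode edge case: "é" can be one code point (U+00E9) or
# "e" + combining accent (U+0065 U+0301). The strings look identical
# but compare unequal until normalized.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
assert composed != decomposed
assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)

# Over-normalization: after lowercasing, an NER system can no longer
# distinguish the company "Apple" from the fruit "apple".
assert "Apple".lower() == "apple".lower()

# NFKC goes further than NFC and can change meaning:
# the superscript "²" becomes a plain "2".
assert unicodedata.normalize("NFKC", "m²") == "m2"
```

Which form (NFC vs. NFKC) and which steps to apply is therefore a per-task decision, not a fixed recipe.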
Origin & History
Text normalization has been part of computational linguistics research since the 1960s. Unicode standard (1991) formalized character encoding. Modern systems use regex and Unicode libraries (ICU) for normalization. LLM tokenizers increasingly handle normalization automatically.
Comparisons & Differences
Text Normalization vs. Tokenization
Normalization cleans and standardizes text; tokenization splits the normalized text into token units.
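The division of labor between the two steps can be sketched as two separate functions. This is a simplified illustration: `normalize` and `tokenize` are example names, and the whitespace/word tokenizer stands in for the subword tokenizers (BPE, WordPiece) used in modern systems.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    # Normalization: clean and standardize the raw string.
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list[str]:
    # Tokenization: split the normalized string into token units.
    return re.findall(r"\w+", text)

raw = "  Text   Normalization\nfirst! "
print(tokenize(normalize(raw)))  # -> ['text', 'normalization', 'first']
```

Keeping the two steps separate makes each one testable on its own and mirrors how most NLP pipelines are structured.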