Chunking
The process of dividing large documents into smaller, semantically coherent text segments for efficient embedding and retrieval in RAG systems.
Chunking divides documents into optimal text segments for RAG – the right chunk size determines retrieval quality and answer precision.
Explanation
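Common chunking strategies: Fixed-Size (simple, but can cut through sentences and destroy context), Semantic (uses NLP to split at natural boundaries such as topic shifts), Recursive (hierarchical splitting that tries coarse separators such as paragraphs first and falls back to finer ones), Sentence-Window (each sentence is stored with its neighboring sentences as overlapping context). Chunk size sets the precision vs. context trade-off: small chunks give precise matches but little context; large chunks give more context but less precise search.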
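As a rough illustration of two of the strategies named above, the sketch below implements fixed-size chunking with overlap and a sentence-window variant. It makes simplifying assumptions: whitespace-separated words stand in for model tokens, sentences are split with a naive regex, and the function names and default values are hypothetical rather than taken from any library.

```python
import re


def fixed_size_chunks(text: str, size: int = 300, overlap: int = 45) -> list[str]:
    """Split text into chunks of `size` words; neighbors share `overlap` words.

    Assumption: words approximate tokens. A real pipeline would count
    tokens with the embedding model's own tokenizer.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # this window reached the end
            break
    return chunks


def sentence_windows(text: str, window: int = 1) -> list[dict]:
    """Pair every sentence with up to `window` neighbors on each side.

    "match" is the text to embed and search; "context" is what would be
    handed to the LLM once that sentence is retrieved.
    """
    # Naive splitter: breaks after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    result = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        result.append({"match": sentence, "context": " ".join(sentences[lo:hi])})
    return result


print(fixed_size_chunks("0 1 2 3 4 5 6 7 8 9", size=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

The fixed-size version shows where context gets destroyed: chunk boundaries fall wherever the word count says, regardless of sentence ends. The sentence-window version keeps matches small while still returning surrounding text.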
Marketing Relevance
Chunking is critical for RAG quality in marketing. A poorly chosen chunk size leads to irrelevant or out-of-context answers. A common starting point for marketing content is 200-500 tokens with 10-20% overlap.
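To make the size and overlap figures concrete, the following sketch works out how many tokens neighboring chunks share and how many chunks a document produces. The helper name `overlap_plan` and the sample numbers are illustrative assumptions, and the count assumes a simple sliding window that advances by chunk size minus overlap.

```python
import math


def overlap_plan(total_tokens: int, chunk_size: int, overlap_ratio: float) -> dict:
    """Return overlap (tokens), step (tokens) and chunk count for a sliding window."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("need chunk_size > 0 and 0 <= overlap_ratio < 1")
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # new tokens contributed by each further chunk
    if total_tokens <= 0:
        chunks = 0
    else:
        # The first chunk covers `chunk_size` tokens, each later one adds `step`.
        chunks = max(1, math.ceil((total_tokens - overlap) / step))
    return {"overlap": overlap, "step": step, "chunks": chunks}


print(overlap_plan(total_tokens=1000, chunk_size=400, overlap_ratio=0.15))
# {'overlap': 60, 'step': 340, 'chunks': 3}
```

In this example a 1,000-token page becomes three chunks of up to 400 tokens, with 60 tokens repeated at each boundary.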
Example
A Knowledge GPT for product FAQs: Small chunks (1-2 sentences) for factual questions ("What does X cost?"), larger chunks (1-2 paragraphs) for conceptual questions ("How does our onboarding work?").
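One way to support both question types is to chunk the same FAQ at two granularities and keep both sets available for retrieval. The sketch below is a minimal version of that idea; the function names, the grouping sizes, and the sample FAQ text are assumptions made for illustration, and how a system chooses between the two sets is left open.

```python
import re


def sentence_chunks(text: str, per_chunk: int = 2) -> list[str]:
    """Small chunks of `per_chunk` sentences, suited to factual lookups."""
    # Naive splitter: breaks after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(sentences[i:i + per_chunk])
            for i in range(0, len(sentences), per_chunk)]


def paragraph_chunks(text: str, per_chunk: int = 2) -> list[str]:
    """Larger chunks of `per_chunk` paragraphs, suited to conceptual questions."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return ["\n\n".join(paragraphs[i:i + per_chunk])
            for i in range(0, len(paragraphs), per_chunk)]


faq = (
    "Plan X costs 10 EUR. Billing is monthly.\n\n"
    "Onboarding starts with a kickoff call. Then we import your data."
)
small = sentence_chunks(faq, per_chunk=1)   # four one-sentence chunks
large = paragraph_chunks(faq, per_chunk=1)  # two one-paragraph chunks
```

A price question would ideally match the one-sentence chunk "Plan X costs 10 EUR.", while an onboarding question is better served by the full second paragraph.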
Common Pitfalls
Using one-size-fits-all chunking for different content types. Omitting overlap, which loses context at chunk boundaries. Making chunks so small that coherence is destroyed. Leaving metadata (title, chapter) out of the chunks.
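The metadata pitfall can be avoided by carrying title and chapter alongside each chunk and prepending them to the text that gets embedded. The sketch below shows one possible shape for this; the `Chunk` class, its field names, and the "title > chapter" prefix format are hypothetical choices, not a standard.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str     # the original passage, unchanged
    title: str    # document title (assumed to be known at ingestion time)
    chapter: str  # section or chapter heading the passage came from

    def embedding_text(self) -> str:
        """Text to embed: metadata header on the first line, then the passage."""
        return f"{self.title} > {self.chapter}\n{self.text}"


chunk = Chunk(
    text="Refunds are issued within 14 days.",
    title="Product FAQ",
    chapter="Billing",
)
print(chunk.embedding_text())
# Product FAQ > Billing
# Refunds are issued within 14 days.
```

Keeping the raw `text` separate from the embedded string means the answer shown to the user can still quote the passage without the prefix.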
Origin & History
Text segmentation has existed since classical NLP. With the rise of RAG (2020 onward), chunking became critical: LangChain and LlamaIndex popularized various strategies (fixed, recursive, semantic). In 2024, context-aware and hierarchical approaches gained importance.
Comparisons & Differences
Chunking vs. Tokenization
Tokenization breaks text into sub-word units for LLM input; chunking divides documents into semantically coherent sections for retrieval.
Chunking vs. Summarization
Summarization condenses information; chunking preserves the original text and only makes it retrievable.