Attention Pooling
Attention pooling aggregates a sequence of vectors into a single representation vector by assigning each element a learned attention weight, so the most relevant elements contribute the most.
Attention pooling weights token representations by learned relevance instead of uniformly, producing better embeddings that focus on the most informative elements.
Explanation
Instead of mean pooling (all tokens weighted equally) or the CLS token (only one token's representation), attention pooling learns which tokens are most informative and weights them accordingly. It is used for sentence embeddings, document representation, and multi-instance learning.
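A minimal sketch of what this looks like in practice, written in PyTorch and assuming token embeddings of shape (batch, seq_len, hidden) plus an optional padding mask; the AttentionPooling module and its single-layer scorer are illustrative choices, not the implementation of any particular model:

```python
import torch
import torch.nn as nn
from typing import Optional

class AttentionPooling(nn.Module):
    """Collapse (batch, seq_len, hidden) token embeddings into (batch, hidden)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        # Learned scorer: one relevance score per token.
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, token_embeddings: torch.Tensor,
                attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Raw per-token scores: (batch, seq_len)
        scores = self.score(token_embeddings).squeeze(-1)
        if attention_mask is not None:
            # Padding tokens get -inf so they receive zero weight after softmax.
            scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)            # (batch, seq_len)
        # Weighted sum of token embeddings -> one vector per sequence.
        return torch.einsum("bs,bsh->bh", weights, token_embeddings)

# Usage: pool an encoder's last hidden states into sentence embeddings.
pool = AttentionPooling(hidden_dim=768)
hidden_states = torch.randn(2, 12, 768)                    # e.g. transformer output
mask = torch.ones(2, 12, dtype=torch.long)
sentence_embeddings = pool(hidden_states, mask)            # (2, 768)
```

The scorer here is a single linear layer; the multi-instance-learning formulation of Ilse et al. (2018) puts a tanh (optionally gated) transformation before the softmax, and embedding models vary in how scores are computed, but the weighted-sum aggregation is the same.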
Marketing Relevance
Improves embedding quality for retrieval and similarity search – important for RAG pipelines and semantic search.
Origin & History
Attention pooling was developed in several contexts: multi-instance learning (Ilse et al., 2018), sentence embeddings, and document classification. Widely used embedding models mostly rely on simpler schemes (E5 uses mean pooling, BGE uses the CLS token), while some newer models adopt learned attention-pooling variants for richer representations.
Comparisons & Differences
Attention Pooling vs. Mean Pooling
Mean pooling weights all tokens equally; attention pooling learns different weights based on relevance.
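A small, self-contained sketch of the difference, with random tensors standing in for real encoder outputs and learned scores:

```python
import torch

hidden_states = torch.randn(2, 12, 768)    # encoder output: (batch, seq_len, hidden)
mask = torch.ones(2, 12)                    # 1 = real token, 0 = padding

# Mean pooling: every non-padding token contributes equally.
mask_f = mask.unsqueeze(-1)
mean_pooled = (hidden_states * mask_f).sum(1) / mask_f.sum(1).clamp(min=1e-9)

# Attention pooling: a learned scorer decides how much each token contributes.
scores = torch.randn(2, 12)                 # stand-in for learned per-token scores
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)
attn_pooled = (weights.unsqueeze(-1) * hidden_states).sum(1)   # (batch, hidden)
```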
Attention Pooling vs. CLS Token
CLS pooling uses only the single special [CLS] token as the representation; attention pooling aggregates information from all tokens, weighted by learned relevance.
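A brief sketch of this contrast, again with random tensors standing in for encoder outputs and learned weights:

```python
import torch

hidden_states = torch.randn(2, 12, 768)                  # (batch, seq_len, hidden)
weights = torch.softmax(torch.randn(2, 12), dim=-1)      # stand-in for learned weights

# CLS pooling: keep only the first (special) token's vector.
cls_embedding = hidden_states[:, 0]                       # (batch, hidden)

# Attention pooling: weighted sum over every token in the sequence.
attn_embedding = (weights.unsqueeze(-1) * hidden_states).sum(1)   # (batch, hidden)
```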