Attention Sink
A phenomenon in LLMs where the first token (typically the BOS token) receives disproportionately high attention, even when it is semantically irrelevant.
Attention sinks "park" excess attention on the first token – StreamingLLM uses this for unlimited context at constant memory.
Explanation
Softmax forces attention weights to sum to 1. When a query has nothing relevant to attend to, the model "parks" the excess attention on the first token, which acts as a sink. StreamingLLM exploits this by always keeping the first few (sink) tokens in the KV cache alongside a sliding window of recent tokens, enabling streaming over effectively unlimited contexts at constant memory.
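A minimal NumPy sketch of the mechanism, with made-up attention logits (the values are illustrative, not taken from any real model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical logits for one query that finds nothing relevant to attend to.
# Index 0 is the first (sink) token; trained models learn to score it higher
# precisely so it can absorb the leftover probability mass.
logits = np.array([2.5, 0.1, 0.0, 0.2, 0.1])
weights = softmax(logits)
print(weights.round(3))  # ~[0.733 0.067 0.060 0.074 0.067] -> token 0 soaks up most of the mass
print(weights.sum())     # always 1.0: softmax has to put the attention somewhere
```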
Marketing Relevance
Understanding attention sinks enables efficient streaming inference: effectively unlimited context at constant memory, without retraining the model.
Common Pitfalls
Not all models have equally strong attention sinks. Evicting the first (sink) tokens from the KV cache can dramatically degrade output quality, even though those tokens look semantically irrelevant.
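One way to gauge how strong a given model's sink is: measure how much attention later queries route to position 0. A rough sketch using the Hugging Face transformers API is below; "gpt2" is just a placeholder model, and what counts as a "strong" sink is a judgment call.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; any decoder-only model works the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

text = "Attention sinks park excess attention on the first token."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shaped (batch, heads, query, key).
# Average the attention that all later queries pay to key position 0 in the last layer.
last_layer = out.attentions[-1][0]             # (heads, query, key)
sink_mass = last_layer[:, 1:, 0].mean().item()
print(f"mean attention on token 0 (last layer): {sink_mass:.2f}")
```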
Origin & History
Xiao et al. (MIT, 2023) identified attention sinks and developed StreamingLLM. The key insight: just 4 sink tokens plus a sliding window suffice for stable inference over millions of tokens.
Comparisons & Differences
Attention Sink vs. Sliding Window Attention
Sliding window attention (SWA) limits attention to the most recent tokens; Attention Sink + SWA (StreamingLLM) additionally keeps the first few (sink) tokens in the KV cache, which keeps generation stable where a plain window degrades once those tokens are evicted.
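A toy sketch of the difference in cache-retention policy (function names are hypothetical; real implementations evict cached key/value tensors rather than position lists):

```python
def sliding_window_keep(seq_len, window):
    # Pure SWA: only the most recent `window` positions stay in the KV cache.
    return list(range(max(0, seq_len - window), seq_len))

def streaming_llm_keep(seq_len, window, num_sinks=4):
    # StreamingLLM-style cache: the first `num_sinks` positions are never evicted,
    # plus the recent window. Cache size stays constant at num_sinks + window.
    sinks = list(range(min(num_sinks, seq_len)))
    recent = list(range(max(num_sinks, seq_len - window), seq_len))
    return sinks + recent

print(sliding_window_keep(1000, 8))  # [992, ..., 999] -- the sink tokens are gone
print(streaming_llm_keep(1000, 8))   # [0, 1, 2, 3, 992, ..., 999] -- sinks retained
```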