
    Attention Sink

    Also known as:
    Sink Token
    BOS Attention Concentration
    Updated: 2/11/2026

    A phenomenon in LLMs where the first token (typically the BOS token) receives disproportionately high attention even when it is semantically irrelevant to the current prediction.

    Quick Summary

    Attention sinks "park" excess attention on the first token; StreamingLLM exploits this to stream over effectively unlimited context at constant memory.

    Explanation

    Softmax forces each token's attention weights to sum to 1. When a token has nothing relevant to attend to, it "parks" the excess attention on the first token, the sink. StreamingLLM exploits this by keeping the sink (BOS) tokens in the KV cache alongside a window of recent tokens, enabling streaming over effectively unlimited context.
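
    To make the softmax constraint concrete, here is a minimal NumPy sketch (the function name, toy dimensions, and random data are illustrative, not from the source). Each query must spend its full attention budget, so even when no cached key is a good match, the mass has to land somewhere; in trained LLMs that surplus empirically piles onto the first token.

    import numpy as np

    def attention_weights(query, keys):
        """Single-head attention weights: softmax over scaled dot products.
        Softmax makes the weights non-negative and sum to exactly 1, so a
        query can never withhold its attention budget."""
        scores = keys @ query / np.sqrt(keys.shape[-1])
        scores -= scores.max()              # numerical stability
        w = np.exp(scores)
        return w / w.sum()

    rng = np.random.default_rng(0)
    d = 64
    keys = rng.normal(size=(8, d))          # 8 cached tokens; position 0 is the would-be sink
    query = rng.normal(size=d)              # a query with no strong match to any key

    w = attention_weights(query, keys)
    print(w.sum())                          # ~1.0: the budget is always fully spent
    print(w)                                # near-uniform in this toy; trained LLMs instead
                                            # dump the surplus onto position 0

    Observing the sink itself requires a trained model, for example by averaging per-head attention toward position 0 across many queries; the toy above only shows why excess attention cannot simply be discarded.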

    Marketing Relevance

    Attention sinks are what make streaming inference practical: retaining a handful of sink tokens gives effectively unbounded context at constant memory, which matters for long-running conversational and content-generation workloads.

    Common Pitfalls

    Not all models exhibit equally strong attention sinks. Evicting the BOS (sink) token from the KV cache can dramatically degrade generation quality.

    Origin & History

    Xiao et al. (MIT, 2023) identified attention sinks and developed StreamingLLM. The key insight: as few as four sink tokens plus a sliding window of recent tokens suffice for stable inference over millions of tokens.

    Comparisons & Differences

    Attention Sink vs. Sliding Window Attention

    SWA restricts attention to a fixed window of recent tokens; StreamingLLM combines SWA with retained sink (BOS) tokens to keep generation stable as the window slides, as sketched below.
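
    A minimal sketch of the two retention policies being compared (function names and the sink/window sizes are illustrative; the StreamingLLM recipe keeps roughly four initial sink tokens plus a recent window):

    def sliding_window_cache(positions, window=4):
        """Plain SWA-style retention: keep only the most recent `window` KV entries."""
        positions = list(positions)
        return positions[-window:]

    def streaming_llm_cache(positions, sinks=4, window=4):
        """StreamingLLM-style retention (sketch): always keep the first `sinks`
        entries (the attention sinks, including BOS) plus the recent window."""
        positions = list(positions)
        return positions[:sinks] + positions[max(sinks, len(positions) - window):]

    stream = range(12)                           # token positions 0..11 of an ongoing generation
    print(sliding_window_cache(stream))          # [8, 9, 10, 11]  -- sink evicted, quality degrades
    print(streaming_llm_cache(stream))           # [0, 1, 2, 3, 8, 9, 10, 11]  -- sinks retained

    Both caches stay constant-size as the stream grows; the only difference is that the second never evicts the sink tokens, which is what keeps long-running generation stable.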

