Attention Sink
A phenomenon in LLMs where the first token (typically the BOS token) receives disproportionately high attention, even when it is semantically irrelevant.
Attention sinks "park" excess attention on the first token – StreamingLLM uses this for unlimited context at constant memory.
Explanation
Softmax forces attention weights to sum to 1. When a query has nothing relevant to attend to, the model "parks" the excess attention on the first token, which acts as a sink. StreamingLLM exploits this by always keeping the first few (sink) tokens in the KV cache alongside a sliding window of recent tokens, enabling streaming over effectively unlimited contexts at constant memory.
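A minimal NumPy sketch of the mechanism, with made-up attention logits (the values are illustrative, not taken from any real model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical logits for one query that finds nothing relevant to attend to.
# Index 0 is the first (sink) token; trained models learn to score it higher
# precisely so it can absorb the leftover probability mass.
logits = np.array([2.5, 0.1, 0.0, 0.2, 0.1])
weights = softmax(logits)
print(weights.round(3))  # ~[0.733 0.067 0.060 0.074 0.067] -> token 0 soaks up most of the mass
print(weights.sum())     # always 1.0: softmax has to put the attention somewhere
```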
Marketing Relevance
Understanding attention sinks enables efficient streaming inference: effectively unlimited context at constant memory, without retraining the model.
Common Pitfalls
Not all models have equally strong attention sinks. Evicting the first (sink) tokens from the KV cache can dramatically degrade output quality, even though those tokens look semantically irrelevant.
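One way to gauge how strong a given model's sink is: measure how much attention later queries route to position 0. A rough sketch using the Hugging Face transformers API is below; "gpt2" is just a placeholder model, and what counts as a "strong" sink is a judgment call.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; any decoder-only model works the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

text = "Attention sinks park excess attention on the first token."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shaped (batch, heads, query, key).
# Average the attention that all later queries pay to key position 0 in the last layer.
last_layer = out.attentions[-1][0]             # (heads, query, key)
sink_mass = last_layer[:, 1:, 0].mean().item()
print(f"mean attention on token 0 (last layer): {sink_mass:.2f}")
```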
Origin & History
Xiao et al. (MIT, 2023) identified attention sinks and developed StreamingLLM. The key insight: just 4 sink tokens plus a sliding window suffice for stable inference over millions of tokens.
Comparisons & Differences
Attention Sink vs. Sliding Window Attention
Sliding window attention (SWA) limits attention to the most recent tokens; Attention Sink + SWA (StreamingLLM) additionally keeps the first few (sink) tokens in the KV cache, which keeps generation stable where a plain window degrades once those tokens are evicted.
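A toy sketch of the difference in cache-retention policy (function names are hypothetical; real implementations evict cached key/value tensors rather than position lists):

```python
def sliding_window_keep(seq_len, window):
    # Pure SWA: only the most recent `window` positions stay in the KV cache.
    return list(range(max(0, seq_len - window), seq_len))

def streaming_llm_keep(seq_len, window, num_sinks=4):
    # StreamingLLM-style cache: the first `num_sinks` positions are never evicted,
    # plus the recent window. Cache size stays constant at num_sinks + window.
    sinks = list(range(min(num_sinks, seq_len)))
    recent = list(range(max(num_sinks, seq_len - window), seq_len))
    return sinks + recent

print(sliding_window_keep(1000, 8))  # [992, ..., 999] -- the sink tokens are gone
print(streaming_llm_keep(1000, 8))   # [0, 1, 2, 3, 992, ..., 999] -- sinks retained
```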