GQA (Grouped-Query Attention)
An attention variant in which groups of Query heads share Key-Value heads, reducing KV-Cache size and memory consumption.
GQA shares KV heads across groups of Query heads, yielding a drastically smaller KV-Cache with minimal quality loss.
Explanation
Standard Multi-Head Attention (MHA): each head has its own Q, K, and V. Multi-Query Attention (MQA): all heads share a single K and V. GQA is the compromise: groups of Query heads each share one KV head. Example: 32 Query heads with 8 KV heads (groups of 4) reduce the KV-Cache 4x with minimal quality loss.
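The grouping above can be sketched in a few lines. This is a minimal, illustrative NumPy implementation (not any particular library's API): each query head looks up its shared KV head via integer division by the group size.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch: each group of query heads shares one KV head.

    q:    (n_q_heads, seq_len, d)   e.g. 32 query heads
    k, v: (n_kv_heads, seq_len, d)  e.g. 8 shared KV heads
    n_q_heads must be a multiple of n_kv_heads.
    """
    n_q, seq, d = q.shape
    n_kv = k.shape[0]
    group = n_q // n_kv                        # query heads per KV head, e.g. 4
    out = np.empty_like(q)
    for h in range(n_q):
        kv = h // group                        # KV head serving this query head
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (seq, seq) scaled dot products
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = scores / scores.sum(axis=-1, keepdims=True)  # row-wise softmax
        out[h] = weights @ v[kv]
    return out

# 32 query heads share 8 KV heads (groups of 4), as in the example above
rng = np.random.default_rng(0)
q = rng.standard_normal((32, 16, 64))
k = rng.standard_normal((8, 16, 64))
v = rng.standard_normal((8, 16, 64))
print(grouped_query_attention(q, k, v).shape)  # (32, 16, 64)
```

Note that only K and V shrink from 32 heads to 8; the output keeps one slice per query head, which is why quality degrades far less than the 4x memory saving might suggest.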
Marketing Relevance
GQA is standard in Llama 2/3, Mistral, Gemma. Enables longer contexts and larger batch sizes on the same GPU.
Example
Llama 2 70B uses GQA with 8 KV heads for its 64 Query heads, shrinking the KV-Cache ~8x compared with standard MHA (64 KV heads). That headroom is part of what lets successors such as Llama 3.1 serve contexts up to 128K.
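The saving is easy to verify with back-of-the-envelope arithmetic. A rough sketch, assuming Llama 2 70B-style dimensions (80 layers, head dimension 128, fp16 weights at 2 bytes per element):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-Cache size: a K and a V tensor per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama 2 70B-style dimensions: 80 layers, head_dim 128, fp16
seq = 4096
mha = kv_cache_bytes(80, 64, 128, seq)  # MHA: one KV head per query head
gqa = kv_cache_bytes(80, 8, 128, seq)   # GQA: 8 shared KV heads
print(mha / 2**30, gqa / 2**30, mha // gqa)  # → 10.0 1.25 8
```

At 4K tokens the cache drops from 10 GiB to 1.25 GiB per sequence, which is exactly the memory that GQA frees for longer contexts or larger batches.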
Common Pitfalls
Too few KV heads can reduce quality. Optimal Query:KV ratio varies by model size.
Origin & History
GQA was introduced in 2023 by Ainslie et al. (Google) as a compromise between MHA and MQA. It was quickly adopted by Llama 2, Mistral, and other open-source models.
Comparisons & Differences
GQA (Grouped-Query Attention) vs. Multi-Head Attention
MHA has separate KV per head; GQA shares KV between groups, saving memory.
GQA (Grouped-Query Attention) vs. Multi-Query Attention
MQA shares one KV for all heads (more aggressive); GQA shares per group (better quality-memory tradeoff).
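Both comparisons reduce to a single sharing rule: query head h uses KV head h // group_size. MHA and MQA are then just the two extremes of the KV head count. A small illustrative sketch (hypothetical helper, not a library function):

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Which KV head serves a given query head under grouped sharing."""
    return q_head // (n_q_heads // n_kv_heads)

n_q = 8
print([kv_head_for(h, n_q, 8) for h in range(n_q)])  # MHA: [0,1,2,3,4,5,6,7] - no sharing
print([kv_head_for(h, n_q, 1) for h in range(n_q)])  # MQA: [0,0,0,0,0,0,0,0] - one KV head for all
print([kv_head_for(h, n_q, 4) for h in range(n_q)])  # GQA: [0,0,1,1,2,2,3,3] - groups of 2
```

Sliding the KV head count between 1 and n_q_heads is exactly the quality-memory dial the comparison above describes.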