GQA (Grouped-Query Attention)
An attention variant in which groups of Query heads share Key-Value heads, reducing KV-Cache size and memory consumption.
GQA shares KV heads across groups of Query heads, yielding a drastically smaller KV-Cache with minimal quality loss.
Explanation
Standard Multi-Head Attention (MHA): each head has its own Q, K, and V. Multi-Query Attention (MQA): all heads share a single K and V. GQA is the compromise: groups of Query heads each share one KV head. Example: 32 Query heads with 8 KV heads (groups of 4) reduce the KV-Cache 4x with minimal quality loss.
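The grouping above can be sketched in a few lines. This is a minimal, illustrative NumPy implementation (not any particular library's API): each query head looks up its shared KV head via integer division by the group size.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch: each group of query heads shares one KV head.

    q:    (n_q_heads, seq_len, d)   e.g. 32 query heads
    k, v: (n_kv_heads, seq_len, d)  e.g. 8 shared KV heads
    n_q_heads must be a multiple of n_kv_heads.
    """
    n_q, seq, d = q.shape
    n_kv = k.shape[0]
    group = n_q // n_kv                        # query heads per KV head, e.g. 4
    out = np.empty_like(q)
    for h in range(n_q):
        kv = h // group                        # KV head serving this query head
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (seq, seq) scaled dot products
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = scores / scores.sum(axis=-1, keepdims=True)  # row-wise softmax
        out[h] = weights @ v[kv]
    return out

# 32 query heads share 8 KV heads (groups of 4), as in the example above
rng = np.random.default_rng(0)
q = rng.standard_normal((32, 16, 64))
k = rng.standard_normal((8, 16, 64))
v = rng.standard_normal((8, 16, 64))
print(grouped_query_attention(q, k, v).shape)  # (32, 16, 64)
```

Note that only K and V shrink from 32 heads to 8; the output keeps one slice per query head, which is why quality degrades far less than the 4x memory saving might suggest.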
Marketing Relevance
GQA is standard in Llama 2/3, Mistral, Gemma. Enables longer contexts and larger batch sizes on the same GPU.
Example
Llama 2 70B uses GQA with 8 KV heads for its 64 Query heads, shrinking the KV-Cache ~8x compared with standard MHA (64 KV heads). That headroom is part of what lets successors such as Llama 3.1 serve contexts up to 128K.
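The saving is easy to verify with back-of-the-envelope arithmetic. A rough sketch, assuming Llama 2 70B-style dimensions (80 layers, head dimension 128, fp16 weights at 2 bytes per element):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-Cache size: a K and a V tensor per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama 2 70B-style dimensions: 80 layers, head_dim 128, fp16
seq = 4096
mha = kv_cache_bytes(80, 64, 128, seq)  # MHA: one KV head per query head
gqa = kv_cache_bytes(80, 8, 128, seq)   # GQA: 8 shared KV heads
print(mha / 2**30, gqa / 2**30, mha // gqa)  # → 10.0 1.25 8
```

At 4K tokens the cache drops from 10 GiB to 1.25 GiB per sequence, which is exactly the memory that GQA frees for longer contexts or larger batches.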
Common Pitfalls
Too few KV heads can reduce quality. Optimal Query:KV ratio varies by model size.
Origin & History
GQA was introduced in 2023 by Ainslie et al. (Google) as a compromise between MHA and MQA. It was quickly adopted by Llama 2, Mistral, and other open-source models.
Comparisons & Differences
GQA (Grouped-Query Attention) vs. Multi-Head Attention
MHA has separate KV per head; GQA shares KV between groups, saving memory.
GQA (Grouped-Query Attention) vs. Multi-Query Attention
MQA shares one KV for all heads (more aggressive); GQA shares per group (better quality-memory tradeoff).
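Both comparisons reduce to a single sharing rule: query head h uses KV head h // group_size. MHA and MQA are then just the two extremes of the KV head count. A small illustrative sketch (hypothetical helper, not a library function):

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Which KV head serves a given query head under grouped sharing."""
    return q_head // (n_q_heads // n_kv_heads)

n_q = 8
print([kv_head_for(h, n_q, 8) for h in range(n_q)])  # MHA: [0,1,2,3,4,5,6,7] - no sharing
print([kv_head_for(h, n_q, 1) for h in range(n_q)])  # MQA: [0,0,0,0,0,0,0,0] - one KV head for all
print([kv_head_for(h, n_q, 4) for h in range(n_q)])  # GQA: [0,0,1,1,2,2,3,3] - groups of 2
```

Sliding the KV head count between 1 and n_q_heads is exactly the quality-memory dial the comparison above describes.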