
    GQA (Grouped-Query Attention)

    Also known as:
    Grouped Query Attention
    GQA
    Updated: 2/9/2026

An attention variant in which multiple Query heads share a single Key-Value head, shrinking the KV-Cache and memory consumption.

    Quick Summary

GQA shares KV heads among groups of Query heads – a drastically smaller KV-Cache with minimal quality loss.

    Explanation

Standard Multi-Head Attention (MHA): each head has its own Q, K, and V.
    Multi-Query Attention (MQA): all heads share a single K and V.
    GQA is the compromise: groups of Query heads share one K/V head. Example: 32 Query heads with 8 KV heads (groups of 4) reduce the KV-Cache by 4x with minimal quality loss.
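The grouping above can be sketched in a few lines. This is a minimal, illustrative implementation (not any library's actual kernel): each block of Query heads simply indexes into the same shared K/V head.

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    """Q: (n_q_heads, seq, d); K, V: (n_kv_heads, seq, d).
    n_q_heads must be a multiple of n_kv_heads."""
    n_q, seq, d = Q.shape
    n_kv = K.shape[0]
    group = n_q // n_kv  # Query heads per shared KV head
    out = np.empty_like(Q)
    for h in range(n_q):
        kv = h // group  # every group of Query heads reads the same K/V head
        scores = Q[h] @ K[kv].T / np.sqrt(d)
        # numerically stable softmax over the key dimension
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[h] = weights @ V[kv]
    return out

# 32 Query heads sharing 8 KV heads (groups of 4), as in the example above
Q = np.random.randn(32, 16, 64)
K = np.random.randn(8, 16, 64)
V = np.random.randn(8, 16, 64)
print(grouped_query_attention(Q, K, V).shape)  # (32, 16, 64)
```

With n_kv_heads = n_q_heads this reduces to standard MHA; with n_kv_heads = 1 it becomes MQA.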

    Marketing Relevance

    GQA is standard in Llama 2/3, Mistral, Gemma. Enables longer contexts and larger batch sizes on the same GPU.

    Example

Llama 2 70B uses GQA with 8 KV heads instead of one KV head per Query head (64), cutting the KV-Cache roughly 8x compared to standard Multi-Head Attention and freeing memory for longer contexts and larger batch sizes.
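The savings are easy to verify with back-of-the-envelope arithmetic. The sketch below assumes Llama 2 70B's published shape (80 layers, head dimension 128, 64 Query heads, 8 KV heads) and FP16 storage; the MHA figure is the hypothetical cache size if every Query head kept its own K/V.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_param=2):
    # K and V each store one head_dim vector per layer, KV head, and token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_param

# Llama 2 70B shape: 80 layers, head_dim 128, FP16, 4096-token sequence
mha = kv_cache_bytes(80, 64, 128, 4096)  # hypothetical MHA variant (64 KV heads)
gqa = kv_cache_bytes(80, 8, 128, 4096)   # actual GQA configuration (8 KV heads)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.2f} GiB, ratio {mha // gqa}x")
```

Per sequence, the cache drops from 10 GiB to 1.25 GiB – memory that can instead hold more tokens or more concurrent requests.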

    Common Pitfalls

Too few KV heads can reduce quality; the optimal ratio of Query heads to KV heads varies with model size.

    Origin & History

GQA was introduced in 2023 by Ainslie et al. (Google) as a compromise between MHA and MQA. It was quickly adopted by Llama 2, Mistral, and other open-source models.

    Comparisons & Differences

    GQA (Grouped-Query Attention) vs. Multi-Head Attention

MHA keeps a separate K and V per head; GQA shares K/V within groups of Query heads, saving memory.

    GQA (Grouped-Query Attention) vs. Multi-Query Attention

    MQA shares one KV for all heads (more aggressive); GQA shares per group (better quality-memory tradeoff).
