
    Multi-Head Attention (MHA)

    Also known as:
    MHA
    Multi-Headed Attention
    Parallel Attention Heads
    Updated: 2/10/2026

    Multi-Head Attention runs multiple attention computations in parallel with different learned projections and combines the results.

    Quick Summary

Multi-Head Attention splits attention into parallel heads that each learn different aspects of the input – the core module of every Transformer from GPT to Gemini.

    Explanation

Instead of a single attention computation, queries, keys, and values are projected into h separate heads (e.g., 96 in GPT-3 175B), each with its own learned projection. Heads can specialize in different aspects: one might track syntax, another semantics, another co-reference. The head outputs are concatenated and passed through a final linear projection.
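The mechanism above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation; the dimensions and weight initialization are arbitrary example values, and masking, dropout, and batching are omitted.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); h: number of heads."""
    seq, d_model = x.shape
    d_head = d_model // h  # head dimension must divide the model dimension

    def split(t):  # (seq, d_model) -> (h, seq, d_head)
        return t.reshape(seq, h, d_head).transpose(1, 0, 2)

    # one learned projection per role, then split into h parallel heads
    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)

    # scaled dot-product attention, computed independently per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # softmax per row
    heads = weights @ V                                   # (h, seq, d_head)

    # concatenate the heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
seq, d_model, h = 4, 16, 4  # illustrative toy sizes
x = rng.standard_normal((seq, d_model))
W = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(x, *W, h=h)
print(out.shape)  # output keeps the input shape: (4, 16)
```

Note that the split-and-transpose trick means all h heads are computed in one batched matrix multiply, which is why adding heads at a fixed model dimension costs little extra compute.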

    Marketing Relevance

    Multi-Head Attention is the core module of every Transformer – GPT, BERT, T5, LLaMA, Gemini – all use it.

    Common Pitfalls

More heads mean a larger KV cache at inference time. The head dimension must divide the model dimension evenly. Not all heads are equally important, so head pruning is often possible.
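The KV-cache cost is easy to estimate. The following back-of-the-envelope calculation uses illustrative configuration values, not the numbers of any particular model:

```python
# Assumed example configuration (not a specific model):
n_layers, n_heads, d_head = 32, 32, 128
seq_len, bytes_per_value = 4096, 2  # fp16

# Each layer caches one K and one V vector per head per token.
kv_bytes = n_layers * n_heads * d_head * 2 * bytes_per_value * seq_len
print(f"{kv_bytes / 2**20:.0f} MiB per sequence")  # 2048 MiB here
```

Because `n_heads * d_head` appears as a factor, every extra K/V head grows the cache linearly – the motivation for variants like Grouped-Query Attention discussed below.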

    Origin & History

    Introduced in the "Attention Is All You Need" paper (Vaswani et al., 2017) as the core of the Transformer. The idea: instead of one broad attention, multiple focused "perspectives." Standard in every Transformer model since.

    Comparisons & Differences

    Multi-Head Attention (MHA) vs. Single-Head Attention

Single-head attention offers only one perspective; Multi-Head Attention learns different relationship patterns in parallel and is significantly more expressive.

    Multi-Head Attention (MHA) vs. Grouped-Query Attention (GQA)

MHA gives each head its own K/V projections (full expressiveness, large KV cache); GQA lets groups of query heads share K/V heads, shrinking the cache at nearly the same quality.
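The cache saving follows directly from the head counts. A small sketch, using assumed example values rather than any specific model's configuration:

```python
# Assumed example head counts (illustrative only):
n_heads = 32        # query heads, same in both schemes
n_kv_groups = 8     # GQA: every 4 query heads share one K/V head

kv_heads_mha = n_heads      # MHA: one K/V pair per query head
kv_heads_gqa = n_kv_groups  # GQA: one K/V pair per group

print(kv_heads_mha / kv_heads_gqa)  # 4.0 -> KV cache shrinks 4x here
```

Query-side expressiveness is unchanged in GQA – only the cached K/V tensors are shared, which is why quality stays close to full MHA.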

