Multi-Head Attention (MHA)
Multi-Head Attention runs multiple attention computations in parallel with different learned projections and combines the results.
Multi-Head Attention splits attention into parallel heads that each learn different aspects of the input – the core module of every Transformer from GPT to Gemini.
Explanation
Instead of a single attention computation, queries, keys, and values are projected into h separate heads (e.g., 96 in GPT-3 175B). Each head can attend to different aspects of the input: in practice, individual heads have been observed to track syntax, semantic relations, or co-reference. The head outputs are concatenated and passed through a final linear projection.
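A minimal sketch of this flow in PyTorch (module and parameter names such as MultiHeadAttention, w_q, and n_heads are illustrative, not taken from any particular library):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "number of heads must divide model dimension"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One learned projection each for queries, keys, values, plus the output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape  # (batch, seq_len, d_model)
        # Project, then split the model dimension into n_heads smaller heads.
        q = self.w_q(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = F.softmax(scores, dim=-1) @ v          # (batch, heads, seq, d_head)
        # Concatenate the heads and apply the final linear projection.
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.w_o(out)

mha = MultiHeadAttention(d_model=512, n_heads=8)
x = torch.randn(2, 10, 512)
print(mha(x).shape)  # torch.Size([2, 10, 512])
```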
Marketing Relevance
Multi-Head Attention is the core module of every Transformer: GPT, BERT, T5, LLaMA, and Gemini all use it.
Common Pitfalls
More heads mean a larger KV cache at inference time (see the rough estimate below). The number of heads must evenly divide the model dimension, since d_head = d_model / h. Not all heads are equally important; many can be pruned with little quality loss.
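A back-of-the-envelope estimate of the first pitfall, assuming a hypothetical 32-layer model with 32 heads of dimension 128, an 8k-token context, and fp16 KV entries:

```python
# Rough KV-cache size for plain MHA, where every head stores its own keys and values.
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value  # 2 = keys + values

# Hypothetical model: 32 layers, 32 KV heads x 128 dims, 8k context, fp16 (2 bytes).
print(kv_cache_bytes(32, 32, 128, 8192) / 2**30, "GiB")  # 4.0 GiB per sequence
```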
Origin & History
Introduced in the "Attention Is All You Need" paper (Vaswani et al., 2017) as the core of the Transformer. The idea: instead of one broad attention computation, multiple focused "perspectives." It has been standard in every Transformer model since.
Comparisons & Differences
Multi-Head Attention (MHA) vs. Single-Head Attention
Single-head has one perspective; Multi-Head learns different relationship patterns in parallel – significantly more expressive.
Multi-Head Attention (MHA) vs. Grouped-Query Attention (GQA)
MHA: each head has its own K/V projections (full expressiveness, large KV cache); GQA: groups of query heads share one K/V head (much smaller cache, nearly the same quality), as the sketch below shows.
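A shape-level sketch of the difference, with hypothetical sizes (32 query heads, 8 shared K/V heads; real models choose their own group counts):

```python
import torch

batch, seq, d_head = 1, 10, 128
n_q_heads, n_kv_heads = 32, 8            # plain MHA would use n_kv_heads == n_q_heads
q = torch.randn(batch, n_q_heads, seq, d_head)
k = torch.randn(batch, n_kv_heads, seq, d_head)   # only 8 K/V heads are cached in GQA
v = torch.randn(batch, n_kv_heads, seq, d_head)

# Each group of 32 // 8 = 4 query heads attends to the same K/V head.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)    # expand back to 32 heads for the matmul
v = v.repeat_interleave(group, dim=1)
scores = q @ k.transpose(-2, -1) / d_head ** 0.5
out = torch.softmax(scores, dim=-1) @ v
print(out.shape)  # torch.Size([1, 32, 10, 128])
```

The cache savings follow directly: only the 8 shared K/V heads need to be stored, cutting the KV cache to a quarter of the full MHA size in this example.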