
    Linear Attention

    Also known as:
    Kernelized Attention
    Efficient Attention
    Updated: 2/11/2026

    Attention variants that reduce the quadratic O(N²) complexity of standard attention to linear O(N), typically through kernel feature maps or by reordering the computation.

    Quick Summary

    Linear attention reduces attention from O(N²) to O(N) – promising for ultra-long sequences but not yet at softmax parity.

    Explanation

    Standard attention computes softmax(QK^T)V, which is O(N²) in sequence length N. Linear attention replaces the softmax with a feature map φ and computes φ(Q)(φ(K)^T V); because matrix multiplication is associative, φ(K)^T V can be computed first, bringing the cost down to O(N). Variants include Performer (random features), RetNet (retention), and Mamba (state space models).
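    A minimal non-causal sketch in PyTorch, assuming the elu+1 feature map from Katharopoulos et al. (2020); the function and variable names are illustrative, not from any particular library. The key step is building φ(K)^T V once, so cost grows linearly with sequence length N:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, N, d); v: (batch, N, e)
    # Feature map phi(x) = elu(x) + 1 keeps values positive (one common choice).
    phi_q = F.elu(q) + 1
    phi_k = F.elu(k) + 1
    # Associativity: build phi(K)^T V once -> (batch, d, e), O(N·d·e) instead of O(N²·d).
    kv = torch.einsum("bnd,bne->bde", phi_k, v)
    # Normalizer: phi(Q) dotted with the sum of phi(K) over positions.
    z = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + eps
    return torch.einsum("bnd,bde->bne", phi_q, kv) / z.unsqueeze(-1)
```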

    Marketing Relevance

    Linear attention is promising for ultra-long contexts but has not yet matched softmax attention quality in practice.

    Common Pitfalls

    A quality gap to softmax attention remains on many tasks. Kernel approximations can be numerically unstable. Implementations and tooling are less mature than for standard attention.

    Origin & History

    Katharopoulos et al. (2020) formalized linear attention as a kernelized reformulation of softmax attention. Performer (Google, 2020) approximated the softmax kernel with random features. RetNet (Microsoft, 2023) and Mamba (Gu & Dao, 2023) pursue attention-like quality through linear-time recurrence.

    Comparisons & Differences

    Linear Attention vs. Softmax Attention

    Softmax attention is O(N²) but generally higher quality; linear attention is O(N) but comes with a quality tradeoff.
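    For intuition, a rough back-of-the-envelope comparison; the sequence length and head dimension below are hypothetical, chosen only to illustrate the scaling difference:

```python
# Illustrative numbers only (not from any benchmark).
N, d = 100_000, 64

softmax_scores = N * N   # entries in the full N x N attention matrix
linear_state = d * d     # entries in the fixed-size phi(K)^T V state

print(f"softmax attention score entries: {softmax_scores:,}")   # 10,000,000,000
print(f"linear attention state entries:  {linear_state:,}")     # 4,096
```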

    Linear Attention vs. State Space Models (Mamba)

    SSMs achieve O(N) through recurrence rather than by approximating attention, and often deliver better quality than pure linear attention.
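    To make the connection concrete, here is a sketch of causal linear attention written as a recurrence over a fixed-size state, the same structural idea SSMs exploit; the elu+1 feature map and all names here are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, N, d); v: (batch, N, e)
    phi = lambda x: F.elu(x) + 1
    B, N, d = q.shape
    e = v.shape[-1]
    S = torch.zeros(B, d, e, dtype=v.dtype)  # running sum of phi(k_n) v_n^T
    z = torch.zeros(B, d, dtype=v.dtype)     # running sum of phi(k_n), for normalization
    out = torch.empty(B, N, e, dtype=v.dtype)
    for n in range(N):
        qn, kn, vn = phi(q[:, n]), phi(k[:, n]), v[:, n]
        S = S + torch.einsum("bd,be->bde", kn, vn)   # constant-size state update
        z = z + kn
        denom = torch.einsum("bd,bd->b", qn, z).clamp_min(eps)
        out[:, n] = torch.einsum("bd,bde->be", qn, S) / denom.unsqueeze(-1)
    return out
```

    As in an SSM, per-token work does not depend on how many tokens came before: everything needed from the past is summarized in the fixed-size state S and normalizer z.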

