Linear Attention
Attention variants that reduce the quadratic O(N²) complexity to linear O(N) through kernel approximation or alternative computation order.
Linear attention reduces attention from O(N²) to O(N) – promising for ultra-long sequences but not yet at softmax parity.
Explanation
Standard attention computes softmax(QK^T)V, which is O(N²) in sequence length. Linear attention replaces the softmax with a feature map φ and reorders the computation as φ(Q)(φ(K)^T V): because matrix multiplication is associative, φ(K)^T V can be computed first as a small d×d matrix, bringing the overall cost down to O(N). Variants include Performer (random feature maps), RetNet (retention), and Mamba (state space models).
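To make the reordering concrete, here is a minimal sketch of non-causal linear attention in PyTorch, assuming the elu(x) + 1 feature map used by Katharopoulos et al. (2020); the function names and shapes are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # phi(x) = elu(x) + 1 keeps the features positive (Katharopoulos et al., 2020)
    return F.elu(x) + 1.0

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, seq_len, dim). Returns (batch, seq_len, dim)."""
    q, k = feature_map(q), feature_map(k)
    # Reordered computation: phi(K)^T V is a (dim x dim) matrix, so the
    # total cost is O(N * d^2) instead of the O(N^2 * d) of softmax attention.
    kv = torch.einsum("bnd,bne->bde", k, v)                          # phi(K)^T V
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)    # row-wise normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

# Quick shape check on random inputs
q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```

Note that the attention matrix is never materialized; only the d×d summary φ(K)^T V and a per-row normalizer are kept, which is where the linear scaling comes from.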
Marketing Relevance
Linear attention is promising for ultra-long contexts but has not yet matched softmax attention quality in practice.
Common Pitfalls
A quality gap to softmax attention remains on many tasks. Kernel approximations can be numerically unstable. Implementations and tooling are less mature than for standard attention.
Origin & History
Katharopoulos et al. (2020) formalized linear attention. Performer (Google, 2020) used random feature maps. RetNet (Microsoft, 2023) and Mamba (Gu & Dao, 2023) paired linear-time recurrence with quality approaching standard attention.
Comparisons & Differences
Linear Attention vs. Softmax Attention
Softmax attention is O(N²) but generally higher quality; linear attention is O(N) at the cost of some quality.
Linear Attention vs. State Space Models (Mamba)
SSMs achieve O(N) through recurrence rather than by approximating attention, and often deliver better quality than pure linear attention.
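The recurrence connection can be made explicit: causal linear attention can be rewritten as a running state update, which is the same O(N) structure that retention and SSMs build on. The sketch below is illustrative; the decay parameter loosely stands in for RetNet-style exponential forgetting and is not the actual RetNet or Mamba formulation.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    return F.elu(x) + 1.0

def recurrent_linear_attention(q, k, v, decay=1.0, eps=1e-6):
    """q, k, v: (seq_len, dim). decay=1.0 gives plain causal linear attention;
    decay < 1.0 adds exponential forgetting, loosely in the spirit of retention."""
    n, d = q.shape
    q, k = feature_map(q), feature_map(k)
    s = torch.zeros(d, v.shape[-1])   # running sum of outer(phi(k_t), v_t)
    z = torch.zeros(d)                # running sum of phi(k_t) for normalization
    outputs = []
    for t in range(n):
        s = decay * s + torch.outer(k[t], v[t])   # O(d^2) per step
        z = decay * z + k[t]
        outputs.append((q[t] @ s) / (q[t] @ z + eps))
    return torch.stack(outputs)

out = recurrent_linear_attention(torch.randn(128, 64), torch.randn(128, 64), torch.randn(128, 64))
print(out.shape)  # torch.Size([128, 64])
```

Because each step only updates a fixed-size state, generation needs O(1) memory per token, which is the practical appeal shared by linear attention, retention, and SSMs.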