SwiGLU
A gated activation function for Transformer FFN blocks that combines Swish gating with a linear projection; it is the standard choice in LLaMA, Mistral, and most modern LLMs, giving better quality at the same parameter count.
Explanation
SwiGLU(x) = Swish(xW₁) ⊙ (xW₂), where ⊙ denotes element-wise multiplication: one projection is passed through Swish and acts as a gate on the other. A SwiGLU FFN block therefore needs three projection matrices (gate, value, and down-projection) instead of the two used by a GELU FFN, but delivers better quality at the same parameter count.
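A minimal PyTorch sketch may make the three projections concrete; the class and variable names (SwiGLUFFN, d_hidden, w1/w2/w3) are illustrative and not taken from any particular codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block: Swish(x W1) ⊙ (x W2), projected back with W3."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection (W1)
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)  # value projection (W2)
        self.w3 = nn.Linear(d_hidden, d_model, bias=False)  # down-projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1 (SiLU), the variant used in SwiGLU.
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

ffn = SwiGLUFFN(d_model=512, d_hidden=1408)
y = ffn(torch.randn(2, 16, 512))  # (batch, sequence, d_model) -> same shape
```

Note that the third matrix is the usual FFN down-projection; the gating itself only involves W₁ and W₂.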
Marketing Relevance
SwiGLU is the standard activation function in LLaMA, Mistral, and most other modern open-source LLMs.
Common Pitfalls
Slightly higher memory cost, since the FFN block holds three projection matrices instead of two. To keep the parameter budget constant, the inner (hidden) dimension is therefore typically set to about 2/3 of the standard FFN width.
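A quick back-of-the-envelope check of the 2/3 rule, assuming a baseline GELU FFN with a hidden width of 4·d_model (real models round the SwiGLU width to hardware-friendly multiples):

```python
d_model = 4096
gelu_hidden = 4 * d_model                    # 16384
gelu_params = 2 * d_model * gelu_hidden      # W_in + W_out ≈ 134.2M

swiglu_hidden = int(2 / 3 * 4 * d_model)     # ≈ 10922
swiglu_params = 3 * d_model * swiglu_hidden  # W1 + W2 + W3 ≈ 134.2M

print(gelu_params, swiglu_params)            # roughly equal parameter budgets
```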
Origin & History
Shazeer (2020) compared several GLU variants for Transformers and found SwiGLU to be among the best-performing options. PaLM (2022) and LLaMA (2023) adopted SwiGLU, making it the de facto standard for modern open-source LLMs.
Comparisons & Differences
SwiGLU vs. GELU
GELU is an ungated pointwise activation; SwiGLU adds a multiplicative gate through an extra projection matrix, improving expressiveness at a matched parameter budget when the hidden dimension is reduced accordingly (see the contrast sketch below).
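For contrast with the SwiGLU sketch above, a conventional GELU FFN (names again illustrative) applies an ungated pointwise activation between only two projections:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GELUFFN(nn.Module):
    """Standard ungated FFN: GELU applied between an up- and a down-projection."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden, bias=False)   # up-projection
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)  # down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Ungated: the activation is applied directly; no second projection acts as a gate.
        return self.w_out(F.gelu(self.w_in(x)))
```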
SwiGLU vs. ReLU
ReLU is the simplest activation; SwiGLU is considerably more complex due to gating, but delivers noticeably better LLM quality.