
    SwiGLU

    Also known as:
    Swish-GLU
    Gated Linear Unit with Swish
    Updated: 2/11/2026

    An activation function for Transformer FFN blocks that combines Swish gating with a linear projection; the standard choice in modern LLMs such as LLaMA.

    Quick Summary

    SwiGLU combines Swish gating with a linear projection. It is the standard activation in LLaMA, Mistral, and other modern LLMs, delivering better quality at the same parameter count.

    Explanation

    SwiGLU(x) = Swish(xW₁) ⊙ (xW₂), where ⊙ is element-wise multiplication: a Swish-activated gate modulates a parallel linear branch. Used inside the Transformer FFN (with an output projection back to the model dimension), it needs three projection matrices instead of the two in a GELU FFN, but gives better quality at the same parameter count.
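    A minimal sketch of the corresponding FFN block, assuming PyTorch (the class name SwiGLUFeedForward and the W₁/W₂/W₃ naming follow the formula above, not any particular library's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """FFN block using SwiGLU: W3( Swish(x W1) ⊙ (x W2) )."""

    def __init__(self, d_model: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, hidden_dim, bias=False)  # gate branch, Swish-activated
        self.w2 = nn.Linear(d_model, hidden_dim, bias=False)  # parallel linear branch
        self.w3 = nn.Linear(hidden_dim, d_model, bias=False)  # output projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu is Swish with beta = 1; the gate multiplies the linear branch element-wise
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

# Example: hidden_dim is set to about 2/3 of the usual 4*d_model
# so the block matches a GELU FFN's parameter count
ffn = SwiGLUFeedForward(d_model=512, hidden_dim=int(8 * 512 / 3))
y = ffn(torch.randn(2, 16, 512))  # (batch, sequence, d_model) in, same shape out
```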

    Marketing Relevance

    SwiGLU is the standard activation function in LLaMA, Mistral, and most other modern open-source LLMs.

    Common Pitfalls

    Higher memory use, since three projection matrices are stored instead of two. To stay within the same parameter budget, the inner (hidden) dimension is typically reduced to about 2/3 of the standard FFN's 4·d hidden size; the arithmetic below shows where this factor comes from.
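    A quick back-of-the-envelope check of that 2/3 factor, assuming bias-free projections and the conventional 4·d hidden size for a GELU/ReLU FFN:

```python
d = 4096                              # model width (illustrative value)
standard_params = 2 * d * (4 * d)     # GELU FFN: two d x 4d projection matrices
# SwiGLU stores three d x h (or h x d) matrices, so matching budgets means
#   3 * d * h = 8 * d**2   =>   h = (8 / 3) * d, i.e. 2/3 of the usual 4d
h = int(8 * d / 3)
swiglu_params = 3 * d * h
print(standard_params, swiglu_params)  # 134217728 vs 134209536 -- nearly equal
```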

    Origin & History

    Shazeer (2020) compared various GLU variants for Transformers and found SwiGLU to perform best. PaLM (2022) and LLaMA (2023) adopted SwiGLU and made it the de facto standard for open-source LLMs.

    Comparisons & Differences

    SwiGLU vs. GELU

    GELU is an ungated, point-wise activation; SwiGLU adds a learned gate via an extra projection matrix, trading those additional weights for better expressiveness.
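    For contrast, a standard ungated GELU FFN needs only two projections (again a PyTorch sketch with illustrative naming, not a specific library's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GELUFeedForward(nn.Module):
    """Classic ungated FFN: W2( GELU(x W1) ) with a 4*d_model hidden size."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, 4 * d_model, bias=False)  # single up-projection
        self.w2 = nn.Linear(4 * d_model, d_model, bias=False)  # down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.gelu(self.w1(x)))
```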

    SwiGLU vs. ReLU

    ReLU is the simplest activation; SwiGLU is considerably more complex because of its gating, but yields noticeably better LLM quality.

