
    GELU (Gaussian Error Linear Unit)

    Also known as:
    Gaussian Error Linear Unit
    GELU Activation
    Updated: 2/11/2026

    A smooth activation function that weights each input x by Φ(x), the standard normal CDF; standard in BERT, GPT-2, and many Transformers.

    Quick Summary

    GELU weights inputs by the standard normal CDF; it is the activation behind BERT and GPT, though the latest LLMs increasingly use SwiGLU.

    Explanation

    GELU(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution. Unlike ReLU, which hard-thresholds at zero, GELU dampens inputs smoothly. In practice it is often computed with a tanh approximation. In modern LLMs it has largely been superseded by SwiGLU.
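
    A minimal Python sketch of both forms, assuming only the standard library; the function names are illustrative, not from any particular framework:

        import math

        def gelu_exact(x: float) -> float:
            # Exact GELU: x * Phi(x), with Phi the standard normal CDF,
            # written via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
            return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

        def gelu_tanh(x: float) -> float:
            # Widely used tanh approximation from Hendrycks & Gimpel (2016).
            return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

        print(gelu_exact(1.0), gelu_tanh(1.0))  # both ≈ 0.8413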

    Marketing Relevance

    GELU was the first activation function to broadly replace ReLU in Transformers, appearing in BERT, GPT-2/3, and many Vision Transformers.

    Common Pitfalls

    GELU is more computationally expensive than ReLU and has been outperformed by SwiGLU in the latest LLMs. Different implementations (exact erf-based, tanh, or sigmoid approximations) can produce slightly different results, as the sketch below illustrates.
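
    A standalone sketch measuring how far the tanh and sigmoid approximations drift from the exact erf-based form on [-5, 5]; the 1.702 constant of the sigmoid variant comes from the original paper, everything else is illustrative:

        import math

        def gelu_exact(x):
            return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

        def gelu_tanh(x):
            return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

        def gelu_sigmoid(x):
            # Sigmoid approximation: x * sigmoid(1.702 * x).
            return x / (1.0 + math.exp(-1.702 * x))

        xs = [i / 10 for i in range(-50, 51)]
        print("max |exact - tanh|   :", max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs))
        print("max |exact - sigmoid|:", max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in xs))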

    Origin & History

    Hendrycks and Gimpel introduced GELU in 2016. BERT (2018) and GPT-2 (2019) made it the standard Transformer activation, and GPT-3 and Vision Transformers adopted it as well. From 2022 onward, GELU was increasingly replaced by SwiGLU.

    Comparisons & Differences

    GELU (Gaussian Error Linear Unit) vs. ReLU

    ReLU is piecewise linear (0 for negative values); GELU is smooth and softly dampens negative values instead of cutting them.
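
    A small numeric comparison (helper names are illustrative) showing that ReLU zeroes every negative input while GELU lets small negative values through:

        import math

        def relu(x):
            return max(0.0, x)

        def gelu(x):
            return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

        for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
            print(f"x = {x:+.1f}   ReLU = {relu(x):+.4f}   GELU = {gelu(x):+.4f}")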

    GELU (Gaussian Error Linear Unit) vs. SwiGLU

    GELU is a simple element-wise activation; SwiGLU combines a Swish-gated linear unit with an additional projection and empirically yields better LLM quality.
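
    A toy sketch of a SwiGLU feed-forward block in the style of Shazeer (2020); the dimensions and random weights are placeholders for illustration, not values from any real model:

        import numpy as np

        def swish(x):
            # Swish / SiLU: x * sigmoid(x); the gate that replaces GELU's Phi(x).
            return x / (1.0 + np.exp(-x))

        def swiglu_ffn(x, W, V, W2):
            # Gated feed-forward: Swish(x @ W) element-wise gates x @ V,
            # followed by a down-projection through W2.
            return (swish(x @ W) * (x @ V)) @ W2

        rng = np.random.default_rng(0)
        d_model, d_ff = 8, 16  # toy sizes
        x = rng.standard_normal((1, d_model))
        W = rng.standard_normal((d_model, d_ff))
        V = rng.standard_normal((d_model, d_ff))
        W2 = rng.standard_normal((d_ff, d_model))
        print(swiglu_ffn(x, W, V, W2).shape)  # (1, 8)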
