GELU (Gaussian Error Linear Unit)
A smooth activation function that weights inputs by their cumulative normal distribution probability – the standard choice in BERT, GPT-2/3, and many Transformers, though it has been superseded by SwiGLU in the latest LLMs.
Explanation
GELU(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution. Unlike ReLU's hard thresholding at zero, GELU dampens inputs smoothly, so small negative values pass through slightly attenuated rather than being zeroed out. In practice it is often computed with a tanh approximation. In modern LLMs it has largely been superseded by SwiGLU.
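A minimal sketch of the exact form and the common tanh approximation in plain Python (the function names are illustrative, not taken from any particular library):

    import math

    def gelu_exact(x: float) -> float:
        # Exact definition: x * Phi(x), with Phi the standard normal CDF
        return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def gelu_tanh(x: float) -> float:
        # Widely used tanh approximation (e.g. in the original BERT code)
        return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  tanh={gelu_tanh(x):+.4f}")

For negative inputs the output is small but non-zero, which is exactly the smooth dampening described above.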
Marketing Relevance
GELU was the first activation function to replace ReLU in Transformer models: it is used in BERT, GPT-2/3, and many Vision Transformers.
Common Pitfalls
GELU is more computationally expensive than ReLU. It is outperformed by SwiGLU in the latest LLMs. Different approximations (tanh vs. sigmoid) produce slightly different results, which can matter when comparing or porting model implementations.
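To make the last point concrete, a small self-contained sketch (the sigmoid variant x · sigmoid(1.702x) is one common approximation) comparing both approximations against the exact form over [-5, 5]:

    import math

    def gelu_exact(x): return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    def gelu_tanh(x): return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
    def gelu_sigmoid(x): return x / (1.0 + math.exp(-1.702 * x))

    xs = [i / 100.0 for i in range(-500, 501)]
    print("max |exact - tanh|   :", max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs))
    print("max |exact - sigmoid|:", max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in xs))

The differences are small but nonzero, so mixing approximations across frameworks can shift outputs slightly.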
Origin & History
Hendrycks and Gimpel (2016) introduced GELU. BERT (2018) and GPT-2 (2019) made it the de facto standard, and GPT-3 and Vision Transformers adopted it as well. From 2022 onward, GELU was increasingly replaced by SwiGLU.
Comparisons & Differences
GELU (Gaussian Error Linear Unit) vs. ReLU
ReLU is piecewise linear and outputs 0 for all negative inputs; GELU is smooth and softly dampens negative values instead of cutting them off.
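For example, ReLU(-1) = 0, while GELU(-1) = -1 · Φ(-1) ≈ -0.16; for large positive inputs both behave almost identically, returning roughly x.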
GELU (Gaussian Error Linear Unit) vs. SwiGLU
GELU is a simple element-wise activation; SwiGLU combines a Swish (SiLU) gate with a linear projection inside the feed-forward block and achieves better quality in LLMs.
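For contrast, a minimal sketch of a SwiGLU feed-forward layer in NumPy (the weight names W_gate, W_up, and W_down are illustrative; real implementations use learned framework tensors):

    import numpy as np

    def silu(x):
        # Swish / SiLU: x * sigmoid(x)
        return x / (1.0 + np.exp(-x))

    def swiglu_ffn(x, W_gate, W_up, W_down):
        # SwiGLU feed-forward: SiLU-gated up-projection, then a down-projection
        return (silu(x @ W_gate) * (x @ W_up)) @ W_down

    rng = np.random.default_rng(0)
    d_model, d_ff = 8, 16
    x = rng.standard_normal((2, d_model))
    out = swiglu_ffn(
        x,
        rng.standard_normal((d_model, d_ff)),
        rng.standard_normal((d_model, d_ff)),
        rng.standard_normal((d_ff, d_model)),
    )
    print(out.shape)  # (2, 8)

The extra gate adds parameters per hidden unit, which is why SwiGLU layers typically shrink d_ff to keep the overall cost comparable to a GELU feed-forward block.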