
    Feed-Forward Network (FFN)

    Also known as:
    Position-wise FFN
    MLP Block
    Transformer FFN
    Point-wise Feed-Forward
    Updated: 2/10/2026

    In the Transformer context: a two-layer MLP applied independently to each position after the attention layer.

    Quick Summary

    The FFN in a Transformer is a two-layer MLP with a nonlinearity between the layers. It accounts for roughly two-thirds of the model's parameters, stores much of its knowledge, and processes the information that attention has gathered at each position.

    Explanation

    FFN(x) = GELU(xW₁ + b₁)W₂ + b₂. The inner dimension is typically 4× the model dimension (e.g., d_model = 4096 → d_ff = 16384). This is where much of the model's "knowledge" is stored: attention retrieves the relevant information at each position, and the FFN transforms it. Modern LLMs such as LLaMA replace GELU with SwiGLU.
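    A minimal sketch of this block, assuming PyTorch (the glossary doesn't tie the formula to any framework) and the original Transformer sizes d_model = 512, d_ff = 2048 to keep it light:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: the same two linear layers applied to every position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand: xW1 + b1
        self.w2 = nn.Linear(d_ff, d_model)   # project back: (...)W2 + b2
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq, d_model); no mixing across positions happens here
        return self.w2(self.act(self.w1(x)))

x = torch.randn(2, 8, 512)
print(FeedForward()(x).shape)  # torch.Size([2, 8, 512])
```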

    Marketing Relevance

    FFN parameters make up roughly two-thirds of a Transformer's parameters; this is where most of the model's "knowledge" is stored.

    Common Pitfalls

    The 4× expansion ratio means the FFN consumes most of the parameter budget. SwiGLU uses three weight matrices instead of two, so its inner dimension is usually set to about 8/3× the model dimension (rather than 4×) to keep the parameter count comparable. MoE reduces FFN compute through sparse routing rather than shrinking the FFN itself.
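    A sketch of a SwiGLU FFN, again assuming PyTorch, that illustrates the 8/3× sizing; the exact rounding of d_ff varies by model and is an assumption here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: three weight matrices, gated activation, no biases."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        d_ff = int(8 * d_model / 3)  # ~8/3x so total params match a 4x GELU FFN
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(xW_gate) gates xW_up elementwise, then project back to d_model
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```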

    Origin & History

    The position-wise FFN was part of the original Transformer (2017), which used ReLU. GPT and BERT switched to GELU. SwiGLU was proposed by Shazeer (2020) and, after LLaMA (2023) adopted it, became the norm in modern LLMs. MoE models such as Mixtral (and reportedly GPT-4) make the FFN sparse.

    Comparisons & Differences

    Feed-Forward Network (FFN) vs. Mixture of Experts (MoE)

    Standard FFN: every token passes through all FFN parameters. MoE: a router sends each token to a small subset of expert FFNs (e.g., 2 of 8 in Mixtral), giving more total capacity at roughly the same per-token compute.
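    A minimal top-2 routing sketch, assuming PyTorch; it runs every expert densely and masks the results, which keeps the illustration short but is not how production MoE layers (which add load balancing and batched dispatch) are implemented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Each token is processed by its top-k experts, weighted by router scores."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = F.softmax(self.router(x), dim=-1)      # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                # Mask selects the tokens routed to expert e in the k-th slot
                mask = (idx[..., k] == e).unsqueeze(-1)
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out
```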
