Feed-Forward Network (FFN)
In the Transformer context: a two-layer MLP applied independently to each position after the attention layer.
The FFN in Transformers is where most learned knowledge is stored: two linear layers with a nonlinearity in between, accounting for roughly two-thirds of all parameters and processing what attention has retrieved.
Explanation
FFN(x) = GELU(xW₁ + b₁)W₂ + b₂. The inner dimension is typically 4x the model dimension (e.g., d_model = 4096 → d_ff = 16384). This is where "knowledge is stored": attention finds the relevant information, and the FFN processes it. Modern LLMs such as LLaMA replace GELU with SwiGLU.
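A minimal sketch of this position-wise FFN, assuming PyTorch; the class name and dimensions are illustrative, not from any specific library.

```python
import torch
import torch.nn as nn

class TransformerFFN(nn.Module):
    """Two-layer MLP applied independently at every position (illustrative sketch)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff (typically 4x)
        self.w2 = nn.Linear(d_ff, d_model)   # project back: d_ff -> d_model
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the same weights are applied to each position
        return self.w2(self.act(self.w1(x)))

# Usage: every token position goes through the same two linear layers.
ffn = TransformerFFN(d_model=512, d_ff=2048)
out = ffn(torch.randn(2, 10, 512))   # -> shape (2, 10, 512)
```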
Marketing Relevance
FFN weights account for roughly two-thirds of a Transformer's parameters; this is where most of the model's "knowledge" resides.
Common Pitfalls
The FFN expansion ratio (4x) is where most parameters sit. SwiGLU uses three weight matrices instead of two, so its hidden dimension is reduced to ~8/3x (rather than 4x) to keep the parameter count comparable (see the sketch below). MoE models optimize the FFN through sparse routing rather than by shrinking it.
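A sketch of a LLaMA-style SwiGLU FFN, assuming PyTorch; the exact rounding of the hidden size varies between models, so the sizing rule here is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU FFN sketch: three weight matrices, hidden size ~8/3 * d_model."""
    def __init__(self, d_model: int):
        super().__init__()
        # Three matrices instead of two, so the hidden dim shrinks to ~8/3x
        # to keep total parameters close to a 4x GELU FFN (illustrative rounding).
        d_ff = int(8 * d_model / 3)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up   = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(x W_gate) gated elementwise against x W_up, then projected down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```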
Origin & History
The position-wise FFN was part of the original Transformer (2017), where it used ReLU; GPT and BERT used GELU instead. SwiGLU (Shazeer, 2020) was adopted by LLaMA (2023) and has since become the norm in modern LLMs. MoE models (e.g., Mixtral, and reportedly GPT-4) make the FFN sparse.
Comparisons & Differences
Feed-Forward Network (FFN) vs. Mixture of Experts (MoE)
Standard FFN: every token passes through all parameters. MoE: a router selects a small subset of expert FFNs per token (e.g., 2 of 8 in Mixtral), giving more total capacity at roughly the same compute per token (see the sketch below).
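A toy top-2-of-8 routing sketch to contrast with the dense FFN above, assuming PyTorch; real MoE layers (e.g., Mixtral) add load balancing, capacity limits, and fused kernels, and the class and routing details here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """Sparse MoE FFN sketch: each token is processed by only top_k of n_experts FFNs."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router scores all experts, but only top_k run per token.
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e               # tokens that routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```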