Pre-LN vs. Post-LN
Refers to the placement of layer normalization within a Transformer block: Pre-LN normalizes before the attention/FFN sublayers, Post-LN after them.
Pre-LN normalizes before attention (more stable, simpler to train); Post-LN normalizes after (potentially better final quality). This single architecture decision can stabilize or crash LLM training.
Explanation
Post-LN (original Transformer): x → Attention → Add(x) → LN. Pre-LN (GPT-2 and later): x → LN → Attention → Add(x). Because Pre-LN leaves an unnormalized identity path through the residual stream, gradients flow directly to early layers and training is stable even without learning-rate warmup; Post-LN often converges to slightly better final quality, but only with careful warmup and tuning. Almost all modern LLMs use Pre-LN, typically combined with RMSNorm. Both variants are sketched below.
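A minimal PyTorch sketch of the two block variants. The module names, dimensions, and the GELU/4x FFN are illustrative assumptions, not taken from any specific model:

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Original Transformer (2017): sublayer first, then Add & Norm."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention -> Add -> LN, then FFN -> Add -> LN: the norm sits ON the residual path.
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.ln2(x + self.ffn(x))
        return x


class PreLNBlock(nn.Module):
    """GPT-2 style: normalize first; the residual path stays an untouched identity."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LN -> Attention -> Add, then LN -> FFN -> Add: gradients skip the norms via the residual.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x
```

The only difference between the two classes is where the LayerNorm calls sit in `forward`; the parameters are identical.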
Marketing Relevance
The choice between Pre-LN and Post-LN fundamentally affects training stability, the required learning-rate schedule, and final model quality.
Common Pitfalls
Pre-LN can lead to representation collapse in very deep models, where later layers contribute little beyond the residual stream. Post-LN requires learning-rate warmup to avoid divergence early in training (a minimal warmup sketch follows). Switching the normalization placement on an existing checkpoint, or mid-training, can destabilize training.
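A minimal sketch of the linear warmup commonly paired with Post-LN training. The model, base learning rate, and step count are hypothetical placeholders (the original Transformer paper used 4000 warmup steps, with an inverse-square-root decay afterwards that is omitted here):

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for a Post-LN Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 4000  # illustrative value

def warmup_lambda(step: int) -> float:
    # Scale the base LR linearly from ~0 up to 1.0 over warmup_steps, then hold.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)
# Per training step: optimizer.step(); scheduler.step()
```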
Origin & History
The original Transformer (Vaswani et al., 2017) used Post-LN. Xiong et al. (2020) showed that Pre-LN trains more stably. GPT-2 (OpenAI, 2019) was one of the first large models to use Pre-LN. Today, LLaMA, Mistral, and Gemma use Pre-RMSNorm.
Comparisons & Differences
Pre-LN vs. Post-LN vs. RMSNorm
Pre-LN/Post-LN describes where normalization sits in the block; RMSNorm simplifies the normalization itself, rescaling by the root mean square alone instead of standardizing by mean and variance. The two decisions are orthogonal: either placement can be combined with either norm, as the sketch below illustrates.
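A side-by-side sketch of the two normalizations, to make the orthogonality concrete. Learnable gain/bias parameters are omitted for brevity, and the tensor shapes are illustrative:

```python
import torch

def layer_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # LayerNorm: center by the mean, then scale by the standard deviation.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def rms_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # RMSNorm: no mean subtraction; rescale by the root mean square only.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

x = torch.randn(2, 8)
print(layer_norm(x).shape, rms_norm(x).shape)  # both (2, 8); where the norm is applied in the block is a separate choice
```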