
    Pre-LN vs. Post-LN

    Also known as:
    Pre-Layer Normalization
    Post-Layer Normalization
    LN Placement
    Norm Position
    Updated: 2/10/2026

    Refers to the placement of layer normalization in Transformer blocks: Pre-LN normalizes before attention/FFN, Post-LN after.

    Quick Summary

    Pre-LN normalizes before the attention/FFN sublayers (more stable training, simpler to tune); Post-LN normalizes after the residual addition (potentially better final quality). This placement decision is what stabilizes or crashes LLM training.

    Explanation

    Post-LN (original Transformer): x → Attention → Add(x) → LN. Pre-LN (GPT-2 and later): x → LN → Attention → Add(x). Pre-LN trains more stably and needs no learning-rate warmup; Post-LN can converge to slightly better quality, but only with careful tuning. Nearly all modern LLMs use Pre-LN, typically with RMSNorm instead of LayerNorm.
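The two orderings above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation: the function names are hypothetical, and `layer_norm` omits the learned gain and bias for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each vector to zero mean, unit variance (learned scale/shift omitted)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_ln_block(x, sublayer):
    # Original Transformer: residual add first, then normalize the result
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # GPT-2 style: normalize the sublayer input, leave the residual path untouched
    return x + sublayer(layer_norm(x))
```

The key difference is visible in the residual path: in Pre-LN the input flows to the output unmodified (an identity shortcut, which is what makes deep stacks trainable without warmup), while in Post-LN every block re-normalizes the stream.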

    Marketing Relevance

    The choice of Pre-LN vs Post-LN fundamentally affects training stability, required learning rate, and final model quality.

    Common Pitfalls

    Pre-LN can lead to representation collapse in very deep networks, where later layers contribute little to the output. Post-LN requires learning-rate warmup to avoid divergence early in training. Switching the placement mid-project without retuning hyperparameters can destabilize training.
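The warmup that Post-LN depends on is usually the original Transformer schedule: linear warmup followed by inverse-square-root decay. A sketch, with the function name and default values (`d_model=512`, `warmup=4000` from the 2017 paper) as illustrative assumptions:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    # Linear warmup until step == warmup, then inverse-square-root decay;
    # the min() picks whichever branch is currently smaller.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

Pre-LN models can typically skip this ramp entirely and start near the peak learning rate, which is one of the main practical reasons the placement won out.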

    Origin & History

    The original Transformer (2017) used Post-LN. Xiong et al. (2020) showed that Pre-LN trains more stably. GPT-2 (OpenAI, 2019) was one of the first large models to use Pre-LN. Today, LLaMA, Mistral, and Gemma all use Pre-LN with RMSNorm.

    Comparisons & Differences

    Pre-LN vs. Post-LN vs. RMSNorm

    Pre-LN/Post-LN describes where normalization sits in the block; RMSNorm simplifies the normalization itself (scaling by the root mean square only, instead of subtracting the mean and dividing by the standard deviation). The two decisions are orthogonal: a model can combine either placement with either norm.
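The difference between the two normalizations can be seen directly in numpy. A minimal sketch with learned scale parameters omitted; note that on zero-mean input the two coincide, which is part of why dropping the mean subtraction costs so little in practice:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Full LayerNorm statistics: subtract the mean, divide by the standard deviation
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-6):
    # RMSNorm: divide by the root mean square only; no mean subtraction
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
```

RMSNorm saves one reduction per vector and one subtraction, a measurable win at LLM scale, which is why Pre-RMSNorm is the de facto modern default.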
