
    AdamW

    Also known as:
    AdamW Optimizer
    Decoupled Weight Decay Regularization
    Fixed Adam
    Updated: 2/10/2026

    Corrected variant of the Adam optimizer that decouples weight decay from the gradient update – the de facto standard for LLM and transformer training.

    Quick Summary

    AdamW fixes Adam's flawed weight decay implementation by decoupling the decay term from the gradient update; it is the standard optimizer for virtually all modern LLMs and transformers.

    Explanation

    In Adam, weight decay is implemented as L2 regularization added to the gradient, so the decay term is subsequently rescaled by the adaptive learning rate and no longer acts as intended. AdamW decouples weight decay from the gradient update and applies it directly to the weights, so the regularization behaves as intended even with adaptive per-parameter learning rates.
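
    A minimal NumPy sketch of a single update step, showing where the decay term enters in each variant (the function and its default values are illustrative, not taken from any particular library):

        import numpy as np

        def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                      eps=1e-8, wd=0.01, decoupled=True):
            """One Adam/AdamW update for a single parameter tensor."""
            if not decoupled:
                grad = grad + wd * theta            # Adam + L2: decay folded into the gradient
            m = beta1 * m + (1 - beta1) * grad      # first-moment estimate
            v = beta2 * v + (1 - beta2) * grad**2   # second-moment estimate
            m_hat = m / (1 - beta1**t)              # bias correction (t starts at 1)
            v_hat = v / (1 - beta2**t)
            step = m_hat / (np.sqrt(v_hat) + eps)
            if decoupled:
                step = step + wd * theta            # AdamW: decay applied directly to the weights
            return theta - lr * step, m, v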

    Marketing Relevance

    AdamW is the standard optimizer for GPT, LLaMA, BERT, and virtually all modern LLMs; modern LLM training is practically unthinkable without it.

    Common Pitfalls

    The weight decay coefficient must be tuned (typical values: 0.01–0.1). Confusing AdamW with Adam plus L2 regularization leads to suboptimal training, since the two do not apply the decay in the same way.
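
    For illustration, assuming PyTorch (the model and the hyperparameter values are placeholders to be tuned per task):

        import torch
        from torch import nn

        model = nn.Linear(512, 512)  # stand-in for a real transformer

        # Decoupled weight decay: tuned separately from the learning rate,
        # typically somewhere in the 0.01-0.1 range.
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

        # Common mix-up: in torch.optim.Adam, weight_decay adds an L2 term to
        # the gradient instead, which is the coupled behavior AdamW was designed to fix.
        coupled = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.1)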

    Origin & History

    Loshchilov & Hutter's "Decoupled Weight Decay Regularization" (first posted in 2017, published at ICLR 2019) showed that Adam's L2 regularization does not behave as true weight decay under adaptive learning rates. AdamW quickly became standard for BERT (2018), GPT-2 (2019), and essentially all subsequent LLMs.

    Comparisons & Differences

    AdamW vs. Adam

    Adam applies weight decay as an L2 term on the gradient, which interacts incorrectly with the adaptive learning rates. AdamW decouples the weight decay, which is mathematically cleaner and tends to generalize better.
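
    The difference can be written out explicitly. With bias-corrected moments \hat{m}_t, \hat{v}_t, learning rate \eta, and decay coefficient \lambda (a sketch of the standard formulation; notation assumed here rather than quoted from the paper):

        Adam + L2:   g_t = \nabla f(\theta_t) + \lambda \theta_t,
                     \theta_{t+1} = \theta_t - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)

        AdamW:       g_t = \nabla f(\theta_t),
                     \theta_{t+1} = \theta_t - \eta \, ( \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda \theta_t )

    In the first form the decay term flows through \hat{m}_t and is divided by \sqrt{\hat{v}_t} + \epsilon, so weights with large historical gradients are barely regularized; in the second form every weight is decayed at the same relative rate \eta \lambda.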

    AdamW vs. SGD with Momentum

    With plain SGD, L2 regularization and weight decay are mathematically identical. With Adam/AdamW they are not, which is exactly what motivated the fix. AdamW typically converges faster, while SGD with momentum sometimes generalizes better.
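
    For plain SGD the equivalence is a one-line check (same notation as above):

        \theta_{t+1} = \theta_t - \eta (\nabla f(\theta_t) + \lambda \theta_t)
                     = (1 - \eta \lambda) \, \theta_t - \eta \, \nabla f(\theta_t)

    Adding an L2 term to the gradient is exactly the same as shrinking the weights by a factor (1 - \eta \lambda) each step. Once the update is rescaled per coordinate, as in Adam, this identity no longer holds.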

