AdamW
Corrected variant of the Adam optimizer that decouples weight decay from the gradient update – the de facto standard for LLM and transformer training.
AdamW fixes Adam's flawed weight decay handling by decoupling the decay term from the gradient update, and it has become the standard optimizer for virtually all modern LLMs and transformers.
Explanation
In Adam, weight decay is implemented as an L2 penalty added to the gradient, so the decay term is rescaled by the adaptive per-parameter learning rates and parameters with a large gradient history are effectively decayed less. AdamW decouples the two: the adaptive update uses only the loss gradient, and weight decay is applied directly to the weights as a separate shrinkage step, which behaves consistently under adaptive learning rates.
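A minimal NumPy sketch of the difference between the two update rules (function names, the single-array framing, and the hyperparameter defaults are illustrative, not taken from the paper or any library):

    import numpy as np

    def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01):
        # Adaptive moments are built from the pure loss gradient only.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)              # bias correction
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        # Decoupled weight decay: plain shrinkage, untouched by the adaptive scaling.
        w = w - lr * weight_decay * w
        return w, m, v

    def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=0.01):
        # Classic Adam with the decay folded into the gradient as an L2 term;
        # the decay is then divided by sqrt(v_hat) + eps like the rest of the gradient.
        grad = grad + weight_decay * w
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v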
Marketing Relevance
AdamW is the standard optimizer for GPT, LLaMA, BERT, and virtually all modern LLMs; practically no large language model is trained without it.
Common Pitfalls
The weight decay coefficient must be tuned (typical values: 0.01–0.1). Confusing AdamW with Adam plus L2 regularization leads to suboptimal training (see the configuration sketch below).
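A minimal configuration sketch, assuming PyTorch (the placeholder model, learning rate, and the 0.01 value are illustrative):

    import torch
    import torch.nn as nn

    model = nn.Linear(768, 768)  # placeholder model

    # In torch.optim.AdamW the weight_decay argument is applied as decoupled decay;
    # in torch.optim.Adam the same argument is added to the gradient as an L2 term.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)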
Origin & History
Loshchilov & Hutter published "Decoupled Weight Decay Regularization" in 2017/2019, showing that Adam's L2 regularization is incorrect with adaptive rates. AdamW immediately became standard for BERT (2018), GPT-2 (2019), and all subsequent LLMs.
Comparisons & Differences
AdamW vs. Adam
Adam applies weight decay as an L2 term added to the gradient, which the adaptive per-parameter scaling then distorts. AdamW decouples the decay from the gradient update – the regularization works as intended and typically generalizes better.
AdamW vs. SGD with Momentum
With plain SGD, L2 regularization and weight decay yield the same update (see the derivation below); with Adam/AdamW they do not – hence the fix. AdamW usually converges faster, while SGD with momentum sometimes generalizes better.
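A short derivation of the SGD equivalence, with \eta as the learning rate and \lambda as the L2 coefficient (notation chosen here for illustration):

    w_{t+1} = w_t - \eta \nabla\big( L(w_t) + \tfrac{\lambda}{2} \lVert w_t \rVert^2 \big)
            = (1 - \eta\lambda)\, w_t - \eta \nabla L(w_t)

The L2 penalty collapses into a multiplicative decay factor (1 - \eta\lambda). Under Adam, the extra \lambda w_t term is instead divided by \sqrt{\hat v_t} + \epsilon before the step, so the effective decay varies per parameter – the behavior AdamW removes by applying the decay outside the adaptive update.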