NAdam (Nesterov-Accelerated Adam)
Optimizer that integrates Nesterov momentum into Adam, combining NAG's look-ahead correction with Adam's per-parameter adaptive learning rates.
NAdam folds NAG's look-ahead into Adam's momentum estimate; it converges faster in theory but is only marginally better than AdamW in practice.
Explanation
NAdam modifies Adam's momentum term so that the update effectively uses the gradient at the "look-ahead" point of the momentum step rather than at the current parameters. Concretely, the bias-corrected current gradient is mixed into the momentum term, as if the next step's momentum were already applied; this can yield faster convergence and slightly better generalization. A simplified sketch of the update follows.
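For concreteness, here is a minimal NumPy sketch of one simplified NAdam step. It keeps the bias corrections but omits Dozat's momentum-decay schedule, so it illustrates the idea rather than reproducing the published algorithm exactly; the function name nadam_step and the toy usage are ours.

import numpy as np

def nadam_step(theta, g, m, v, t, lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam moment estimates.
    m = beta1 * m + (1 - beta1) * g        # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g    # 2nd moment (adaptivity)
    v_hat = v / (1 - beta2 ** t)           # bias-corrected 2nd moment
    # Nesterov look-ahead: mix the *current* gradient into the momentum
    # term, as if the next step's momentum decay were already applied.
    m_bar = (beta1 * m / (1 - beta1 ** (t + 1))
             + (1 - beta1) * g / (1 - beta1 ** t))
    theta = theta - lr * m_bar / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(x) = x^2 starting from x = 5.
theta = np.array([5.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 201):
    g = 2 * theta                          # gradient of x^2
    theta, m, v = nadam_step(theta, g, m, v, t, lr=0.1)
print(theta)                               # converges toward 0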
Marketing Relevance
NAdam is a theoretically well-founded refinement of Adam but is used far less often in practice than AdamW. It is mainly relevant for researchers and benchmark comparisons.
Common Pitfalls
In practice NAdam is only marginally better than Adam, and AdamW remains the standard. Also, Adam hyperparameters are not directly transferable; in particular, the default learning rates differ (see the example below).
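One concrete instance of the hyperparameter mismatch, assuming a recent PyTorch release: torch.optim.NAdam ships with a default learning rate of 2e-3 (Dozat's value), twice the 1e-3 default of torch.optim.Adam.

import torch

model = torch.nn.Linear(10, 1)

# Swapping Adam for NAdam without revisiting the learning rate is a
# common source of surprises: the defaults differ by a factor of two.
adam = torch.optim.Adam(model.parameters())    # default lr = 1e-3
nadam = torch.optim.NAdam(model.parameters())  # default lr = 2e-3
print(adam.defaults["lr"], nadam.defaults["lr"])  # 0.001 0.002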
Origin & History
Dozat (2016) proposed NAdam in "Incorporating Nesterov Momentum into Adam" as an elegant integration of Nesterov momentum into Adam. Despite its theoretical appeal, NAdam never displaced AdamW as the standard.
Comparisons & Differences
NAdam (Nesterov-Accelerated Adam) vs. Adam
Adam uses classical (heavy-ball) momentum for its first-moment estimate; NAdam uses Nesterov momentum, applying a look-ahead correction to the update. The comparison below makes the difference explicit.
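In the same simplified notation as the sketch above (bias corrections omitted for readability), both optimizers maintain m = beta1 * m + (1 - beta1) * g and the second moment v; the difference is confined to the momentum term of the step:

# Adam: uses the accumulated momentum estimate as-is.
theta -= lr * m / (sqrt(v) + eps)

# NAdam: re-applies the decay to m and mixes in the current gradient g.
theta -= lr * (beta1 * m + (1 - beta1) * g) / (sqrt(v) + eps)

The extra (1 - beta1) * g term injects the current gradient where Adam would use only the accumulated momentum; that injection is the look-ahead.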
NAdam (Nesterov-Accelerated Adam) vs. AdamW
AdamW fixed Adam's weight-decay handling by decoupling it from the gradient update; NAdam fixed the momentum computation. The two address different, largely orthogonal weaknesses of Adam.