Nesterov Accelerated Gradient (NAG)
Improved momentum variant that evaluates the gradient at a "look-ahead" point instead of the current parameters, giving faster and more stable convergence.
Nesterov momentum looks ahead and corrects the update direction before the momentum carries it off course – theoretically faster convergence than standard momentum.
Explanation
Standard momentum: first the gradient, then the step. Nesterov: first the step (based on the accumulated momentum), then the gradient at the new point. This "look-ahead" corrects the direction before the momentum carries the update off course, as the sketch below illustrates.
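A minimal sketch of one Nesterov step on a toy quadratic loss, using the classic look-ahead formulation; the names (grad, mu, lr) and the toy loss are illustrative choices, not from any specific library:

import numpy as np

def grad(w):
    # Gradient of the toy loss 0.5 * ||w||^2
    return w

w = np.array([2.0, -3.0])   # parameters
v = np.zeros_like(w)        # velocity (momentum buffer)
mu, lr = 0.9, 0.1           # momentum coefficient, learning rate

for _ in range(100):
    lookahead = w + mu * v  # step first, based on the current momentum
    g = grad(lookahead)     # gradient at the look-ahead point
    v = mu * v - lr * g     # update the velocity with that gradient
    w = w + v               # apply the corrected step

print(w)  # converges toward the minimum at [0, 0]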
Marketing Relevance
Nesterov momentum is a standard choice for SGD in computer vision training and, in the convex setting, offers provably better convergence rates than classical momentum.
Common Pitfalls
In practice it is often only marginally better than classical momentum, and it matters less with Adam, which has its own momentum and adaptive mechanisms.
Origin & History
Yurii Nesterov published the method in 1983 as the "accelerated gradient method", with a provably optimal O(1/k²) convergence rate for smooth convex problems. Sutskever et al. (2013) adapted it for deep learning. PyTorch implements Nesterov as a flag in SGD.
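As noted above, PyTorch exposes this through the nesterov flag of torch.optim.SGD; a minimal usage sketch (the model, data, and hyperparameters are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 1) # placeholder data

# Nesterov momentum is enabled via the `nesterov` flag; PyTorch requires
# momentum > 0 and dampening = 0 when it is set.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()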
Comparisons & Differences
Nesterov Accelerated Gradient (NAG) vs. Classical Momentum
Classical momentum computes the gradient at the current parameters; Nesterov computes it at the look-ahead point – giving better correction when the descent direction changes (see the sketch below).
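To make the contrast concrete, a side-by-side sketch of the two update rules in a common textbook formulation (same notation as the example above; not tied to any specific library):

def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    # Classical momentum: gradient evaluated at the current parameters w.
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.1, mu=0.9):
    # Nesterov: gradient evaluated at the look-ahead point w + mu * v.
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v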
Nesterov Accelerated Gradient (NAG) vs. Adam
Adam has built-in momentum (the first-moment estimate) plus adaptive learning rates. A Nesterov variant of Adam (NAdam) exists but is rarely needed.
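If you do want the Nesterov-flavored variant, PyTorch ships it as torch.optim.NAdam; a brief sketch (model and learning rate are placeholder choices):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model

# NAdam combines Adam's adaptive learning rates with a Nesterov-style momentum update.
optimizer = torch.optim.NAdam(model.parameters(), lr=2e-3)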