Momentum
Acceleration technique for gradient descent that accumulates past gradient directions to converge faster and carry the parameters through shallow local minima.
Momentum accelerates SGD by accumulating past gradients, much like a ball rolling downhill that gathers enough speed to carry over small bumps (shallow local minima). A coefficient of 0.9 is the common default.
Explanation
Momentum maintains a velocity that adds a weighted fraction of the previous update to the current gradient step: v ← β·v + g, then θ ← θ − η·v, where β is the momentum coefficient and η the learning rate. Because past gradients accumulate, updates speed up along directions where gradients consistently agree, while oscillations across steep, narrow valleys are damped.
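A minimal NumPy sketch of this update rule, assuming a toy quadratic loss f(θ) = θ² (the gradient function, coefficients, and step count are illustrative):

```python
import numpy as np

def grad(theta):
    return 2 * theta  # gradient of the toy loss f(theta) = theta**2

theta = np.array([5.0])          # start away from the minimum at 0
velocity = np.zeros_like(theta)
beta, lr = 0.9, 0.1              # momentum coefficient and learning rate

for _ in range(100):
    g = grad(theta)
    velocity = beta * velocity + g   # accumulate past gradient directions
    theta = theta - lr * velocity    # step along the accumulated direction

print(theta)  # close to 0: the "ball" has rolled into the minimum
```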
Marketing Relevance
Momentum is a standard component of most modern optimizers, either explicitly (SGD with momentum) or built in (Adam's first moment). The typical coefficient is 0.9, as shown in the sketch below.
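For illustration, this is how the coefficient is typically set in PyTorch's SGD optimizer (the model and learning rate here are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```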
Common Pitfalls
A momentum value that is too high can overshoot the minimum and cause oscillation. Also consider the interaction with the learning rate: under a roughly constant gradient g, the accumulated velocity approaches g/(1 − β), so β = 0.9 effectively multiplies the step size by about 10. The sketch below illustrates the overshoot.
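A small sketch of the overshoot on the same toy quadratic loss as above (values are illustrative, not tuning advice):

```python
def run(beta, lr=0.1, steps=30):
    theta, velocity = 5.0, 0.0
    for _ in range(steps):
        g = 2 * theta                   # gradient of f(theta) = theta**2
        velocity = beta * velocity + g
        theta -= lr * velocity
    return theta

print(run(beta=0.9))    # converges toward the minimum at 0
print(run(beta=0.99))   # overshoots and oscillates, shrinking only slowly
```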
Origin & History
Boris Polyak introduced the heavy-ball method in 1964. Yurii Nesterov's accelerated gradient method (1983) evaluates the gradient at a look-ahead point and improves the convergence rate. Kingma and Ba integrated momentum into Adam (2015) as its first-moment estimate.
Comparisons & Differences
Momentum vs. Nesterov Momentum
Standard momentum evaluates the gradient at the current parameters; Nesterov momentum evaluates it at the look-ahead point the accumulated velocity is about to reach, which acts as a cheap course correction and typically converges faster, as in the sketch below.
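A side-by-side sketch of the two updates on the toy quadratic loss (the formulation follows the common θ − η·β·v look-ahead; all values are illustrative):

```python
def grad(theta):
    return 2 * theta  # gradient of the toy loss f(theta) = theta**2

beta, lr = 0.9, 0.1
theta_std, v_std = 5.0, 0.0   # standard momentum
theta_nes, v_nes = 5.0, 0.0   # Nesterov momentum

for _ in range(50):
    # Standard: gradient at the current point
    v_std = beta * v_std + grad(theta_std)
    theta_std -= lr * v_std
    # Nesterov: gradient at the look-ahead point
    v_nes = beta * v_nes + grad(theta_nes - lr * beta * v_nes)
    theta_nes -= lr * v_nes

print(theta_std, theta_nes)  # Nesterov ends markedly closer to the minimum
```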
Momentum vs. Adam (Adaptive Moment)
Momentum tracks only the first moment (a running mean of gradients); Adam additionally tracks the second moment (a running mean of squared gradients, i.e. the uncentered variance) to adapt the learning rate per parameter, as sketched below.
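A compact sketch of Adam's two moment estimates on the same toy loss (hyperparameters follow the paper's defaults except the illustrative learning rate):

```python
import math

def grad(theta):
    return 2 * theta  # gradient of the toy loss f(theta) = theta**2

theta, m, v = 5.0, 0.0, 0.0
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8

for t in range(1, 201):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g        # first moment: momentum
    v = beta2 * v + (1 - beta2) * g * g    # second moment: squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the warm-up
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # close to the minimum at 0
```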