Stochastic Gradient Descent (SGD)
A variant of gradient descent that updates the parameters using only a mini-batch of data per step instead of the full dataset – faster per step and often better at generalizing.
SGD estimates the gradient from a mini-batch rather than the full dataset – each step is much cheaper than in batch gradient descent, and the gradient noise acts as a natural regularizer. With momentum, it remains the standard optimizer for convolutional vision models.
Explanation
At each step, SGD approximates the true (full-batch) gradient with the gradient of a randomly sampled mini-batch and updates the parameters against that estimate: θ ← θ − η·∇L_B(θ), where η is the learning rate and L_B is the loss on mini-batch B. The estimate is unbiased but noisy; this noise acts as implicit regularization and tends to steer training toward flatter minima, which are associated with better generalization.
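A minimal sketch of this update rule on least-squares linear regression, assuming synthetic data; the values of `lr` and `batch_size` are illustrative placeholders, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # 1000 samples, 5 features (assumed toy data)
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.1, 32                  # placeholder hyperparameters

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 / batch_size * Xb.T @ (Xb @ w - yb)  # gradient of the batch's mean squared error
    w -= lr * grad                                # SGD update: w <- w - lr * grad

print("max weight error:", np.max(np.abs(w - true_w)))  # small despite noisy gradients
```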
Marketing Relevance
SGD with momentum remains the gold standard for training convolutional vision models such as ResNets. Adam and its variant AdamW dominate in NLP/LLMs and transformer-based vision models (e.g., ViT), but where SGD works, it often generalizes better.
Common Pitfalls
Converges slowly without momentum. Sensitive to the choice of learning rate. Usually requires a manually tuned learning rate schedule (e.g., step decay or cosine annealing), as sketched below.
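One common way to address these pitfalls is SGD with momentum plus a cosine learning-rate schedule. A sketch in PyTorch, where the model, data, and all hyperparameter values are placeholder assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,            # needs tuning: SGD is lr-sensitive
                            momentum=0.9)      # momentum speeds up convergence
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    inputs = torch.randn(32, 10)               # stand-in mini-batch
    targets = torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()                           # decay lr along a cosine curve each epoch
```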
Origin & History
Robbins & Monro (1951) introduced stochastic approximation, the theoretical foundation of SGD. Mini-batch SGD became practical at scale with GPUs in the 2010s. SGD with momentum (Polyak, 1964) and its Nesterov variant (Nesterov, 1983) have remained dominant optimizers for decades.
Comparisons & Differences
Stochastic Gradient Descent (SGD) vs. Adam Optimizer
SGD uses a single global learning rate; Adam adapts the step size per parameter using running estimates of the gradient's first and second moments. SGD often generalizes better, while Adam typically converges faster.
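A small sketch contrasting one update of each rule; the gradient `g` is a stand-in value, and `lr`, `beta1`, `beta2`, `eps` are the usual Adam hyperparameter names, assumed here for illustration:

```python
import numpy as np

g = np.array([0.5, -0.01])        # stand-in gradient: one large, one small component
theta_sgd = np.zeros(2)
theta_adam = np.zeros(2)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = np.zeros(2), np.zeros(2)

# SGD: one global learning rate scales every component equally.
theta_sgd -= lr * g

# Adam: per-parameter step, normalized by running gradient statistics.
m = beta1 * m + (1 - beta1) * g          # first moment (running mean of gradients)
v = beta2 * v + (1 - beta2) * g**2       # second moment (running mean of squared gradients)
m_hat = m / (1 - beta1)                  # bias correction at step t = 1
v_hat = v / (1 - beta2)
theta_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

print("SGD step: ", theta_sgd)   # step proportional to each gradient component
print("Adam step:", theta_adam)  # roughly equal step size for both parameters
```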
Stochastic Gradient Descent (SGD) vs. Full-Batch Gradient Descent
Full-batch gradient descent computes the exact gradient over the entire dataset (deterministic, but expensive per step); SGD estimates it from a mini-batch (stochastic, cheap per step, implicitly regularizing).
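A quick sketch of this trade-off: the full-batch gradient is exact but touches every sample, while a mini-batch gradient is a noisy, unbiased estimate computed from a fraction of the data. The dataset and batch size below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=10_000)
w = np.zeros(3)

def grad(Xs, ys, w):
    """Gradient of the mean squared error on the given samples."""
    return 2 / len(Xs) * Xs.T @ (Xs @ w - ys)

full = grad(X, y, w)                          # exact: uses all 10,000 samples
idx = rng.choice(len(X), size=64, replace=False)
mini = grad(X[idx], y[idx], w)                # noisy estimate from only 64 samples

print("full-batch gradient:", full)
print("mini-batch estimate:", mini)
print("estimation noise:   ", mini - full)    # small but nonzero
```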