Stochastic Gradient Descent (SGD)
A variant of gradient descent that updates the parameters using only a mini-batch of data per step instead of the full dataset – faster per step and often better at generalizing.
SGD estimates the gradient from a mini-batch rather than the full dataset – each step is much cheaper than in batch gradient descent, and the gradient noise acts as a natural regularizer. With momentum, it remains the standard optimizer for convolutional vision models.
Explanation
At each step, SGD approximates the true (full-batch) gradient with the gradient of a randomly sampled mini-batch and updates the parameters against that estimate: θ ← θ − η·∇L_B(θ), where η is the learning rate and L_B is the loss on mini-batch B. The estimate is unbiased but noisy; this noise acts as implicit regularization and tends to steer training toward flatter minima, which are associated with better generalization.
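A minimal sketch of this update rule on least-squares linear regression, assuming synthetic data; the values of `lr` and `batch_size` are illustrative placeholders, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # 1000 samples, 5 features (assumed toy data)
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.1, 32                  # placeholder hyperparameters

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 / batch_size * Xb.T @ (Xb @ w - yb)  # gradient of the batch's mean squared error
    w -= lr * grad                                # SGD update: w <- w - lr * grad

print("max weight error:", np.max(np.abs(w - true_w)))  # small despite noisy gradients
```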
Marketing Relevance
SGD with momentum remains the gold standard for training convolutional vision models such as ResNets. Adam and its variant AdamW dominate in NLP/LLMs and transformer-based vision models (e.g., ViT), but where SGD works, it often generalizes better.
Common Pitfalls
Converges slowly without momentum. Sensitive to the choice of learning rate. Usually requires a manually tuned learning rate schedule (e.g., step decay or cosine annealing), as sketched below.
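One common way to address these pitfalls is SGD with momentum plus a cosine learning-rate schedule. A sketch in PyTorch, where the model, data, and all hyperparameter values are placeholder assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,            # needs tuning: SGD is lr-sensitive
                            momentum=0.9)      # momentum speeds up convergence
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    inputs = torch.randn(32, 10)               # stand-in mini-batch
    targets = torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()                           # decay lr along a cosine curve each epoch
```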
Origin & History
Robbins & Monro (1951) introduced stochastic approximation, the theoretical foundation of SGD. Mini-batch SGD became practical at scale with GPUs in the 2010s. SGD with momentum (Polyak, 1964) and its Nesterov variant (Nesterov, 1983) have remained dominant optimizers for decades.
Comparisons & Differences
Stochastic Gradient Descent (SGD) vs. Adam Optimizer
SGD uses a single global learning rate; Adam adapts the step size per parameter using running estimates of the gradient's first and second moments. SGD often generalizes better, while Adam typically converges faster.
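A small sketch contrasting one update of each rule; the gradient `g` is a stand-in value, and `lr`, `beta1`, `beta2`, `eps` are the usual Adam hyperparameter names, assumed here for illustration:

```python
import numpy as np

g = np.array([0.5, -0.01])        # stand-in gradient: one large, one small component
theta_sgd = np.zeros(2)
theta_adam = np.zeros(2)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = np.zeros(2), np.zeros(2)

# SGD: one global learning rate scales every component equally.
theta_sgd -= lr * g

# Adam: per-parameter step, normalized by running gradient statistics.
m = beta1 * m + (1 - beta1) * g          # first moment (running mean of gradients)
v = beta2 * v + (1 - beta2) * g**2       # second moment (running mean of squared gradients)
m_hat = m / (1 - beta1)                  # bias correction at step t = 1
v_hat = v / (1 - beta2)
theta_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

print("SGD step: ", theta_sgd)   # step proportional to each gradient component
print("Adam step:", theta_adam)  # roughly equal step size for both parameters
```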
Stochastic Gradient Descent (SGD) vs. Full-Batch Gradient Descent
Full-batch gradient descent computes the exact gradient over the entire dataset (deterministic, but expensive per step); SGD estimates it from a mini-batch (stochastic, cheap per step, implicitly regularizing).
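A quick sketch of this trade-off: the full-batch gradient is exact but touches every sample, while a mini-batch gradient is a noisy, unbiased estimate computed from a fraction of the data. The dataset and batch size below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=10_000)
w = np.zeros(3)

def grad(Xs, ys, w):
    """Gradient of the mean squared error on the given samples."""
    return 2 / len(Xs) * Xs.T @ (Xs @ w - ys)

full = grad(X, y, w)                          # exact: uses all 10,000 samples
idx = rng.choice(len(X), size=64, replace=False)
mini = grad(X[idx], y[idx], w)                # noisy estimate from only 64 samples

print("full-batch gradient:", full)
print("mini-batch estimate:", mini)
print("estimation noise:   ", mini - full)    # small but nonzero
```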