Stochastic Gradient Descent (SGD)
(German: Stochastischer Gradientenabstieg)

    Also known as:
    SGD
    Stochastic GD
    Mini-Batch SGD
    Updated: 2/10/2026

A variant of gradient descent that computes each update from a small mini-batch rather than the full dataset, which makes training faster and often improves generalization.

    Quick Summary

SGD uses a mini-batch instead of the full dataset for each update, so it is faster than batch gradient descent, and the gradient noise acts as a natural regularizer. With momentum, it is the gold standard for vision models.

    Explanation

    SGD approximates the true gradient with a mini-batch. The resulting noise acts as implicit regularization and helps find flatter minima.
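
As a minimal sketch of one such update, the following Python snippet implements mini-batch SGD with momentum on a toy least-squares problem; the data, batch size, and hyperparameters are illustrative assumptions, not values from this article.

import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data (illustrative assumption).
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
w = np.zeros(10)
velocity = np.zeros_like(w)
lr, momentum, batch_size = 0.01, 0.9, 32

for step in range(500):
    # Sample a mini-batch instead of using the full dataset.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Noisy estimate of the mean-squared-error gradient on the batch.
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    # Momentum update: accumulate past gradients, then step.
    velocity = momentum * velocity - lr * grad
    w += velocity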

    Marketing Relevance

    SGD with momentum remains the gold standard for computer vision (ResNet, ViT). Adam dominates in NLP/LLMs, but SGD often generalizes better.

    Common Pitfalls

Slow convergence without momentum. Sensitive to the choice of learning rate. A manually tuned learning rate schedule is usually required.
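
A common way to address the last two points is to pair SGD with momentum and an explicit schedule. The PyTorch sketch below uses a cosine schedule; the model, batch data, and step count are placeholder assumptions purely for illustration.

import torch

# Placeholder model (assumption; stands in for e.g. a ResNet).
model = torch.nn.Linear(128, 10)

# SGD needs an explicit learning rate and usually momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# Manual schedule: decay the learning rate over a fixed number of steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    inputs = torch.randn(32, 128)            # dummy batch (assumption)
    targets = torch.randint(0, 10, (32,))
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                         # advance the schedule each step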

    Origin & History

Robbins & Monro (1951) introduced stochastic approximation, the theoretical foundation of SGD. Mini-batch SGD became practical with GPUs in the 2010s. SGD with momentum (Polyak, 1964) and its Nesterov variant have remained dominant optimizers for decades.

    Comparisons & Differences

    Stochastic Gradient Descent (SGD) vs. Adam Optimizer

SGD uses one global learning rate; Adam adapts the step size per parameter. SGD often generalizes better, while Adam converges faster.
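
To make the contrast concrete, here is a sketch of both update rules in plain Python/NumPy; the learning rates, betas, and epsilon are common textbook defaults, assumed here rather than taken from this article.

import numpy as np

def sgd_step(w, grad, state, lr=0.1, momentum=0.9):
    # One global learning rate for every parameter.
    state["v"] = momentum * state.get("v", 0.0) - lr * grad
    return w + state["v"], state

def adam_step(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Per-parameter step sizes via first/second moment estimates.
    state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * grad
    state["s"] = b2 * state.get("s", 0.0) + (1 - b2) * grad**2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    s_hat = state["s"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), state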

    Stochastic Gradient Descent (SGD) vs. Full-Batch Gradient Descent

    Full-batch uses all data (deterministic, slow); SGD uses mini-batches (stochastic, fast, regularizing).
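
The difference lies only in how much data each gradient step sees; a minimal sketch, assuming a hypothetical grad_loss(w, X, y) function that returns the loss gradient on the given data.

import numpy as np

def full_batch_step(w, X, y, grad_loss, lr=0.01):
    # Deterministic: one pass over the entire dataset per update.
    return w - lr * grad_loss(w, X, y)

def sgd_step(w, X, y, grad_loss, lr=0.01, batch_size=32, rng=np.random.default_rng()):
    # Stochastic: a random mini-batch per update, giving many cheap, noisy steps.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return w - lr * grad_loss(w, X[idx], y[idx])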
