
    AdaGrad

    Also known as:
    Adaptive Gradient Algorithm
    AdaGrad Optimizer
    Updated: 2/10/2026

    Optimizer that adapts the learning rate per parameter – frequently updated parameters get smaller rates, rarely updated ones get larger rates.

    Quick Summary

    AdaGrad adapts the learning rate per parameter, so rarely updated (e.g. sparse) features get larger updates. It was the first widely used adaptive method, but its monotonically decreasing learning rate makes it poorly suited to training deep networks.

    Explanation

    AdaGrad accumulates the squared gradients of each parameter and divides the learning rate by the square root of that sum, so parameters with large or frequent gradients take smaller steps. This makes it a good fit for sparse data (NLP, recommendation systems), but because the sum only grows, the learning rate decreases monotonically and can effectively drop to zero before training has converged.
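
    A minimal NumPy sketch of this update rule; the function and variable names (adagrad_step, accum, lr, eps) are illustrative rather than taken from any particular library, and lr=0.01 is only a placeholder default.

        import numpy as np

        def adagrad_step(params, grads, accum, lr=0.01, eps=1e-8):
            """One AdaGrad update: accumulate squared gradients per parameter,
            then divide the learning rate by the square root of that sum."""
            accum += grads ** 2                             # running sum of squared gradients
            params -= lr * grads / (np.sqrt(accum) + eps)   # per-parameter adaptive step
            return params, accum

        # Rarely updated parameters keep a small accumulator, so their
        # effective learning rate stays comparatively large.
        params = np.zeros(3)
        accum = np.zeros(3)
        grads = np.array([1.0, 0.1, 0.0])                   # sparse-ish gradient
        params, accum = adagrad_step(params, grads, accum)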

    Marketing Relevance

    AdaGrad was the first widely used adaptive optimizer and inspired RMSprop and Adam. It is still relevant today for sparse features (embeddings, recommendation systems).

    Common Pitfalls

    The learning rate decreases monotonically toward zero, so training effectively stalls in long runs. This decay is usually too aggressive for deep networks; prefer RMSprop or Adam there.
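
    To see how quickly the step size shrinks, here is a toy illustration assuming a constant gradient of 1.0; the effective step lr / sqrt(accumulated sum) decays roughly like 1/sqrt(t).

        import math

        lr, accum = 0.1, 0.0
        for t in range(1, 1001):
            g = 1.0                          # constant gradient, purely for illustration
            accum += g ** 2
            step = lr / (math.sqrt(accum) + 1e-8)
            if t in (1, 10, 100, 1000):
                print(t, step)               # 0.1, ~0.032, 0.01, ~0.003: updates crawl to a halt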

    Origin & History

    Duchi, Hazan & Singer published AdaGrad in 2011. It was a breakthrough for adaptive learning rates but was largely superseded by RMSprop (Hinton, 2012) and Adam (Kingma & Ba, 2014), which address the monotonically decreasing learning rate problem.

    Comparisons & Differences

    AdaGrad vs. RMSprop

    AdaGrad accumulates all past squared gradients, so its learning rate tends toward zero; RMSprop keeps an exponential moving average of squared gradients and forgets old ones, which keeps the learning rate more stable.
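
    The difference comes down to one line in the accumulator update. A hedged sketch, with the decay factor set to 0.9 as a commonly used RMSprop default:

        import numpy as np

        grad = np.array([0.5, 0.0, 1.0])
        accum_adagrad = np.zeros(3)
        accum_rmsprop = np.zeros(3)
        decay = 0.9                          # common RMSprop default

        # AdaGrad: unbounded sum -> denominator only grows, step only shrinks
        accum_adagrad += grad ** 2

        # RMSprop: exponential moving average -> old gradients fade, step can recover
        accum_rmsprop = decay * accum_rmsprop + (1 - decay) * grad ** 2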

    AdaGrad vs. Adam

    Adam combines RMSprop's adaptive learning rate with momentum (an exponential moving average of the gradient). AdaGrad has no momentum term and a monotonically decreasing learning rate.
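
    A compact sketch of one Adam step for contrast; the defaults lr=0.001, beta1=0.9, beta2=0.999 come from the Adam paper, while the function name adam_step is only illustrative.

        import numpy as np

        def adam_step(params, grads, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
            """One Adam update: momentum term (m) plus an RMSprop-style adaptive term (v)."""
            m = b1 * m + (1 - b1) * grads            # first moment: moving average of gradients
            v = b2 * v + (1 - b2) * grads ** 2       # second moment: moving average of squared gradients
            m_hat = m / (1 - b1 ** t)                # bias correction for early steps (t starts at 1)
            v_hat = v / (1 - b2 ** t)
            params -= lr * m_hat / (np.sqrt(v_hat) + eps)
            return params, m, v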

