AdaGrad
Optimizer that adaptively adjusts the learning rate per parameter: frequently updated parameters get smaller rates, rarely updated ones get larger rates.
AdaGrad adapts learning rates per parameter: rare features get larger updates. First adaptive method, but the monotonically decreasing LR makes it unsuitable for deep networks.
Explanation
AdaGrad accumulates each parameter's squared gradients over all steps and divides the learning rate by the square root of that accumulator. This makes it a good fit for sparse data (NLP, recommendation systems), but because the accumulator only grows, the effective learning rate decreases monotonically and can drop to near zero too early.
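A minimal NumPy sketch of this update rule; the function name and the toy sparse-feature setup are illustrative, not taken from any particular library:

```python
import numpy as np

def adagrad_update(params, grads, accum, lr=0.01, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, scale the step per parameter."""
    accum = accum + grads ** 2                              # accumulator only grows
    params = params - lr * grads / (np.sqrt(accum) + eps)   # larger accumulator -> smaller step
    return params, accum

# Toy usage: one dense and one "rare" parameter, both minimizing w^2.
w = np.array([1.0, 1.0])
acc = np.zeros_like(w)
for step in range(100):
    g = 2 * w                          # gradient of w^2
    if step % 10 != 0:
        g[1] = 0.0                     # second parameter only receives a gradient every 10th step
    w, acc = adagrad_update(w, g, acc, lr=0.1)
print(w, acc)                          # the rarely updated parameter keeps a larger effective step
```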
Marketing Relevance
AdaGrad was the first adaptive optimizer and inspired RMSprop and Adam. Still relevant today for sparse features (embeddings, recommendation systems).
Common Pitfalls
The effective learning rate decreases monotonically toward zero, so training effectively stops. This decay is usually too aggressive for deep networks; prefer RMSprop or Adam.
Origin & History
Duchi, Hazan & Singer published AdaGrad in 2011. It was a breakthrough for adaptive learning rates but was quickly superseded by RMSprop (Hinton, 2012) and Adam (Kingma & Ba, 2014), which address the monotonically decreasing LR problem.
Comparisons & Differences
AdaGrad vs. RMSprop
AdaGrad accumulates all past squared gradients, so its effective LR tends toward 0; RMSprop uses an exponential moving average that forgets old gradients, which keeps the LR more stable.
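A small sketch of the difference in the accumulators, using a constant squared gradient purely for illustration:

```python
import numpy as np

g2 = 1.0                          # constant squared gradient, purely illustrative
adagrad_acc, rmsprop_acc = 0.0, 0.0
beta = 0.9                        # RMSprop decay factor

for step in range(1, 1001):
    adagrad_acc += g2                                    # sum grows without bound -> LR -> 0
    rmsprop_acc = beta * rmsprop_acc + (1 - beta) * g2   # EMA saturates near g2 -> LR stays stable
    if step in (10, 100, 1000):
        print(step, 1 / np.sqrt(adagrad_acc), 1 / np.sqrt(rmsprop_acc))
```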
AdaGrad vs. Adam
Adam combines RMSprop-style adaptive learning rates with momentum (an exponential moving average of the gradient). AdaGrad has no momentum and a monotonically decreasing LR.
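For contrast, a hedged sketch of the standard Adam update; parameter names follow the original paper, and this is illustrative rather than any framework's actual implementation:

```python
import numpy as np

def adam_update(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (m) plus RMSprop-style scaling (v), with bias correction."""
    m = beta1 * m + (1 - beta1) * grads           # EMA of gradients (momentum term)
    v = beta2 * v + (1 - beta2) * grads ** 2      # EMA of squared gradients (adaptive LR term)
    m_hat = m / (1 - beta1 ** t)                  # bias correction for zero-initialized EMAs
    v_hat = v / (1 - beta2 ** t)                  # t is the 1-based step count
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```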