RMSprop
Adaptive optimizer that fixes AdaGrad's vanishing learning rate by using an exponentially weighted average of squared gradients instead of their unbounded sum.
RMSprop fixes AdaGrad's monotonically shrinking learning rate by exponentially forgetting old gradients; it is a direct predecessor of Adam and was never formally published.
Explanation
RMSprop "forgets" old gradients and focuses on the current state. The learning rate doesn't monotonically decrease to zero and remains trainable. Hinton presented it in a Coursera lecture – never formally published.
Marketing Relevance
RMSprop was the most popular adaptive optimizer before Adam. It remains relevant as a building block of Adam and for reinforcement-learning tasks.
Common Pitfalls
No momentum term in its basic form (unlike Adam). Never formally published; the canonical reference is a set of lecture slides, which makes it awkward to cite. Largely replaced by AdamW for LLM training.
Origin & History
Geoffrey Hinton presented RMSprop in 2012 in Lecture 6 of his Coursera course "Neural Networks for Machine Learning", without a formal publication. It nonetheless became the de facto standard optimizer until Adam (2014) unified both ideas: adaptive learning rates and momentum.
Comparisons & Differences
RMSprop vs. AdaGrad
AdaGrad accumulates squared gradients without limit, so the effective learning rate decays toward zero; RMSprop uses an exponential average and thus maintains a usable learning rate, as the sketch below illustrates.
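A small runnable contrast of the two accumulators; the variable names and the constant gradient are purely illustrative:

```python
import numpy as np

grad = np.array([0.5, -0.2])          # example gradient, held constant for illustration
accum_adagrad = np.zeros_like(grad)   # AdaGrad: lifetime sum of squared gradients
avg_sq = np.zeros_like(grad)          # RMSprop: exponential moving average
decay = 0.9

for _ in range(1000):
    accum_adagrad += grad**2                         # grows without bound -> step size shrinks toward 0
    avg_sq = decay * avg_sq + (1 - decay) * grad**2  # saturates near grad**2 -> step size stays usable
```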
RMSprop vs. Adam
RMSprop adapts learning rates using only the second moment of the gradients; Adam additionally tracks the first moment (momentum) and applies bias correction. In short, Adam is roughly "RMSprop + momentum".
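To make the relationship concrete, a compact sketch of one Adam step; names are illustrative and the betas follow the commonly cited defaults from the Adam paper:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of one Adam update: RMSprop's second moment plus a momentum-like first moment."""
    m = beta1 * m + (1 - beta1) * grad      # 1st moment (momentum term, absent in RMSprop)
    v = beta2 * v + (1 - beta2) * grad**2   # 2nd moment (the RMSprop part)
    m_hat = m / (1 - beta1**t)              # bias corrections (Adam-specific), t starts at 1
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```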