Temporal Difference Learning (TD)
TD learning updates value estimates based on the difference between successive predictions, which lets it learn from incomplete episodes through bootstrapping.
TD learning works step by step: after each transition, the value of a state is nudged toward the reward plus the estimated value of the next state. This bootstrapping idea is the foundation of Q-Learning and DQN.
Explanation
Instead of waiting for the episode to end (as Monte Carlo methods do), TD updates after every step: V(s) ← V(s) + α[r + γV(s') - V(s)]. The bracketed term δ = r + γV(s') - V(s) is the TD error, and it is this error that drives learning.
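As a minimal sketch of this update rule (in Python, using a hypothetical stream of (state, reward, next_state) transitions that is not part of the original text), one-step TD(0) can look like this:

```python
# Minimal tabular TD(0) sketch; the transition data and state names are assumptions.
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_error = r + gamma * V[s_next] - V[s]  # the TD error drives learning
    V[s] += alpha * td_error
    return td_error

# Usage on a toy transition stream (hypothetical data).
V = defaultdict(float)
for s, r, s_next in [("A", 0.0, "B"), ("B", 1.0, "C"), ("C", 0.0, "A")]:
    td0_update(V, s, r, s_next)
```

Because each update only needs the current transition, learning proceeds online, without waiting for the episode to finish.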
Marketing Relevance
TD learning is the mathematical foundation of Q-Learning and thus of DQN, which mastered Atari games. It is one of the fundamental concepts of reinforcement learning.
Common Pitfalls
Bootstrapping can propagate estimation errors from one state to the next. TD(λ) trades bias against variance through the choice of λ. Convergence is only guaranteed under suitable step-size conditions, such as a learning rate that decays appropriately over time.
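To make the λ tradeoff concrete, here is a hedged sketch of tabular TD(λ) with accumulating eligibility traces; the episode format and parameter values are assumptions for illustration only:

```python
# Sketch of tabular TD(lambda) with accumulating eligibility traces (toy setup, assumed).
from collections import defaultdict

def td_lambda_episode(transitions, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Update V over one episode of (s, r, s') steps.
    lam = 0 recovers one-step TD(0) (more bias, less variance);
    lam -> 1 approaches Monte Carlo (less bias, more variance)."""
    E = defaultdict(float)                       # eligibility trace per state
    for s, r, s_next in transitions:
        td_error = r + gamma * V[s_next] - V[s]
        E[s] += 1.0                              # accumulate trace for the visited state
        for state in list(E):
            V[state] += alpha * td_error * E[state]
            E[state] *= gamma * lam              # decay every trace each step
    return V

# Usage on a hypothetical two-step episode (unseen states default to value 0).
V = td_lambda_episode([("A", 0.0, "B"), ("B", 1.0, "terminal")], defaultdict(float))
```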
Origin & History
Sutton (1988) formalized TD learning. TD-Gammon (Tesauro, 1992) was an early success, learning backgammon through self-play. TD methods became the foundation for Q-Learning (Watkins, 1989) and all modern value-based RL algorithms.
Comparisons & Differences
Temporal Difference Learning (TD) vs. Monte Carlo Methods
Monte Carlo waits until the episode ends and uses the exact observed return as its target; TD bootstraps after each step, which gives faster, lower-variance updates at the cost of bias in the target.
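One way to see the difference is to put the two update targets side by side. The following Python sketch uses a hypothetical episode of (state, reward) pairs and is illustrative only, not a reference implementation:

```python
# Side-by-side sketch of the two update targets (toy data and names are assumptions).
from collections import defaultdict

def monte_carlo_update(V, episode, alpha=0.1, gamma=0.99):
    """Wait for the full episode, then update each visited state toward the observed return G."""
    G = 0.0
    for s, r in reversed(episode):       # episode: list of (state, reward) pairs
        G = r + gamma * G                # exact return from this state onward (no bootstrapping)
        V[s] += alpha * (G - V[s])

def td0_step(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Update immediately after one step, bootstrapping from the current estimate V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V_mc, V_td = defaultdict(float), defaultdict(float)
monte_carlo_update(V_mc, [("A", 0.0), ("B", 1.0)])  # needs the whole episode
td0_step(V_td, "A", 0.0, "B")                       # needs only one transition
```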