Exponential Moving Average (EMA)
A technique that maintains an exponentially weighted average of the model weights over the course of training. The EMA copy often generalizes better than the final raw weights and is standard in diffusion models and self-supervised learning, where it provides more robust inference weights.
Explanation
After each optimizer step the EMA weights are updated as θ_ema ← α × θ_ema + (1 − α) × θ_current, with a typical decay of α = 0.999 or 0.9999. The EMA model is used only for evaluation/inference; gradients are always computed against the current weights.
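A minimal PyTorch sketch of this update rule (the EMA class and its method names are illustrative, not from any particular library):

```python
import copy
import torch

class EMA:
    """Maintains an exponential moving average of a model's parameters."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Deep copy so EMA weights live in separate storage (the 2x parameter cost).
        self.ema_model = copy.deepcopy(model).eval()
        for p in self.ema_model.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # theta_ema <- decay * theta_ema + (1 - decay) * theta_current
        # Note: only parameters are averaged; buffers such as BatchNorm
        # running stats are not (see Common Pitfalls below).
        for ema_p, p in zip(self.ema_model.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Usage inside a training loop (model, optimizer, loader assumed to exist):
#   ema = EMA(model, decay=0.999)
#   for batch in loader:
#       loss = compute_loss(model, batch)
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
#       ema.update(model)          # after each optimizer step
#   evaluate(ema.ema_model)        # inference uses the EMA weights
```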
Practical Relevance
EMA is standard for diffusion models (e.g. Stable Diffusion) and vision transformers (ViTs), and is increasingly used for LLMs. Self-supervised methods such as DINO and BYOL use an EMA of the student network as the "teacher".
Common Pitfalls
EMA requires additional memory for a second copy of the weights (2× parameters). The decay rate must be tuned to the training length: too high and the average lags behind, too low and it barely smooths. BatchNorm running statistics are buffers, not parameters, and must be recomputed separately for the EMA model before evaluation.
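One way to handle the BatchNorm pitfall is to refresh the running statistics with forward passes over training data. PyTorch ships a utility for this in its SWA module that works for any averaged model; a sketch, assuming `ema.ema_model` and a `train_loader` as in the snippet above:

```python
import torch
from torch.optim.swa_utils import update_bn

# Resets the BatchNorm running mean/var of the EMA model and refreshes
# them with forward passes over the loader (no gradient updates).
# Works with loaders that yield input tensors or (input, target) pairs.
update_bn(train_loader, ema.ema_model, device=torch.device("cuda"))
```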
Origin & History
Polyak & Juditsky (1992) proposed averaging the iterates of stochastic optimization (Polyak averaging) for faster convergence. EMA later became essential for self-supervised learning (BYOL 2020, DINO 2021) and for diffusion models, and is standard in nearly all generative models today.
Comparisons & Differences
Exponential Moving Average (EMA) vs. SWA (Stochastic Weight Averaging)
EMA averages continuously with exponential decay, so recent iterates dominate; SWA averages discrete checkpoints uniformly, typically collected late in training. EMA is simpler to run online; SWA spreads its average over a broader stretch of the training trajectory.
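To make the contrast concrete, the two per-step update rules as plain tensor arithmetic (function names are illustrative):

```python
import torch

def ema_update(theta_ema: torch.Tensor, theta: torch.Tensor, decay: float = 0.999):
    # Exponentially decaying weights: recent iterates count most.
    return decay * theta_ema + (1.0 - decay) * theta

def swa_update(theta_swa: torch.Tensor, theta: torch.Tensor, n_averaged: int):
    # Uniform running mean over the n_averaged checkpoints seen so far.
    return theta_swa + (theta - theta_swa) / (n_averaged + 1)
```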
Exponential Moving Average (EMA) vs. Checkpoint Ensemble
Ensemble uses multiple checkpoints at inference (expensive); EMA produces a single model with similar smoothing (cheap).