
    Exponential Moving Average (EMA)

    Also known as:
    EMA
    Polyak Averaging
    Model EMA
    Weight EMA
    Updated: 2/12/2026

    A technique that maintains an exponentially weighted average of the model weights over the course of training; the EMA model often generalizes better than the final training weights.

    Quick Summary

    EMA maintains a moving average of the model weights. It is standard for diffusion models and self-supervised learning and typically yields more robust weights for inference.

    Explanation

    After each optimizer step, the EMA weights are updated as θ_ema ← α × θ_ema + (1 − α) × θ_current, with a typical decay of α = 0.999 or 0.9999. The EMA weights are used only for evaluation and inference; training continues against the current (non-averaged) weights.
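    A minimal sketch of this update loop in PyTorch, assuming the update runs once per optimizer step (the class name EMAWeights is illustrative, not from a specific library):

    import copy
    import torch

    class EMAWeights:
        def __init__(self, model: torch.nn.Module, decay: float = 0.999):
            self.decay = decay
            # Frozen copy of the model; it never receives gradients.
            self.ema_model = copy.deepcopy(model).eval()
            for p in self.ema_model.parameters():
                p.requires_grad_(False)

        @torch.no_grad()
        def update(self, model: torch.nn.Module):
            # theta_ema <- alpha * theta_ema + (1 - alpha) * theta_current
            for ema_p, p in zip(self.ema_model.parameters(), model.parameters()):
                ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

    Call update(model) after every optimizer step and run evaluation on ema_model.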

    Practical Relevance

    EMA is standard for diffusion models (Stable Diffusion) and ViTs, and increasingly for LLMs. DINO and BYOL use an EMA of the student weights as the "teacher" network in self-supervised learning.

    Common Pitfalls

    EMA requires extra memory for a second copy of the weights (2× parameter count). The decay rate must be tuned: too low and the average tracks noise, too high and it lags far behind the current weights (a common warmup schedule is sketched below). BatchNorm running statistics are not part of the averaged parameters and must be recomputed separately for the EMA model.
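    One common mitigation for the decay sensitivity is to warm the decay up over the first steps so that early, noisy weights are forgotten quickly. The schedule below follows the one used by TensorFlow's ExponentialMovingAverage; the function name effective_decay is illustrative:

    def effective_decay(step: int, target_decay: float = 0.9999) -> float:
        # Small effective decay early on (the average tracks the current
        # weights closely), approaching target_decay as training progresses.
        return min(target_decay, (1.0 + step) / (10.0 + step))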

    Origin & History

    Polyak & Juditsky (1992) proposed weight averaging for faster convergence. EMA became essential for self-supervised learning (BYOL 2020, DINO 2021) and diffusion models. Standard in nearly all generative models today.

    Comparisons & Differences

    Exponential Moving Average (EMA) vs. SWA (Stochastic Weight Averaging)

    EMA updates the average continuously with exponential decay; SWA averages discrete checkpoints, typically collected late in training with equal weights. EMA is simpler to apply on the fly; SWA's uniform average covers a broader window of training.
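    For contrast, a sketch of the SWA-style uniform running mean over checkpoints, in plain PyTorch (n_averaged is the number of checkpoints seen so far; names are illustrative):

    import torch

    @torch.no_grad()
    def swa_update(swa_param: torch.Tensor, param: torch.Tensor, n_averaged: int):
        # Running arithmetic mean: every checkpoint contributes equal weight,
        # unlike EMA's exponentially decaying weights.
        swa_param.add_((param - swa_param) / (n_averaged + 1))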

    Exponential Moving Average (EMA) vs. Checkpoint Ensemble

    A checkpoint ensemble runs multiple saved models at inference time (expensive: N forward passes, N sets of weights); EMA folds the averaging into a single model that provides a similar smoothing effect at the cost of one forward pass (cheap).
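    A sketch of the inference-cost difference, assuming models is a list of loaded checkpoint models and ema_model is a single EMA model (both names illustrative):

    import torch

    @torch.no_grad()
    def ensemble_predict(models, x):
        # N forward passes, N sets of weights held in memory.
        return torch.stack([m(x) for m in models]).mean(dim=0)

    @torch.no_grad()
    def ema_predict(ema_model, x):
        # One forward pass, one set of weights.
        return ema_model(x)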

