
    Stochastic Weight Averaging (SWA)

    Also known as:
    SWA
    Weight Averaging
    SWA Training
    Updated: 2/12/2026

Training technique that averages model weights across multiple checkpoints to find flatter minima and improve generalization.

    Quick Summary

SWA averages weights across training checkpoints: a free generalization improvement with no inference overhead that finds flatter minima.

    Explanation

After conventional training, training continues for additional epochs with a cyclical or constant learning rate, and the weights reached at regular intervals (typically the end of each epoch or cycle) are averaged. The averaged model typically lies in a flatter region of the loss landscape and generalizes better than any individual checkpoint.
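As a concrete sketch, the SWA phase can be written with PyTorch's torch.optim.swa_utils (AveragedModel and SWALR); the toy model, dataset, learning rates, and epoch counts below are illustrative assumptions rather than values from the original paper.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for a real model and dataset
model = nn.Sequential(nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 1))
loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)),
                    batch_size=32, shuffle=True)
loss_fn = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
swa_model = AveragedModel(model)               # holds the running equal-weight average
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # anneals to, then holds, the SWA learning rate

swa_start, epochs = 15, 20                     # begin averaging late in training
for epoch in range(epochs):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # fold the current weights into the average
        swa_scheduler.step()
# BatchNorm statistics still have to be refreshed for swa_model (see Common Pitfalls below)
```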

    Marketing Relevance

SWA is a free generalization improvement: there is no additional inference cost (the result is still a single model), only slightly more training.

    Common Pitfalls

Batch normalization statistics must be recomputed after averaging, because the averaged weights never produced forward passes during training. SWA is also not always effective on models that are already optimally tuned.
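PyTorch's torch.optim.swa_utils.update_bn addresses the first pitfall by running one pass over the training data to refresh BatchNorm running statistics; this sketch reuses the loader and swa_model names from the earlier example.

```python
from torch.optim.swa_utils import update_bn

# One pass over the training data recomputes BatchNorm running mean/variance
# for the averaged weights; skipping this step can badly miscalibrate swa_model.
update_bn(loader, swa_model)

# Only after this step should the averaged model be evaluated or deployed.
swa_model.eval()
```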

    Origin & History

    Izmailov et al. (2018) showed that simple weight averaging at the end of training consistently delivers better generalization. PyTorch integrated SWA as an official optimizer extension.

    Comparisons & Differences

    Stochastic Weight Averaging (SWA) vs. Model Ensemble

    Ensemble: multiple models at inference (N× cost). SWA: one averaged model at inference (1× cost, similar effect).
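To make the inference-cost difference concrete, here is a hedged sketch contrasting the two ways checkpoints from one training run can be combined; the tiny linear stand-in models and random input are illustrative assumptions only.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 10)
# Stand-ins for N checkpoints collected along one training run
checkpoints = [nn.Linear(10, 1) for _ in range(5)]

# Ensemble: keep all N models and average their predictions (N forward passes per input)
ensemble_pred = torch.stack([m(x) for m in checkpoints]).mean(dim=0)

# SWA: average the weights once, then keep a single model (1 forward pass per input)
swa_net = nn.Linear(10, 1)
with torch.no_grad():
    swa_net.weight.copy_(torch.stack([m.weight for m in checkpoints]).mean(dim=0))
    swa_net.bias.copy_(torch.stack([m.bias for m in checkpoints]).mean(dim=0))
swa_pred = swa_net(x)
```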

    Stochastic Weight Averaging (SWA) vs. EMA (Exponential Moving Average)

SWA averages discrete checkpoints with equal weight; EMA averages continuously, with exponentially decaying weight on older parameters. EMA is simpler to implement.
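The two averaging rules can be contrasted in a few lines; a minimal sketch over raw parameter tensors, where the random tensors and the decay value 0.999 are illustrative assumptions.

```python
import torch

w_swa = torch.zeros(5)   # running equal-weight (SWA) average
w_ema = torch.zeros(5)   # running exponential moving average
decay = 0.999            # EMA decay factor (illustrative choice)

for n in range(1, 101):
    w = torch.randn(5)   # stands in for the model weights at checkpoint n

    # SWA: every checkpoint contributes with equal weight 1/n
    w_swa += (w - w_swa) / n

    # EMA: recent weights dominate; older ones decay exponentially
    w_ema = decay * w_ema + (1 - decay) * w
```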

