Stochastic Weight Averaging (SWA)
Training technique that averages model weights over multiple checkpoints to find flatter minima and better generalization.
SWA averages weights across training checkpoints – an essentially free generalization improvement with no inference overhead that tends to find flatter minima.
Explanation
After the normal training phase, training continues for additional epochs with a cyclical or constant learning rate, and the weights reached at these checkpoints are averaged. The averaged model typically lies in a flatter, wider region of the loss landscape, which is associated with better generalization.
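A minimal sketch of this recipe using PyTorch's torch.optim.swa_utils is shown below; the model, dummy data, learning rates, and epoch counts are placeholder assumptions chosen for illustration.

```python
# Sketch of the SWA recipe with torch.optim.swa_utils (values are illustrative).
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.MSELoss()
loader = [(torch.randn(8, 10), torch.randn(8, 2)) for _ in range(20)]  # dummy data

swa_model = AveragedModel(model)               # holds the running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # constant LR for the SWA phase
swa_start, epochs = 15, 25                     # start averaging after normal training

for epoch in range(epochs):
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # add the current weights to the average
        swa_scheduler.step()

# swa_model now contains the averaged weights; batch-norm statistics still need
# to be refreshed (see Common Pitfalls).
```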
Marketing Relevance
SWA is a free generalization improvement – no additional inference cost (one model), just slightly more training.
Common Pitfalls
The running statistics of batch normalization layers must be recomputed for the averaged weights, since the statistics collected during training no longer match the averaged parameters. SWA is also not always effective on models that are already carefully tuned.
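PyTorch provides a helper for exactly this step; the sketch below reuses loader and swa_model from the training example above.

```python
# Refresh BatchNorm running statistics for the averaged model with one pass
# over the training data.
from torch.optim.swa_utils import update_bn

update_bn(loader, swa_model)
```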
Origin & History
Izmailov et al. (2018) showed that simple weight averaging at the end of training consistently delivers better generalization. PyTorch later integrated SWA as an official optimizer extension (torch.optim.swa_utils).
Comparisons & Differences
Stochastic Weight Averaging (SWA) vs. Model Ensemble
Ensemble: multiple models at inference (N× cost). SWA: one averaged model at inference (1× cost, similar effect).
Stochastic Weight Averaging (SWA) vs. EMA (Exponential Moving Average)
SWA averages discrete checkpoints with equal weight; EMA maintains a continuously updated average with exponential decay – EMA is simpler to implement.
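The difference can be sketched with PyTorch's AveragedModel, which uses the equal-weight SWA average by default and accepts a custom avg_fn for an EMA-style update; the decay value here is an illustrative assumption.

```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(10, 2)

# SWA (default): running equal-weight mean over all collected checkpoints.
swa_model = AveragedModel(model)

# EMA-style update via a custom averaging function (decay of 0.999 is an assumption).
ema_decay = 0.999
ema_model = AveragedModel(
    model,
    avg_fn=lambda averaged_p, new_p, num_averaged:
        ema_decay * averaged_p + (1.0 - ema_decay) * new_p,
)
```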