Sharpness-Aware Minimization (SAM)
Optimization method that minimizes not only the loss but also the "sharpness" of the loss landscape – finds flatter minima for better generalization.
SAM explicitly seeks flat minima through an adversarial weight perturbation – better generalization at the cost of roughly 2x compute per step.
Explanation
SAM performs two gradient computations per step: first an ascent step that perturbs the weights within a small radius toward maximum loss increase, then a gradient taken at that perturbed point and applied to the original weights. Result: parameters settle in flat, robust regions where small weight changes barely move the loss.
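A minimal sketch of one SAM update in PyTorch, assuming a standard supervised setup. The function name `sam_step`, the radius `rho`, and the loop structure are illustrative, not the reference implementation; the objective being approximated is "minimize the worst-case loss within a small ball around the current weights."

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One SAM update: ascent to a nearby high-loss point, then descent
    applied to the original weights (illustrative sketch)."""
    # --- Ascent step: gradient at the current weights ---
    loss = loss_fn(model(x), y)
    loss.backward()

    # Global gradient norm over all parameters that received a gradient.
    grad_norm = torch.norm(
        torch.stack([p.grad.norm(p=2) for p in model.parameters() if p.grad is not None])
    )

    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)  # step toward higher loss, inside the rho-ball
            p.add_(e)
            perturbations.append(e)
    model.zero_grad()

    # --- Descent step: gradient at the perturbed point ---
    loss_fn(model(x), y).backward()

    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)  # restore the original weights before updating
    base_optimizer.step()       # e.g. AdamW; uses the perturbed-point gradients
    base_optimizer.zero_grad()
    return loss.item()
```

In practice the ascent radius rho is a small constant (values around 0.05 are a common default), and the base optimizer keeps its usual learning-rate schedule.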
Marketing Relevance
SAM significantly improves generalization in vision models. Google uses SAM for ViT training. Especially effective with limited data.
Common Pitfalls
Roughly 2x compute cost, since each step needs two forward/backward passes. ASAM (Adaptive SAM) makes the perturbation scale-adaptive rather than cheaper, so the overhead largely remains. Not always worthwhile for LLM training, where compute budgets dominate.
Origin & History
Foret et al. (Google, 2021) introduced SAM, showing consistent improvements across diverse benchmarks. ASAM (Kwon et al., 2021) made the sharpness measure scale-adaptive. SAM subsequently became standard in Google's ViT training recipes.
Comparisons & Differences
Sharpness-Aware Minimization (SAM) vs. AdamW
AdamW minimizes only the loss; SAM minimizes the loss AND the sharpness of the surrounding landscape. SAM is not a standalone optimizer: it wraps a base optimizer such as AdamW (SAM + AdamW), which performs the actual parameter update (see the sketch below).
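A hedged usage sketch of layering SAM on AdamW, reusing the illustrative `sam_step` helper from the Explanation section; the model, data, and hyperparameters here are placeholders.

```python
import torch

# Illustrative setup; sam_step is the sketch from the Explanation section.
model = torch.nn.Linear(10, 2)
base_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(32, 10)              # dummy batch
y = torch.randint(0, 2, (32,))       # dummy labels
loss = sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05)
```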
Sharpness-Aware Minimization (SAM) vs. Stochastic Weight Averaging (SWA)
SWA averages checkpoints for flatter solutions post-hoc; SAM actively seeks flat minima during training.
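For contrast, a minimal SWA sketch using PyTorch's built-in `torch.optim.swa_utils`; the model, data, and schedule are placeholders. SWA averages weights collected late in ordinary training, whereas SAM changes every update step.

```python
import torch
from torch.optim.swa_utils import AveragedModel, update_bn

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
swa_model = AveragedModel(model)          # keeps a running average of the weights
loss_fn = torch.nn.CrossEntropyLoss()

# Dummy data loader for illustration.
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(10)]

for epoch in range(20):
    for x, y in loader:                   # ordinary training loop, no SAM involved
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= 15:                       # start averaging late in training
        swa_model.update_parameters(model)

# Recompute BatchNorm statistics for the averaged weights (needed when the model has BN layers).
update_bn(loader, swa_model)
# swa_model now holds the averaged (flatter) solution; `model` keeps the last AdamW iterate.
```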