Gradient Noise
The natural noise in gradient estimates caused by mini-batch sampling; it acts as implicit regularization and helps find better minima.
Gradient noise from mini-batch sampling is not a bug but a feature: it acts as natural regularization and helps SGD find flatter, better minima.
Explanation
Each mini-batch provides a noisy estimate of the true gradient. This noise helps the optimizer escape sharp minima and settle into flatter, better-generalizing solutions.
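To make this concrete, here is a minimal NumPy sketch (synthetic linear-regression data; the batch size of 32 is an arbitrary choice) showing that a mini-batch gradient is the full-batch gradient plus roughly zero-mean noise:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 10_000, 32                       # dataset size and mini-batch size (assumed)
X = rng.normal(size=(N, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=N)
w = np.zeros(5)                         # current parameters

def grad(Xb, yb, w):
    """Gradient of the mean-squared-error loss on a batch."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

g_full = grad(X, y, w)                  # the "true" full-batch gradient
idx = rng.choice(N, size=B, replace=False)
g_mini = grad(X[idx], y[idx], w)        # one noisy mini-batch estimate

# The estimate is unbiased: averaged over many batches it matches g_full,
# and its variance shrinks roughly as 1/B as the batch size grows.
print("noise norm:", np.linalg.norm(g_mini - g_full))
```

Rerunning the last three lines with fresh batches shows the noise directly: each draw perturbs the descent direction, and that randomness is exactly what lets SGD hop out of sharp minima.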
Practical Relevance
Gradient noise explains why smaller batch sizes often generalize better and why SGD tends to find flatter minima than full-batch gradient descent.
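A hedged sketch of one practical consequence, based on the noise scale from Smith & Le (g ≈ lr · N / B for B ≪ N): when the batch size grows, scaling the learning rate proportionally keeps the noise scale roughly constant. The base values below are illustrative, not a recommendation:

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Scale the learning rate with the batch size (the "linear scaling rule")
    so that lr / B, and hence the SGD noise scale, stays roughly constant."""
    return base_lr * new_batch / base_batch

print(scaled_lr(0.1, base_batch=256, new_batch=1024))  # -> 0.4
```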
Common Pitfalls
Too much noise (batches that are too small) can prevent convergence; too little noise (batches that are too large) can hurt generalization.
Origin & History
The regularizing effect of SGD noise has been studied intensively since around 2015. Keskar et al. (2017) showed that large-batch training tends to converge to sharp minima; Smith & Le (2018) analyzed SGD noise from a Bayesian perspective and derived a noise scale linking batch size, learning rate, and generalization.
Comparisons & Differences
Gradient Noise vs. Dropout
Dropout adds explicit noise to activations (regularization by design); gradient noise arises naturally through mini-batch sampling.
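The difference is easy to see in code. In this small PyTorch sketch (layer sizes and the dropout rate are arbitrary), dropout resamples its explicit noise on every forward pass, whereas gradient noise would enter only through which examples the data loader happens to put into each batch:

```python
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5))  # explicit noise by design
x = torch.randn(32, 10)

layer.train()                 # dropout is active in training mode
print(layer(x)[0, :3])        # two calls on the same input differ:
print(layer(x)[0, :3])        # the dropout mask is resampled each time

# Gradient noise, by contrast, needs no extra module: even with dropout
# disabled, a DataLoader with shuffle=True yields a different stochastic
# gradient for every mini-batch it samples.
```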
Gradient Noise vs. Gradient Clipping
Gradient clipping limits the gradient's magnitude (a defense against exploding gradients); gradient noise describes the gradient's natural variance (a feature, not a problem).
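A short PyTorch sketch of this contrast (toy linear model; max_norm=1.0 is an arbitrary threshold): clipping caps the gradient's magnitude after the backward pass, but neither removes nor is meant to remove the batch-to-batch variance:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss = model(torch.randn(32, 10)).pow(2).mean()   # loss on one random mini-batch
loss.backward()                                   # produces a noisy gradient

# Cap the gradient norm to guard against exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# The gradient's direction still varies from batch to batch after clipping;
# that variance is the regularizing "feature" this entry describes.
```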