Gradient Centralization (GC)
A simple technique that subtracts the mean from each gradient before the update is applied to the weights – regularization that costs essentially nothing, is added with one line of code, and consistently improves generalization.
Explanation
GC centers gradients around zero: g ← g − mean(g), where the mean is computed per weight slice (per output neuron of a fully connected layer, or per filter of a conv layer); bias vectors are left untouched. This implicitly regularizes the weight norms and has an effect similar to weight decay, without introducing an extra hyperparameter.
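A minimal sketch of the operation in NumPy (the name centralize_gradient is illustrative, not from the paper); the mean is taken over all axes except the first, matching the per-output-unit centering described above:

```python
import numpy as np

def centralize_gradient(grad: np.ndarray) -> np.ndarray:
    """Center a weight gradient: subtract the mean over all but the first axis.

    For a fully connected weight of shape (out_features, in_features) this
    centers each output neuron's gradient row; for a conv kernel of shape
    (out_channels, in_channels, kH, kW) it centers each filter's slice.
    1-D parameters (biases) are returned unchanged.
    """
    if grad.ndim <= 1:
        return grad
    axes = tuple(range(1, grad.ndim))
    return grad - grad.mean(axis=axes, keepdims=True)

g = np.random.randn(4, 3)
gc = centralize_gradient(g)
print(np.allclose(gc.sum(axis=1), 0.0))  # True: each centered slice sums to zero
```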
Practical Relevance
GC can be layered on top of any gradient-based optimizer and, in the paper's experiments, consistently improves generalization – regularization at essentially zero computational cost, added with a single line in the training loop (see the sketch below).
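A sketch of how GC slots into a standard PyTorch training step, assuming the usual model/optimizer/loss objects; apply_gc is a hypothetical helper name, and the dimension check mirrors the pitfall below about excluding bias vectors:

```python
import torch

def apply_gc(model: torch.nn.Module) -> None:
    """Centralize gradients in place, called after backward() and before step().

    Only multi-dimensional parameters (weight matrices, conv kernels) are
    touched; biases and other 1-D parameters are skipped.
    """
    for p in model.parameters():
        if p.grad is not None and p.grad.dim() > 1:
            dims = tuple(range(1, p.grad.dim()))
            p.grad.sub_(p.grad.mean(dim=dims, keepdim=True))

# Inside an ordinary training step:
#   loss.backward()
#   apply_gc(model)   # the one added line
#   optimizer.step()
```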
Common Pitfalls
Not suitable for all parameter types: bias vectors and other one-dimensional parameters should be excluded (the sketch above skips them). The effect is less well studied for very large models. Combining GC with weight decay can be partly redundant, since both regularize the weight norms.
Origin & History
Yong et al. (2020) showed that this almost trivial operation (subtracting the gradient mean) brings consistent improvements across diverse tasks. The paper "Gradient Centralization: A New Optimization Technique for Deep Neural Networks" was presented at ECCV 2020.
Comparisons & Differences
Gradient Centralization (GC) vs. Weight Decay
Weight decay penalizes large weights explicitly via an added loss term; GC regularizes the weight norms implicitly by centering the gradients – a similar effect through a different mechanism, which the sketch below makes precise.
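Following the analysis in the ECCV 2020 paper, centering is multiplication by a projection matrix, so SGD with GC keeps each weight slice on a hyperplane through its initialization – a constraint on the weights rather than an explicit penalty (e is the all-ones vector, M the slice length):

```latex
\Phi_{\mathrm{GC}}(g) = g - \tfrac{1}{M}\,(e^{\top} g)\,e = P\,g,
\qquad P = I - \tfrac{1}{M}\, e e^{\top},\quad P^{2} = P,\quad P e = 0.
```

Because the SGD update becomes w^(t+1) = w^t − α P g^t and P e = 0, the mean of each weight slice never changes: e^T w^t = e^T w^0 for all t.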
Gradient Centralization (GC) vs. Batch Normalization
BN normalizes activations in the forward pass; GC centers gradients in the backward pass. Both stabilize training, via different mechanisms.