
    Gradient Centralization (GC)

    Also known as:
    GC
    Centralized Gradients
    Mean-Subtracted Gradients
    Updated: 2/12/2026

    A simple technique that subtracts the mean from each weight gradient before the update step, improving generalization at negligible extra cost.

    Quick Summary

    Gradient centralization subtracts the mean from each weight gradient before the optimizer update. It is essentially free regularization, added with one line of code, and consistently improves generalization.

    Explanation

    GC centers each weight gradient around zero: g = g − mean(g), where the mean is taken per weight vector (per output neuron of a fully connected layer or per filter of a convolutional layer) rather than over the whole gradient. This implicitly regularizes weight norms and has an effect similar to weight decay without introducing an extra hyperparameter.
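    A minimal sketch of the centering operation, assuming PyTorch tensors; the function name centralize_gradient is illustrative rather than a library API, and it follows the common convention of averaging over every dimension except the first (output) dimension.

        import torch

        def centralize_gradient(grad: torch.Tensor) -> torch.Tensor:
            # GC is defined for weight matrices and conv kernels, where the
            # first dimension indexes output units/filters; skip 1-D tensors
            # such as bias vectors.
            if grad.dim() <= 1:
                return grad
            # Subtract the mean over all remaining dimensions so that each
            # output unit's gradient sums to zero.
            reduce_dims = tuple(range(1, grad.dim()))
            return grad - grad.mean(dim=reduce_dims, keepdim=True)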

    Marketing Relevance

    GC can be layered on top of virtually any optimizer with roughly one line of code and consistently improves generalization in the original experiments. It is regularization at essentially zero computational cost.
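    As a hedged sketch of what that integration could look like in a PyTorch training loop (apply_gc is an illustrative helper, not part of any library), the gradients are centered in place right before the optimizer update:

        import torch

        def apply_gc(model: torch.nn.Module) -> None:
            # Center every multi-dimensional gradient in place; 1-D gradients
            # (biases, normalization parameters) are left untouched.
            for p in model.parameters():
                if p.grad is not None and p.grad.dim() > 1:
                    dims = tuple(range(1, p.grad.dim()))
                    p.grad.sub_(p.grad.mean(dim=dims, keepdim=True))

        # Typical use inside a training step:
        #   loss.backward()
        #   apply_gc(model)
        #   optimizer.step()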

    Common Pitfalls

    Not suitable for all parameter types: bias vectors and other one-dimensional parameters should be excluded. The effect is less studied for very large models. Combining GC with weight decay can be partly redundant, since both constrain weight norms.

    Origin & History

    Yong et al. (2020) showed that this simple operation (subtracting the mean from the gradient) brings consistent improvements across diverse tasks. The paper "Gradient Centralization: A New Optimization Technique for Deep Neural Networks" was presented at ECCV 2020.

    Comparisons & Differences

    Gradient Centralization (GC) vs. Weight Decay

    Weight decay explicitly penalizes large weights by adding a term proportional to the weights to the gradient; GC regularizes weight norms implicitly by centering the gradient. The effects are related, but the mechanisms differ.
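    In the notation of the ECCV 2020 paper (for a weight vector w of dimension M and loss L), GC can be read as a projection of the gradient, while weight decay adds a penalty term; the LaTeX below is a compact restatement of that formulation:

        % GC: project the gradient onto the hyperplane orthogonal to e
        \Phi_{GC}(\nabla_{w}\mathcal{L}) = P\,\nabla_{w}\mathcal{L},\qquad
        P = I - e e^{\top},\qquad e = \tfrac{1}{\sqrt{M}}\mathbf{1}

        % Weight decay: add a term proportional to the weights
        \nabla_{w}\mathcal{L} \leftarrow \nabla_{w}\mathcal{L} + \lambda w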

    Gradient Centralization (GC) vs. Batch Normalization

    BN normalizes activations in the forward pass; GC centers gradients in the backward pass. Both stabilize training, but through different mechanisms.
