
    LARS (Layer-wise Adaptive Rate Scaling)

    Also known as:
    LARS Optimizer
    Layer-wise Adaptive Rate Scaling
    LARC (Layer-wise Adaptive Rate Control, a clipped variant)
    Updated: 2/12/2026

    Optimizer that combines SGD with layer-wise learning rate adaptation – enables stable training with large batch sizes for computer vision.

    Quick Summary

    LARS scales SGD updates per layer based on the ratio of weight norm to gradient norm – the standard choice for large-batch vision training (e.g. ResNet with batch size 32K).

    Explanation

    LARS computes a trust ratio per layer: the layer's weight norm divided by its gradient norm. Layers whose weights are large relative to their gradients take larger steps, and layers with relatively large gradients take smaller ones.
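
    A minimal sketch of one per-layer update, assuming the trust-ratio form from the paper with weight decay in the denominator (momentum omitted; the function name and hyperparameter values are illustrative):

```python
import numpy as np

def lars_layer_step(w, g, base_lr=0.1, eta=0.001, weight_decay=1e-4, eps=1e-9):
    """One LARS step for a single layer (sketch; momentum omitted).

    Trust ratio: eta * ||w|| / (||g|| + weight_decay * ||w||)
    """
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g)
    trust_ratio = eta * w_norm / (g_norm + weight_decay * w_norm + eps)
    local_lr = base_lr * trust_ratio          # per-layer effective learning rate
    return w - local_lr * (g + weight_decay * w)

# Toy usage with made-up shapes and magnitudes
w = np.random.randn(256, 128) * 0.05
g = np.random.randn(256, 128) * 0.001
w_new = lars_layer_step(w, g)
```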

    Marketing Relevance

    LARS enables vision training (e.g. ResNet) with batch sizes up to 32K without accuracy loss. It is the predecessor of LAMB.

    Common Pitfalls

    Built on SGD, so it has no second-moment (Adam-style) adaptivity; less suitable for NLP/Transformer training than LAMB. The trust ratio can become unstable for layers with very small weight or gradient norms (e.g. biases or batch-norm parameters).
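
    To illustrate the last pitfall, a common guard (in the spirit of LARC's clipping) falls back to a neutral ratio when a layer's norms are near zero; the threshold and clip values below are illustrative assumptions:

```python
import numpy as np

def safe_trust_ratio(w, g, eta=0.001, eps=1e-9, max_ratio=10.0):
    """Trust ratio with guards for tiny layers (e.g. biases, batch-norm params)."""
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g)
    if w_norm < eps or g_norm < eps:
        return 1.0  # skip layer-wise adaptation for near-zero norms
    return min(eta * w_norm / g_norm, max_ratio)  # clip to avoid huge steps
```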

    Origin & History

    You, Gitman & Ginsburg (2017) developed LARS at NVIDIA for large-batch training. The paper showed that layer-wise learning-rate scaling mitigates the "large batch problem." LARS later inspired LAMB, its Adam-based counterpart.

    Comparisons & Differences

    LARS (Layer-wise Adaptive Rate Scaling) vs. SGD with Momentum

    SGD uses a single global learning rate; LARS scales it per layer – enabling 10-100x larger batches without divergence.
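
    A toy comparison of the effective step size two layers would see, using made-up weight and gradient norms:

```python
base_lr, eta = 0.1, 0.001  # illustrative hyperparameters

# (||w||, ||g||) per layer -- made-up values for illustration
layers = {"conv1": (45.0, 1.2), "fc": (3.0, 0.9)}

for name, (w_norm, g_norm) in layers.items():
    sgd_lr = base_lr                           # same global LR for every layer
    lars_lr = base_lr * eta * w_norm / g_norm  # scaled by the layer's trust ratio
    print(f"{name}: SGD lr = {sgd_lr:.4f}, LARS lr = {lars_lr:.5f}")
```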
