LARS (Layer-wise Adaptive Rate Scaling)
Optimizer that combines SGD with layer-wise learning rate adaptation – enables stable training with large batch sizes for computer vision.
LARS rescales SGD updates per layer based on the ratio of weight norm to gradient norm – a standard choice for large-batch vision training (e.g. ResNet at batch size 32K).
Explanation
LARS computes a trust ratio per layer from the ratio of weight norm to gradient norm: layers whose weights are large relative to their gradients get larger steps, and vice versa. The global learning rate is multiplied by this ratio before the usual SGD-with-momentum update, as sketched below.
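To make the per-layer scaling concrete, here is a minimal NumPy sketch of one LARS step. The function name lars_step and the hyperparameter defaults are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def lars_step(weights, grads, velocities, base_lr=0.1, momentum=0.9,
              weight_decay=1e-4, trust_coeff=0.001, eps=1e-9):
    """One in-place LARS update over lists of per-layer weight/gradient arrays."""
    for w, g, v in zip(weights, grads, velocities):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Layer-wise trust ratio: trust_coeff * ||w|| / (||g|| + weight_decay * ||w||).
        if w_norm > 0.0 and g_norm > 0.0:
            trust_ratio = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + eps)
        else:
            trust_ratio = 1.0  # fall back to plain SGD scaling for degenerate layers
        local_lr = base_lr * trust_ratio
        # Heavy-ball momentum on the weight-decayed gradient, scaled by the local LR.
        v[:] = momentum * v + local_lr * (g + weight_decay * w)
        w[:] = w - v

if __name__ == "__main__":
    # Toy usage: two "layers" with random weights and gradients.
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(64, 64)), rng.normal(size=(64,))]
    grads = [rng.normal(size=(64, 64)), rng.normal(size=(64,))]
    velocities = [np.zeros_like(w) for w in weights]
    lars_step(weights, grads, velocities)
```

Note that a layer with a large weight norm but a small gradient norm gets a local learning rate well above the base rate, while a layer with noisy, large gradients is damped – the behavior described above.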
Marketing Relevance
LARS enables training vision models such as ResNet with batch sizes up to 32K without loss of accuracy. It is the predecessor of LAMB.
Common Pitfalls
Based on SGD, so it has no adaptive second-moment estimate as in Adam. Less suitable for NLP/Transformer training than LAMB. The trust ratio can be unstable for small layers (e.g. biases and BatchNorm parameters), as discussed after this section.
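A common mitigation is to guard the trust ratio against vanishing norms, cap it, and exclude bias and BatchNorm parameters from the scaling entirely. The sketch below uses the same assumed hyperparameters as above; the cap value max_ratio is illustrative.

```python
def safe_trust_ratio(w_norm, g_norm, trust_coeff=0.001, weight_decay=1e-4,
                     eps=1e-9, max_ratio=10.0):
    # Fall back to plain SGD scaling when a layer's weights or gradients vanish,
    # and clip the ratio so tiny layers (biases, BatchNorm) cannot blow up the step.
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0
    ratio = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + eps)
    return min(ratio, max_ratio)
```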
Origin & History
You, Gitman & Ginsburg (2017) developed LARS for large-batch training at NVIDIA. Their paper showed that layer-wise learning-rate scaling addresses the "large batch problem." LARS inspired LAMB for Adam-based optimizers.
Comparisons & Differences
LARS (Layer-wise Adaptive Rate Scaling) vs. SGD with Momentum
SGD uses a single global learning rate for all layers; LARS rescales it per layer via the trust ratio – enabling roughly 10-100x larger batches without divergence.
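In equation form (using the notation of You et al., 2017: global learning rate γ, momentum m, weight decay β, trust coefficient η), the contrast is:

```latex
% SGD with momentum: one global learning rate \gamma shared by all layers
v_{t+1} = m\, v_t + \gamma\, \nabla L(w_t), \qquad w_{t+1} = w_t - v_{t+1}

% LARS: a per-layer trust ratio \lambda^l rescales the step for layer l
\lambda^l = \eta\, \frac{\lVert w^l_t \rVert}{\lVert \nabla L(w^l_t) \rVert + \beta\, \lVert w^l_t \rVert},
\qquad
v^l_{t+1} = m\, v^l_t + \gamma\, \lambda^l \bigl( \nabla L(w^l_t) + \beta\, w^l_t \bigr),
\qquad
w^l_{t+1} = w^l_t - v^l_{t+1}
```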