LAMB (Layer-wise Adaptive Moments for Batch Training)
Optimizer for extremely large batch sizes (up to 64K) that adapts the learning rate per layer, enabling stable training with massive data parallelism.
LAMB adapts learning rates per layer for extremely large batches, enabling BERT training in 76 minutes instead of 3 days.
Explanation
LAMB scales each layer's update by the ratio of the layer's weight norm to the norm of its Adam-style update (the trust ratio). This lets the batch size grow enormously without losing training quality, which is ideal for fast pre-training runs.
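To make the mechanics concrete, the following is a minimal sketch of one LAMB step for a single layer's weights, assuming NumPy arrays for the weights and gradient; the function name lamb_step and the hyperparameter defaults are illustrative, not a reference implementation.

import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    # Adam-style first and second moments with bias correction (t starts at 1)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Adam update direction plus decoupled weight decay
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layer-wise trust ratio: ||w|| / ||update||
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    # Scale the step by the trust ratio before applying it
    w = w - lr * trust_ratio * update
    return w, m, v

The same step is applied independently to every layer (parameter group), which is what allows a single global learning rate to remain stable at very large batch sizes.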
Marketing Relevance
LAMB cut BERT pre-training from 3 days to 76 minutes, which makes it essential for cost-effective training on large GPU clusters.
Common Pitfalls
LAMB is only useful with very large batch sizes and offers no advantage over AdamW with small batches. Tuning learning rate and warmup schedules for very large batches can still be complex.
Origin & History
You et al. (2020) developed LAMB at Google to train BERT with a batch size of 64K. Training time dropped from 3 days to 76 minutes. LAMB combines Adam with a layer-wise trust ratio (inspired by LARS).
Comparisons & Differences
LAMB (Layer-wise Adaptive Moments for Batch Training) vs. AdamW
AdamW uses a global learning rate; LAMB additionally rescales each layer's update by its trust ratio. LAMB typically only pays off at batch sizes above roughly 8K.
LAMB (Layer-wise Adaptive Moments for Batch Training) vs. LARS
LARS is based on SGD + layer scaling; LAMB is based on Adam + layer scaling. LAMB works better for NLP, LARS for vision.
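To illustrate the difference, here is a hedged sketch of the base update each optimizer rescales per layer; the function names are illustrative and momentum and bias-correction details are simplified. Both optimizers then multiply their direction by the same layer-wise trust ratio before taking the step.

import numpy as np

def lars_direction(w, g, momentum_buf, mu=0.9, weight_decay=1e-4):
    # LARS: SGD-with-momentum direction (plus weight decay)
    momentum_buf = mu * momentum_buf + g + weight_decay * w
    return momentum_buf

def lamb_direction(w, m_hat, v_hat, eps=1e-6, weight_decay=0.01):
    # LAMB: Adam direction from bias-corrected moments, plus decoupled weight decay
    return m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

# Both optimizers then apply the same layer-wise scaling before the step:
#   trust_ratio = ||w|| / ||direction||
#   w <- w - lr * trust_ratio * direction

The practical consequence is that LAMB inherits Adam's per-parameter adaptivity, which tends to suit transformer-style NLP models, while LARS inherits SGD's behavior, which has worked well for large-batch vision training.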