Learning Rate Warmup
Training technique that ramps the learning rate from near zero up to the target value over the first training steps or epochs.
Warmup starts with a tiny learning rate and gradually increases it, which prevents loss spikes and divergence while the weights are still randomly initialized. It is a standard part of LLM training.
Explanation
Warmup prevents unstable training at the start, when the randomly initialized weights produce large, noisy gradients; applying the full learning rate to these early updates can push the model into a poor region of the loss landscape or make the loss diverge.
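A minimal sketch of the linear warmup rule described above; the names `base_lr`, `warmup_steps`, and `step` are illustrative, not tied to any specific library:

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate from ~0 up to base_lr over warmup_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr  # after warmup, hand over to the main schedule (constant, cosine, ...)
```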
Marketing Relevance
Warmup is essential for LLM pre-training, fine-tuning, and training with large batch sizes. A typical warmup duration is 1-5% of the total training steps.
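As a rough illustration of that 1-5% rule of thumb; the concrete numbers are assumptions you would tune per run:

```python
total_steps = 100_000       # e.g. total number of optimizer updates in the run
warmup_fraction = 0.03      # assumed value inside the typical 1-5% range
warmup_steps = int(total_steps * warmup_fraction)  # -> 3_000 warmup steps
```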
Common Pitfalls
A warmup that is too long wastes training budget; one that is too short can leave the instability it is meant to prevent. The appropriate warmup duration also scales with batch size: larger batches usually run at higher learning rates and therefore need a longer ramp.
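One hedged sketch of how warmup interacts with batch size, following the linear-scaling recipe popularized by Goyal et al. (scale the learning rate with the batch size and ramp up to it gradually); all concrete numbers here are assumptions:

```python
base_batch, base_lr = 256, 0.1                   # assumed reference configuration
batch_size = 2_048                               # larger batch used for this run
scaled_lr = base_lr * batch_size / base_batch    # linear scaling rule -> 0.8

warmup_epochs = 5                                # Goyal et al. warm up over ~5 epochs
steps_per_epoch = 625                            # assumed: ~1.28M samples / 2_048 per batch
warmup_steps = warmup_epochs * steps_per_epoch   # ramp 0 -> scaled_lr over 3_125 steps
```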
Origin & History
Goyal et al. (2017, Facebook) showed that warmup is essential for training with large batch sizes ("Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour"). It has been a standard component of LLM training recipes ever since.
Comparisons & Differences
Learning Rate Warmup vs. Cosine Annealing
Warmup increases the LR at the start; cosine annealing decreases it afterward. Together they form the standard schedule: warmup → cosine decay (a sketch of this combined schedule closes this entry).
Learning Rate Warmup vs. Constant Learning Rate
Without warmup, training at high LR can immediately diverge. Warmup gives the optimizer time to adapt to the loss landscape.
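A hedged sketch of the combined warmup-then-cosine schedule mentioned above, expressed as a PyTorch `LambdaLR` multiplier on the base learning rate; the model, step counts, and base LR are placeholders, not values from any particular recipe:

```python
import math
import torch

def warmup_cosine_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier on the base LR: linear warmup, then cosine decay toward 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Example wiring with a PyTorch optimizer (model and hyperparameters are placeholders).
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: warmup_cosine_factor(step, warmup_steps=2_000, total_steps=100_000),
)
# Inside the training loop, call optimizer.step() followed by scheduler.step().
```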