
    Learning Rate Warmup

    Also known as:
    LR Warmup
    Warm-Up Phase
    Gradual Warmup
    Updated: 2/10/2026

    A training technique that slowly ramps the learning rate from near zero to its target value over the first steps or epochs.

    Quick Summary

    Warmup starts with a tiny learning rate and gradually increases it to the target value, preventing training from diverging while weights are still randomly initialized. It is a standard component of LLM training.

    Explanation

    Warmup prevents unstable training at the start of a run, when randomly initialized weights produce large, noisy gradients; keeping the learning rate small for the first steps stops these early updates from derailing optimization.
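
    The schedule itself is simple. A minimal sketch in Python; the names target_lr and warmup_steps are illustrative, not from any particular library:

        def warmup_lr(step: int, target_lr: float, warmup_steps: int) -> float:
            """Linearly ramp the learning rate from ~0 to target_lr."""
            if step < warmup_steps:
                # Fraction of warmup completed; +1 avoids returning exactly 0.
                return target_lr * (step + 1) / warmup_steps
            return target_lr  # after warmup, the main schedule takes over

        # Usage: recompute the learning rate before each optimizer step.
        for step in range(10):
            lr = warmup_lr(step, target_lr=3e-4, warmup_steps=5)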

    Practical Relevance

    Warmup is essential for LLM training, fine-tuning, and training with large batch sizes. A typical warmup duration is 1-5% of total training steps.
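
    As a worked example with illustrative numbers, 3% warmup on a 100,000-step run:

        total_steps = 100_000            # length of the full training run
        warmup_fraction = 0.03           # within the typical 1-5% range
        warmup_steps = int(warmup_fraction * total_steps)  # -> 3,000 steps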

    Common Pitfalls

    A warmup phase that is too long wastes training budget, while one that is too short can leave early training unstable. Warmup duration should also scale with batch size: larger batches generally warrant longer warmup (see the sketch below).
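
    One simple heuristic consistent with that rule, assuming a known reference configuration (all numbers illustrative, and linear scaling is an assumption, not a universal prescription):

        base_batch, base_warmup = 256, 1_000   # reference run
        new_batch = 1_024                      # 4x larger batch
        # Scale the warmup length with the batch-size ratio.
        new_warmup = base_warmup * new_batch // base_batch  # -> 4,000 steps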

    Origin & History

    Goyal et al. (2017, Facebook) showed in "Accurate, Large Minibatch SGD" that warmup is essential for training with large batch sizes. It has been a standard component of LLM training recipes ever since.

    Comparisons & Differences

    Learning Rate Warmup vs. Cosine Annealing

    Warmup increases the learning rate at the start of training; cosine annealing decreases it afterward. Together they form the standard schedule: warmup followed by cosine decay.
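
    A sketch of this combined schedule using PyTorch's built-in schedulers; the toy model, learning rate, and step counts are placeholders:

        import torch

        model = torch.nn.Linear(10, 2)  # toy model for illustration
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

        warmup_steps, total_steps = 3_000, 100_000
        scheduler = torch.optim.lr_scheduler.SequentialLR(
            optimizer,
            schedulers=[
                # Phase 1: linear ramp from ~0 to the target learning rate.
                torch.optim.lr_scheduler.LinearLR(
                    optimizer, start_factor=1e-3, total_iters=warmup_steps),
                # Phase 2: cosine decay over the remaining steps.
                torch.optim.lr_scheduler.CosineAnnealingLR(
                    optimizer, T_max=total_steps - warmup_steps),
            ],
            milestones=[warmup_steps],
        )
        # Call scheduler.step() once after each optimizer.step().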

    Learning Rate Warmup vs. Constant Learning Rate

    Without warmup, training at a high constant learning rate can diverge immediately. Warmup gives the optimizer time to adapt to the loss landscape before the full learning rate is applied.
