Cyclical Learning Rate (CLR)
Learning rate schedule that cyclically varies the LR between a lower and an upper bound – prevents stagnation, helps the optimizer past saddle points, and was the predecessor of the one-cycle policy.
Explanation
The LR rises and falls in triangular, trapezoidal, or cosine-shaped cycles. Periodically increasing the LR can "push" the model out of sharp local minima and help it find better regions of the loss landscape.
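To make the triangular cycle concrete, here is a minimal sketch of the triangular policy as described in Smith's paper; the bounds and step_size values are illustrative, not recommendations:

```python
import math

def triangular_clr(step: int, base_lr: float, max_lr: float, step_size: int) -> float:
    """Triangular CLR: the LR oscillates linearly between base_lr and max_lr.

    step_size is the number of steps in a half cycle; all values used
    below are illustrative.
    """
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# LR climbs from 1e-4 to 1e-2 over 2000 steps, descends again, repeats.
for step in (0, 1000, 2000, 3000, 4000):
    print(step, triangular_clr(step, base_lr=1e-4, max_lr=1e-2, step_size=2000))
```

In PyTorch the same policy is available as torch.optim.lr_scheduler.CyclicLR, which also offers the 'triangular2' and 'exp_range' variants that shrink the cycle amplitude over time.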
Marketing Relevance
CLR was the predecessor of the one-cycle policy. Combined with the LR finder, it forms a very effective tuning strategy.
Common Pitfalls
Cycle length and LR range must be determined with the LR finder (LR range test); poorly chosen bounds slow or destabilize training. For LLM pre-training, CLR is less common than warmup followed by cosine decay.
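The LR range test itself is simple; below is a minimal sketch in which the model and data are hypothetical placeholders for your own setup: ramp the LR exponentially over one short run, record the loss per step, and pick bounds from where the loss starts falling steadily to just before it diverges.

```python
import torch
from torch import nn

model = nn.Linear(10, 1)  # hypothetical stand-in for the real model
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(200)]

optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)
loss_fn = nn.MSELoss()

# Multiply the LR by a fixed factor each step so it sweeps 1e-6 -> 1.0
# across the 200 test batches.
gamma = (1.0 / 1e-6) ** (1.0 / len(data))

history = []  # (lr, loss) pairs to inspect or plot afterwards
for inputs, targets in data:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    history.append((optimizer.param_groups[0]["lr"], loss.item()))
    for group in optimizer.param_groups:
        group["lr"] *= gamma
```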
Origin & History
Leslie Smith introduced CLR in "Cyclical Learning Rates for Training Neural Networks" (WACV 2017). The paper showed that periodically increasing the LR helps find better solutions and also introduced the LR range test (LR finder); Smith's later work built the one-cycle policy on these ideas.
Comparisons & Differences
Cyclical Learning Rate (CLR) vs. One-Cycle Policy
CLR has multiple cycles; one-cycle uses exactly one cycle for the entire training – more aggressive and often more effective.
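As a usage contrast, PyTorch's built-in one-cycle scheduler runs a single warmup-then-anneal cycle over a known number of steps; the model, optimizer, and hyperparameters below are illustrative:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)  # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# One cycle across the whole run: the LR warms up to max_lr during the
# first 30% of the 10,000 steps, then anneals far below its start value.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=10_000, pct_start=0.3
)
# scheduler.step() is then called once per batch, not once per epoch.
```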
Cyclical Learning Rate (CLR) vs. Cosine Annealing with Warm Restarts
CLR uses linear triangular cycles; SGDR uses cosine-shaped cycles with warm restarts, jumping the LR back to its maximum at each cycle boundary. Similar principle, different curve shape.
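For comparison with the triangular sketch above, here is the cosine cycle under the simplifying assumption of a fixed cycle length (SGDR also lets cycles grow by a multiplicative factor after each restart); the bounds are again illustrative:

```python
import math

def sgdr_lr(step: int, min_lr: float, max_lr: float, cycle_len: int) -> float:
    """Cosine cycle with warm restarts: the LR decays along a cosine
    curve and jumps back to max_lr at the start of each cycle."""
    t_cur = step % cycle_len
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t_cur / cycle_len))

# Near-max at step 0, midpoint halfway through, near-min just before
# the restart at step 2000.
for step in (0, 1000, 1999, 2000):
    print(step, round(sgdr_lr(step, min_lr=1e-4, max_lr=1e-2, cycle_len=2000), 5))
```

PyTorch ships both shapes: CyclicLR for the triangular policy and CosineAnnealingWarmRestarts for SGDR.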