Cosine Annealing
A learning rate schedule strategy that gently reduces the learning rate from a maximum value to near zero following a cosine curve.
Cosine annealing lowers the learning rate along a cosine curve. It is a standard schedule for LLM and vision-model training, and it is gentler than step decay.
Explanation
Cosine annealing reduces the learning rate more gently than step decay and keeps a very small rate available late in training, which enables fine-grained convergence. Warm restarts periodically reset the learning rate back to its maximum and start a new cosine cycle.
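The schedule above can be sketched in a few lines of pure Python. This is a minimal illustration, not a reference implementation; the function names and the `t_mult` default (each cycle twice as long as the last, as in common SGDR configurations) are assumptions for the example.

```python
import math

def cosine_annealing(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: the LR traces half a cosine wave from lr_max
    at step 0 down to lr_min at total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

def cosine_with_warm_restarts(step, cycle_len, lr_max, lr_min=0.0, t_mult=2):
    """SGDR-style schedule: run cosine annealing within a cycle, then
    restart at lr_max; each new cycle is t_mult times longer."""
    while step >= cycle_len:
        step -= cycle_len      # move into the next cycle
        cycle_len *= t_mult    # cycles grow geometrically
    return cosine_annealing(step, cycle_len, lr_max, lr_min)
```

For example, with `total_steps=100` and `lr_max=0.1`, the plain schedule gives 0.1 at step 0, 0.05 at the midpoint, and 0.0 at step 100; the warm-restart variant jumps back to 0.1 at step 100 and begins a longer cycle.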
Marketing Relevance
Cosine annealing is the de facto standard schedule for LLM pre-training and vision-model training; most modern training recipes use it.
Common Pitfalls
The total number of training steps must be known in advance. Warm restarts introduce a cycle-length hyperparameter that needs tuning. And cosine annealing is not always better than simple linear decay.
Origin & History
Loshchilov & Hutter (2017) introduced SGDR (SGD with Warm Restarts), combining cosine annealing with periodic restarts. The Chinchilla paper (2022) used cosine decay for optimal LLM training. Standard since then.
Comparisons & Differences
Cosine Annealing vs. Step Decay
Step decay reduces LR abruptly at fixed intervals; cosine annealing lowers it smoothly and continuously.
Cosine Annealing vs. Linear Decay
Linear decay lowers the learning rate at a constant rate; cosine annealing decreases more slowly at the start and end and faster in the middle, so it keeps the learning rate higher for longer in early training.
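The two comparisons above can be made concrete by evaluating all three schedules at a few checkpoints. This is a toy sketch; the horizon of 1000 steps, the base rate of 0.1, and the step-decay settings (drop by 10x every 300 steps) are illustrative assumptions, not values from the source.

```python
import math

def cosine(step, total, lr0):
    # Smooth, continuous decay from lr0 to 0 along a half cosine wave.
    return 0.5 * lr0 * (1 + math.cos(math.pi * step / total))

def linear(step, total, lr0):
    # Uniform decay from lr0 to 0.
    return lr0 * (1 - step / total)

def step_decay(step, lr0, drop=0.1, every=300):
    # Abrupt 10x drops at fixed intervals.
    return lr0 * (drop ** (step // every))

total, lr0 = 1000, 0.1
for s in (0, 250, 500, 750, 1000):
    print(f"step {s:4d}: cosine={cosine(s, total, lr0):.4f}  "
          f"linear={linear(s, total, lr0):.4f}  "
          f"step={step_decay(s, lr0):.6f}")
```

At step 250 the cosine schedule (~0.0854) sits above linear (0.0750); the two meet exactly at the midpoint (0.05); by step 750 cosine (~0.0146) is below linear (0.0250), while step decay has already jumped discontinuously at steps 300 and 600.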